Per-field confidence
Wrap each field in a confidence carrier so a downstream check can decide what needs review. The schema is identical to the one in data labeling; the routing logic is the part that lives here.
Route on low confidence
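A minimal sketch of a confidence carrier and the routing walk, using only the standard library. The names `Confident`, `Invoice`, `fields_needing_review`, and the 0.8 threshold are assumptions made for illustration; they are not the schema from the data labeling guide.

```python
from dataclasses import dataclass, fields
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Confident(Generic[T]):
    """Hypothetical carrier: an extracted value plus a confidence score."""
    value: Optional[T]
    confidence: float  # expected range: 0.0 to 1.0


@dataclass
class Invoice:
    """Hypothetical extraction result; every field is wrapped in the carrier."""
    vendor: Confident[str]
    total: Confident[float]
    due_date: Confident[str]


def fields_needing_review(doc, threshold: float = 0.8) -> list[str]:
    """Return the names of fields whose confidence is below the threshold."""
    return [
        f.name
        for f in fields(doc)
        if getattr(doc, f.name).confidence < threshold
    ]


invoice = Invoice(
    vendor=Confident("Acme GmbH", 0.97),
    total=Confident(1249.50, 0.62),
    due_date=Confident("2025-03-01", 0.91),
)

low = fields_needing_review(invoice)
# Any low-confidence field sends the whole document to a reviewer.
route = "human_review" if low else "auto_accept"
```

Routing on the list of field names, rather than a single document-level score, lets a reviewer see exactly which values to check.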
The trigger is plain Python. Walk the fields, find anything below the threshold, and decide what to do with it.
Gate the next action with requires_confirmation
For a tighter loop, wrap the downstream action (the database write, the ERP push) in a tool that requires approval, and only invoke it when confidence is high. The agent pauses on low confidence and a human can release the run.
run.run_id is persisted in db, so the approval can come from a different process minutes or hours later. See human approval for the full surface, including async variants and listing pending approvals from the database.
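The pause-and-release pattern can be sketched without the framework. The example below is not the library's API: `gated_push`, `approve`, the `pending` table, and the 0.8 threshold are hypothetical stand-ins that only show why persisting a run identifier lets the approval arrive later.

```python
import json
import sqlite3
import uuid

THRESHOLD = 0.8  # assumed cutoff for "high confidence"

# Stand-in for the run store. In-memory here for brevity; approving from a
# different process would need a file-backed or server database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pending (run_id TEXT PRIMARY KEY, payload TEXT)")

written = []  # stand-in for the database write or ERP push


def push_to_erp(payload: dict) -> None:
    written.append(payload)


def gated_push(payload: dict, min_confidence: float) -> dict:
    """Write immediately on high confidence; otherwise park the run."""
    if min_confidence >= THRESHOLD:
        push_to_erp(payload)
        return {"status": "completed", "run_id": None}
    run_id = str(uuid.uuid4())
    db.execute("INSERT INTO pending VALUES (?, ?)", (run_id, json.dumps(payload)))
    db.commit()
    return {"status": "paused", "run_id": run_id}


def approve(run_id: str) -> bool:
    """Release a paused run by id; returns False if nothing is pending."""
    row = db.execute(
        "SELECT payload FROM pending WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None:
        return False
    push_to_erp(json.loads(row[0]))
    db.execute("DELETE FROM pending WHERE run_id = ?", (run_id,))
    db.commit()
    return True


auto = gated_push({"invoice": "A-1", "total": 120.0}, min_confidence=0.95)
held = gated_push({"invoice": "A-2", "total": 980.0}, min_confidence=0.55)
released = approve(held["run_id"])
```

Because the pending payload is keyed by run id in the store, the approver needs only that id, not the original process.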
Accuracy against a golden set
Confidence routes individual documents. Eval tells you whether the system as a whole is still extracting what it should. Build a small golden set (50 to a few hundred labeled documents) and grade the agent against it.
AccuracyEval runs the agent num_iterations times against the same input, asks a grader model to score each run against the expected output, and reports the average. Loop the call over your golden set to get a per-document score.
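The shape of that loop, sketched with stand-ins rather than the real eval class: `fake_extract` plays the agent, `exact_match_grade` plays the grader model, and the two-item golden set is invented for illustration. Only the structure (repeat, grade, average, loop over documents) is taken from the description above.

```python
from statistics import mean
from typing import Callable


def average_score(
    extract: Callable[[str], dict],
    grade: Callable[[dict, dict], float],
    document: str,
    expected: dict,
    num_iterations: int = 3,
) -> float:
    """Run the extractor num_iterations times on one input and average the grades."""
    return mean(grade(extract(document), expected) for _ in range(num_iterations))


def exact_match_grade(actual: dict, expected: dict) -> float:
    """Toy grader: fraction of expected fields reproduced exactly (0.0 to 1.0)."""
    hits = sum(actual.get(key) == value for key, value in expected.items())
    return hits / len(expected)


def fake_extract(document: str) -> dict:
    """Stand-in for the agent: parses 'key=value;key=value' text."""
    return dict(pair.split("=", 1) for pair in document.split(";"))


# Hypothetical golden set: (input document, expected extraction).
golden_set = [
    ("vendor=Acme;total=120", {"vendor": "Acme", "total": "120"}),
    ("vendor=Globex;total=99", {"vendor": "Globex", "total": "100"}),
]

per_document = [
    average_score(fake_extract, exact_match_grade, doc, expected)
    for doc, expected in golden_set
]
overall = mean(per_document)
```

Repeating each input matters with a real model because outputs vary between runs; the deterministic stand-in here returns the same grade every time.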
Log the scores to the same db you use for runs, and you have a regression signal. A drop in average score after a model swap or prompt change tells you the new configuration is worse before it reaches production. See the evals cookbook for db_logging and the team variant.
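Turning stored scores into a pass/fail signal can be as small as the sketch below. The function name, the 0.02 tolerance, and the score lists are assumptions for illustration; real scores would come from wherever the eval results are logged.

```python
from statistics import mean


def is_regression(
    baseline: list[float], candidate: list[float], tolerance: float = 0.02
) -> bool:
    """Flag the candidate if its average falls below the baseline by more than tolerance."""
    return mean(candidate) < mean(baseline) - tolerance


# Hypothetical per-document scores for two configurations on one golden set.
baseline_scores = [0.9, 1.0, 0.8]
candidate_scores = [0.9, 0.7, 0.8]

blocked = is_regression(baseline_scores, candidate_scores)
```

The tolerance keeps ordinary run-to-run noise from failing a CI job; how large it should be depends on how much the grader's scores vary.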
Two patterns, one job
| Pattern | What it answers | When it fires |
|---|---|---|
| Confidence routing | “Which fields on this document need a human?” | Every run, per document |
| Approval-gated tools | “Should we let the agent take the next action?” | At a specific tool boundary |
| AccuracyEval over a golden set | “Is the extractor still as accurate as last week?” | On CI, after a prompt or model change, on a schedule |
Next steps
| Task | Guide |
|---|---|
| Schedule the eval to run nightly | Batch and durability |
| Approve from an external UI | Human approval |
| Add a two-labeler review step | Quality pipeline |