How I Test Agent Outputs Without Spending All Day Reviewing

Agent outputs look plausible even when they are wrong. That is the whole problem. If you try to review every output by hand, you drown. I built a layered verification setup that catches most bugs without me reading every generation. Here is the pattern I use across every agent pipeline I ship.

The Three Layers

1. Schema validation: does the output fit the expected shape?
2. Rule assertions: do required tokens appear and forbidden tokens stay out?
3. Eval sampling: does a critic model flag problems in a random 5 percent of outputs?

Each layer is cheap on its own. Together they catch the vast majority of failures without me reading every generation by hand.
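One way to wire the layers together is to run the cheapest gate first and stop at the first failure. The sketch below is a minimal illustration, not the exact pipeline: `run_gates`, `looks_like_json`, and `not_empty_body` are hypothetical names, and the two stand-in gates are deliberately trivial.

```python
from typing import Callable, List, Optional, Tuple

# A gate takes the raw output and returns None on success,
# or a short failure reason on rejection.
Gate = Callable[[str], Optional[str]]


def run_gates(output: str, gates: List[Tuple[str, Gate]]) -> Optional[str]:
    """Run gates in order; return 'name: reason' for the first failure, else None."""
    for name, gate in gates:
        reason = gate(output)
        if reason is not None:
            return f"{name}: {reason}"
    return None


# Hypothetical stand-in gates, for illustration only.
def looks_like_json(output: str) -> Optional[str]:
    return None if output.strip().startswith("{") else "not a JSON object"


def not_empty_body(output: str) -> Optional[str]:
    return "empty body" if '"body": ""' in output else None


# Cheapest checks go first so expensive ones only see plausible outputs.
GATES = [("schema", looks_like_json), ("rules", not_empty_body)]
```

Ordering the gates by cost means the critic model, the only layer that costs real money, only ever sees outputs that already passed the free checks.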

Schema First

python

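from pydantic import BaseModel, ValidationError

class ArticleDraft(BaseModel):
    title: str
    slug: str
    body: str
    word_count: int

# Example stand-in for the raw JSON string returned by the agent.
llm_output = '{"title": "Hello", "slug": "hello", "body": "Text", "word_count": 1}'

try:
    draft = ArticleDraft.model_validate_json(llm_output)
except ValidationError as err:
    # Missing fields, wrong types, and truncated JSON all land here.
    raise SystemExit(f"Rejected by schema gate: {err}")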

Schema validation catches maybe 40 percent of failures instantly, at effectively zero cost. Missing fields, wrong types, obviously truncated outputs: all rejected before any human looks at them. For most pipelines this is the single highest-return check you can add.

Rules Catch the Obvious

I keep a file of regex rules per content type. No em dashes. Required link count. Min and max word counts. These run on every output and block publishing if they fail. Rules are boring, mechanical, and surprisingly effective because most agent failures are boring and mechanical.
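A rough sketch of what such a rule check might look like, covering the three rules named above. The thresholds (`MIN_WORDS`, `MAX_WORDS`, `REQUIRED_LINKS`) and the `check_rules` name are assumptions for illustration; real values would live in the per-content-type rules file.

```python
import re
from typing import List

# Hypothetical rule values; tune these per content type.
MIN_WORDS = 5
MAX_WORDS = 2000
REQUIRED_LINKS = 1

EM_DASH = re.compile("\u2014")
LINK = re.compile(r"https?://\S+")


def check_rules(body: str) -> List[str]:
    """Return a list of rule violations; an empty list means the draft may publish."""
    failures = []
    if EM_DASH.search(body):
        failures.append("contains em dash")
    link_count = len(LINK.findall(body))
    if link_count < REQUIRED_LINKS:
        failures.append(f"needs {REQUIRED_LINKS} link(s), found {link_count}")
    words = len(body.split())
    if not MIN_WORDS <= words <= MAX_WORDS:
        failures.append(f"word count {words} outside {MIN_WORDS}-{MAX_WORDS}")
    return failures
```

Returning every violation, rather than stopping at the first, makes the failure log more useful when one bad prompt change breaks several rules at once.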

Sampling for the Rest

The final layer is a critic model reviewing a sample. The critic cannot catch every failure, but it finds the drift that schema and rules miss. Five percent sampling is enough to notice patterns before they compound into a library-wide problem. I draw a fresh random five percent on each run so recurring failures cannot hide in a blind spot.
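Drawing the sample can be as simple as the sketch below. It assumes outputs are tracked by some string ID; `pick_sample` and its parameters are hypothetical names. The `seed` argument exists only to make a run reproducible when debugging; leave it unset in normal use so each run picks a different subset.

```python
import random
from typing import List, Optional, Sequence


def pick_sample(output_ids: Sequence[str], rate: float = 0.05,
                seed: Optional[int] = None) -> List[str]:
    """Pick a random subset of outputs for critic review."""
    if not output_ids:
        return []
    rng = random.Random(seed)
    # Always review at least one output, even in a tiny batch.
    k = max(1, round(len(output_ids) * rate))
    return rng.sample(list(output_ids), k)
```

With a batch of 200 outputs and the default rate, this hands ten outputs to the critic.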

The Critic Prompt Pattern

plaintext

You are a strict reviewer. For the output below,
return PASS or FAIL and one sentence of reasoning.
Fail if: voice is wrong, facts look wrong, structure is broken.

Keep the critic prompt short. Long critic prompts tend to produce long critiques that do not correlate with actual failures.
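Because the prompt asks for PASS or FAIL up front, the reply is easy to parse mechanically. Here is a minimal sketch of that parsing step. `call_model` is a hypothetical stand-in for whatever model client you use, and treating any unparseable reply as a failure is my assumption about the safest default.

```python
from typing import Callable, Tuple

CRITIC_PROMPT = (
    "You are a strict reviewer. For the output below,\n"
    "return PASS or FAIL and one sentence of reasoning.\n"
    "Fail if: voice is wrong, facts look wrong, structure is broken.\n"
)


def review(output: str, call_model: Callable[[str], str]) -> Tuple[bool, str]:
    """Send the prompt plus output to the critic and parse its verdict.

    call_model is a hypothetical callable: prompt string in, reply string out.
    """
    reply = call_model(f"{CRITIC_PROMPT}\n{output}").strip()
    parts = reply.split(None, 1)
    word = parts[0].upper().rstrip(".:,") if parts else ""
    if word not in ("PASS", "FAIL"):
        # Fail closed: a reply we cannot parse should not count as a pass.
        return False, f"unparseable critic reply: {reply!r}"
    reasoning = parts[1].strip() if len(parts) > 1 else ""
    return word == "PASS", reasoning
```

Passing the model call in as a function keeps the parsing testable with a stub, so the critic layer itself can be checked without spending tokens.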

The Combined Effect

Layered gates turn a one-hour hand review into a fifteen-minute sampling pass. That saving is what lets a solo operator ship production agent work without quality collapse. Without these layers, failures would be silent and compounding. With them, I see problems the week they start rather than the month after.