How I Test Agent Outputs Without Spending All Day Reviewing
Agent outputs look plausible even when they are wrong. That is the whole problem. If you try to review every output by hand, you drown. I built a layered verification setup that catches most bugs without me reading every generation. Here is the pattern I use across every agent pipeline I ship.
The Three Layers
- Schema validation: does the output fit the expected shape?
- Rule assertions: do required tokens appear, forbidden tokens stay out?
- Eval sampling: spot-check 5 percent with a critic model
Each layer is cheap on its own. Together they catch the vast majority of failures without me reading a single generation by hand.
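One way to wire the layers together is to run the cheapest checks first and stop at the first failure. This is only a sketch: `run_gates`, `looks_like_json`, and `has_body_field` are hypothetical names standing in for real schema and rule checks, not code from the pipeline described here.

```python
from typing import Callable, Optional

# A gate takes the raw output and returns an error message, or None if it passes.
Gate = Callable[[str], Optional[str]]


def run_gates(output: str, gates: list[tuple[str, Gate]]) -> Optional[str]:
    """Run gates in order and stop at the first failure."""
    for name, gate in gates:
        error = gate(output)
        if error is not None:
            return f"{name}: {error}"
    return None


# Hypothetical stand-ins for the real schema and rule checks.
def looks_like_json(output: str) -> Optional[str]:
    return None if output.strip().startswith("{") else "not a JSON object"


def has_body_field(output: str) -> Optional[str]:
    return None if '"body"' in output else "missing body field"


gates = [("schema", looks_like_json), ("rules", has_body_field)]
print(run_gates('{"title": "x"}', gates))  # rules: missing body field
print(run_gates("oops", gates))            # schema: not a JSON object
```

Ordering the gates by cost means the critic model only ever sees outputs that have already passed the free checks.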
Schema First
from pydantic import BaseModel

class ArticleDraft(BaseModel):
    title: str
    slug: str
    body: str
    word_count: int

# llm_output is the raw JSON string returned by the model
draft = ArticleDraft.model_validate_json(llm_output)

Schema validation catches maybe 40 percent of failures instantly at zero cost. Missing fields, wrong types, obviously truncated outputs: all rejected before any human looks at them. For most pipelines this is the single highest-return check you can add.
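To turn a schema failure into a rejection instead of a crash, the validation call can be wrapped so that bad output returns `None`. The sketch below assumes pydantic v2 (where `model_validate_json` and `ValidationError.error_count()` exist); `parse_draft` is a hypothetical helper name.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class ArticleDraft(BaseModel):
    title: str
    slug: str
    body: str
    word_count: int


def parse_draft(llm_output: str) -> Optional[ArticleDraft]:
    """Return a validated draft, or None if the output does not fit the schema."""
    try:
        return ArticleDraft.model_validate_json(llm_output)
    except ValidationError as exc:
        # Covers missing fields, wrong types, and JSON cut off mid-string.
        print(f"rejected: {exc.error_count()} schema error(s)")
        return None


good = '{"title": "T", "slug": "t", "body": "Hello world", "word_count": 2}'
truncated = '{"title": "T", "slug": "t", "body": "Hello'

print(parse_draft(good) is not None)
print(parse_draft(truncated) is None)
```

A truncated generation usually produces invalid JSON, so it is rejected by the same `except` branch as a missing field.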
Rules Catch the Obvious
I keep a file of regex rules per content type. No em dashes. Required link count. Min and max word counts. These run on every output and block publishing if they fail. Rules are boring, mechanical, and surprisingly effective because most agent failures are boring and mechanical.
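A minimal sketch of what such a rule file might look like for one content type. The thresholds, the link pattern, and the `check_rules` name are illustrative assumptions, not the actual rules used here.

```python
import re

# Hypothetical thresholds for one content type; tune per pipeline.
MIN_WORDS = 5
MAX_WORDS = 2000
REQUIRED_LINKS = 1

EM_DASH = re.compile("\u2014")
LINK = re.compile(r"https?://\S+")


def check_rules(body: str) -> list[str]:
    """Return a list of rule violations; an empty list means publishable."""
    failures = []
    if EM_DASH.search(body):
        failures.append("contains an em dash")
    if len(LINK.findall(body)) < REQUIRED_LINKS:
        failures.append(f"needs at least {REQUIRED_LINKS} link(s)")
    words = len(body.split())
    if not MIN_WORDS <= words <= MAX_WORDS:
        failures.append(f"word count {words} outside {MIN_WORDS}-{MAX_WORDS}")
    return failures


print(check_rules("Read the full guide at https://example.com today please"))
print(check_rules("Too short \u2014 no link"))
```

Returning every violation at once, instead of stopping at the first, makes it easier to see which rules a given agent trips most often.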
Sampling for the Rest
The final layer is a critic model reviewing a sample. The critic cannot catch every failure but it finds the drift that schema and rules miss. Five percent sampling is enough to notice patterns before they compound into a library-wide problem. I rotate which five percent randomly so patterns cannot hide in a blind spot.
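Drawing a fresh random sample for each batch can be sketched with the standard library. `pick_sample` is a hypothetical helper, and the floor of one item per batch is an added assumption so that small batches still get reviewed.

```python
import random
from typing import Optional


def pick_sample(
    output_ids: list[str], rate: float = 0.05, seed: Optional[int] = None
) -> list[str]:
    """Pick a fresh random subset of outputs for critic review."""
    if not output_ids:
        return []
    rng = random.Random(seed)
    # Assumption: always review at least one item so small batches are not skipped.
    k = max(1, round(len(output_ids) * rate))
    return rng.sample(output_ids, k)


ids = [f"draft-{i}" for i in range(200)]
sample = pick_sample(ids)
print(len(sample))  # 10
```

Leaving `seed` unset gives a different subset on every run, which is the point of rotating the sample; passing a seed is only useful for reproducing a specific review batch.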
The Critic Prompt Pattern
You are a strict reviewer. For the output below,
return PASS or FAIL and one sentence of reasoning.
Fail if: voice is wrong, facts look wrong, structure is broken.

Keep the critic prompt short. Long critic prompts start generating long critiques that do not correlate with actual failures.
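Because the prompt asks for PASS or FAIL up front, the reply can be parsed mechanically. In this sketch `call_critic` is a hypothetical callable that sends a prompt to whatever model acts as the critic and returns its text reply; no particular client library is assumed. Treating an unparseable reply as a failure is a design choice added here.

```python
from typing import Callable

CRITIC_PROMPT = (
    "You are a strict reviewer. For the output below,\n"
    "return PASS or FAIL and one sentence of reasoning.\n"
    "Fail if: voice is wrong, facts look wrong, structure is broken.\n\n"
)


def review(output: str, call_critic: Callable[[str], str]) -> tuple[bool, str]:
    """Send one output to the critic and parse its verdict."""
    reply = call_critic(CRITIC_PROMPT + output).strip()
    parts = reply.split(maxsplit=1)
    verdict = parts[0].strip(".:,-").upper() if parts else ""
    reasoning = parts[1] if len(parts) > 1 else ""
    if verdict not in ("PASS", "FAIL"):
        # Treat an unparseable reply as a failure so it gets a human look.
        return False, f"unparseable critic reply: {reply!r}"
    return verdict == "PASS", reasoning


# A fake critic stands in for the real model call.
def fake_critic(prompt: str) -> str:
    return "FAIL: The voice is too formal."


print(review("Some draft text", fake_critic))
```

Passing the critic in as a function keeps the parsing logic testable without any network calls.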
The Combined Effect
Layered gates turn a one-hour hand review into a fifteen-minute sampling pass. That saving is what lets a solo operator ship production agent work without quality collapse. Without these layers, failures would be silent and compounding. With them, I see problems the week they start rather than the month after.