Agent Failure Recovery Patterns: Designing for When Agents Break
Every production agent I have shipped has failed at least once. The question that matters is not whether your agent can fail. It is what your system looks like after.
The Four Failure Classes
- Clean abort: agent raised, no side effects
- Partial write: agent wrote some artifacts, then failed
- Silent drift: agent succeeded but produced wrong output
- Cascade: one agent failing breaks another downstream
Each class needs a different recovery pattern.
Recovery for Clean Aborts
Retry with exponential backoff. Log the error. Move on. This class is easy.
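A minimal sketch of retry with exponential backoff. The function name, the default delays, and the small jitter term are assumptions made for illustration, not details from the pattern described above. It is only safe when the failed call left no side effects.

```python
import logging
import random
import time

logger = logging.getLogger("agent.retry")


def retry_with_backoff(fn, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is exhausted.

    Intended for clean aborts only: fn must leave no side effects when it raises.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Log the error on every failed attempt.
            logger.warning("attempt %d/%d failed: %r", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise  # out of retries: surface the last error to the caller
            # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter (an assumption of this sketch) so that many
            # agents failing together do not all retry at the same instant.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists so the waits can be replaced in tests; production callers can leave it at the default.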
Recovery for Partial Writes
Every write is wrapped in a staging-commit pattern. Writes go to a staging location first. Only after all writes succeed does the agent commit to the final location. If the agent dies mid-write, the staging location is garbage-collected. No partial state is visible to consumers.
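import shutil
staging = f'/tmp/agent-{run_id}/'
try:
    ...  # write all artifacts to staging
    shutil.move(staging, final)  # on success: commit to the final location
except Exception:
    shutil.rmtree(staging, ignore_errors=True)  # on failure: discard staging
    raise
Recovery for Silent Drift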
This is the hardest. The agent returned success and the output is wrong. The only defense is a verification step that the coordinator runs after the agent reports done. No verification, no trust.
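One way the coordinator-side verification could look, as a sketch. `AgentResult`, `verify`, and the two example checks are hypothetical names, and the checks assume an agent that summarizes a set of documents; real checks depend entirely on what the agent produces.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AgentResult:
    status: str   # what the agent itself reported, e.g. "success"
    output: dict  # the artifact the agent produced


# A check returns an error message, or None when the output passes.
Check = Callable[[dict], Optional[str]]


def verify(result: AgentResult, checks: List[Check]) -> List[str]:
    """Run every check against the output and return the list of failures."""
    if result.status != "success":
        return [f"agent reported status {result.status!r}"]
    failures = []
    for check in checks:
        message = check(result.output)
        if message is not None:
            failures.append(message)
    return failures


# Hypothetical checks for a document-summarizing agent.
def has_summary(output: dict) -> Optional[str]:
    return None if output.get("summary") else "summary is missing or empty"


def cites_known_sources(output: dict) -> Optional[str]:
    unknown = set(output.get("cited_ids", [])) - set(output.get("input_ids", []))
    return f"cites unknown sources: {sorted(unknown)}" if unknown else None
```

An empty list from `verify` means the output passed every check. Anything else is treated as a failure even though the agent reported success.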
Recovery for Cascades
Circuit breakers. If one agent fails three times in a row, the coordinator stops spawning downstream agents that depend on it. The failure is contained at the earliest point rather than propagating through the graph.
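A small sketch of that breaker, assuming the coordinator knows each agent's direct upstream dependencies by name. The class and method names are illustrative, and resetting the counter on any success is a choice of this sketch.

```python
class CircuitBreaker:
    """Trips for an agent after `threshold` consecutive failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = {}  # agent name -> current failure streak

    def record(self, agent, succeeded):
        """Record the outcome of one run; any success resets the streak."""
        if succeeded:
            self.consecutive_failures[agent] = 0
        else:
            self.consecutive_failures[agent] = (
                self.consecutive_failures.get(agent, 0) + 1
            )

    def is_open(self, agent):
        """True once the agent has failed `threshold` times in a row."""
        return self.consecutive_failures.get(agent, 0) >= self.threshold

    def can_spawn(self, depends_on):
        """A downstream agent may run only if no upstream dependency is tripped."""
        return not any(self.is_open(dep) for dep in depends_on)
```

The coordinator would call `record` after every run and `can_spawn` before starting each downstream agent. This sketch checks direct dependencies only.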
The Observability Minimum
For every agent run I log: run_id, start_time, end_time, status, input_hash, output_hash, tokens_used, cost_usd, error_class. Without these fields I cannot diagnose failures a week later. With them I can build dashboards that reveal the pattern.
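The fields above map naturally onto a single structured record. This is a sketch: the field types, the choice of SHA-256 over canonical JSON for the hashes, and the helper names are assumptions, not details from the setup described.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunRecord:
    run_id: str
    start_time: float  # Unix timestamp, seconds
    end_time: float
    status: str
    input_hash: str
    output_hash: str
    tokens_used: int
    cost_usd: float
    error_class: Optional[str] = None  # None when the run succeeded


def content_hash(payload) -> str:
    """SHA-256 of a JSON-serializable payload, with keys sorted so that
    two payloads with the same content always hash the same."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def to_log_line(record: RunRecord) -> str:
    """Serialize one run as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Matching `input_hash` values across runs with different `output_hash` values are one signal worth plotting, since they show the same input producing different results.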
The Replay Pattern
Every failed run is replayable from its inputs. The agent reads from a snapshot of its input state, not from live data. This lets me fix a bug and re-run the exact scenario that failed. AWS dead-letter queue (DLQ) patterns are the model I borrow from.
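A sketch of snapshot-and-replay, assuming the inputs are JSON-serializable and that one file per run on local disk is acceptable. The function names and the storage layout are hypothetical.

```python
import json
from pathlib import Path


def snapshot_inputs(run_id, inputs, snapshot_dir):
    """Freeze the inputs for a run before the agent starts.

    Returns the path of the snapshot file.
    """
    path = Path(snapshot_dir) / f"{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(inputs, sort_keys=True))
    return path


def run_from_snapshot(agent_fn, snapshot_path):
    """Run (or re-run) the agent against the frozen inputs, never live data."""
    inputs = json.loads(Path(snapshot_path).read_text())
    return agent_fn(inputs)
```

The first run and every replay go through `run_from_snapshot`, so a fixed version of `agent_fn` sees exactly the inputs that triggered the original failure.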
Design for failure. Agents will give you plenty of opportunities to practice.