Agent Failure Recovery Patterns: Designing for When Agents Break

Every production agent I have shipped has failed at least once. The question that matters is not whether your agent can fail. It is what your system looks like after.

The Four Failure Classes

Clean abort: agent raised, no side effects
Partial write: agent wrote some artifacts, then failed
Silent drift: agent succeeded but produced wrong output
Cascade: one agent failing breaks another downstream

Each class needs a different recovery pattern.

Recovery for Clean Aborts

Retry with exponential backoff. Log the error. Move on. This class is easy.

Recovery for Partial Writes

Every write wraps in a staging-commit pattern. Writes go to a staging location first. Only after all writes succeed does the agent commit to the final location. If the agent dies mid-write, the staging location is garbage-collected. No partial state visible to consumers.

python

staging = f'/tmp/agent-{run_id}/'
# write all artifacts to staging
# on success: shutil.move(staging, final)
# on failure: shutil.rmtree(staging)

Recovery for Silent Drift

This is the hardest. The agent returned success and the output is wrong. The only defense is a verification step that the coordinator runs after the agent reports done. No verification, no trust.

Recovery for Cascades

Circuit breakers. If one agent fails three times in a row, the coordinator stops spawning downstream agents that depend on it. The failure is contained at the earliest point rather than propagating through the graph.

The Observability Minimum

For every agent run I log: run_id, start_time, end_time, status, input_hash, output_hash, tokens_used, cost_usd, error_class. Without these fields I cannot diagnose failures a week later. With them I can build dashboards that reveal the pattern.

The Replay Pattern

Every failed run is replayable from its inputs. The agent reads from a snapshot of its input state, not from live data. This lets me fix a bug and re-run the exact scenario that failed. AWS DLQ patterns are the model I borrow from.

Design for failure. Agents will give you plenty of opportunities to practice.