What Breaks In Production Agents

Every agent I have put in production has broken at some point. The breakages are not random. They cluster around three failure modes.

The business problem is trust. If a client has to babysit an agent, they will cancel the contract. Agents earn their keep by running unattended. Failures that surface loudly are the whole point.

Failure one: tool timeouts

Your agent calls an API. The API is slow that day. The default timeout is way too long. The agent hangs. The user waits. The log says nothing.

Fix: every tool call wrapped in an explicit timeout shorter than the agent step budget.

python

import requests
def safe_call(url, timeout=10):
    try:
        return requests.get(url, timeout=timeout).json()
    except requests.Timeout:
        return {'error': 'timeout', 'url': url}

Failure two: fuzzy stop conditions

The agent is told to stop when the task is done. The LLM cannot decide when done is done. The loop runs to max steps. Your bill doubles.

Fix: stop conditions must be checkable by code. Queue empty. File written. HTTP 200 received.

Failure three: unhandled errors poisoning the trajectory

One tool raises. The next step receives a traceback as context. The agent tries to reason about it and makes things worse.

Fix: every error converted to a short structured string before it goes back into the agent's history. Never let raw stack traces into context.

The watchlist

Max steps hard cap
Max wall clock hard cap
Every tool call logged with latency
Errors structured, not raw
Stop conditions checkable by code

The story

I once had an agent run for seventeen minutes because a single tool call was silently retrying. The client noticed first. That was the last time I shipped without a wall clock cap.

What Breaks In Production Agents

Failure one: tool timeouts

Failure two: fuzzy stop conditions

Failure three: unhandled errors poisoning the trajectory

The watchlist

The story

Further reading

The Consulting Shift I Am Making In Year Two

The Frontend Shift: Shipping Less JavaScript In Year Two

The Serverless Lesson I Would Write On A Sticky Note