Agent Observability: Seeing What Your Agents Actually Do

If you cannot see what your agent is doing, you cannot fix it when it breaks. Observability is not optional for production agents. It is the difference between a system you can run for a year and one you take down after a bad week. Here is the minimum I ship with every agent deployment.

The Four Layers

My agent observability has four layers, roughly in order of importance:

Run logs: every model call, every tool call, every decision, stamped with a run ID
Cost tracking: tokens and dollars per run, per user, per endpoint
Failure alerts: structured errors piped to a channel I actually read
Replay capability: the ability to re-run a past session from the logs

With only the first two, you can investigate incidents after the fact. Add the third and you catch problems while they are still small. Add the fourth and you can debug a failure without reproducing it in production.
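Replay is the least obvious of the four layers, so here is a minimal sketch of one way it could work. It assumes log records shaped like the example in the next section, and it serves recorded tool results back to the agent instead of calling the real tools. The names `load_run` and `ReplayTools` are hypothetical, and `result_summary` stands in for the full tool output, which in a real setup would be fetched from wherever the complete payloads are stored.

```python
import json
from typing import Any, Iterable


def load_run(log_lines: Iterable[str], run_id: str) -> list[dict[str, Any]]:
    """Parse JSON log lines and return one run's records, ordered by step."""
    records = [json.loads(line) for line in log_lines if line.strip()]
    return sorted(
        (r for r in records if r.get("run_id") == run_id),
        key=lambda r: r["step"],
    )


class ReplayTools:
    """Returns recorded tool results in order instead of calling real tools."""

    def __init__(self, records: list[dict[str, Any]]):
        self._calls = [r for r in records if r.get("type") == "tool_call"]
        self._cursor = 0

    def call(self, tool: str, args: dict[str, Any]) -> Any:
        if self._cursor >= len(self._calls):
            raise RuntimeError("replay diverged: more tool calls than recorded")
        recorded = self._calls[self._cursor]
        # A mismatch means the agent took a different path than the logged run.
        if recorded["tool"] != tool or recorded["args"] != args:
            raise RuntimeError(f"replay diverged at step {recorded['step']}")
        self._cursor += 1
        # Assumption: the summary stands in for the full stored output.
        return recorded["result_summary"]
```

Failing loudly on divergence is a deliberate choice in this sketch: a replay that silently takes a different path than the original run tells you nothing about the original failure.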

What Run Logs Contain

```json
{
  "run_id": "r_2025_10_09_abc123",
  "user_id": "u_456",
  "timestamp": "2025-10-09T14:22:01Z",
  "step": 3,
  "type": "tool_call",
  "tool": "search_posts",
  "args": {"query": "serverless"},
  "duration_ms": 412,
  "result_summary": "8 matches"
}
```

Every line carries the run ID, so I can pull a complete session with a single query. I do not log the full prompt or full tool output at this layer. Those go to object storage behind a signed URL, so sensitive content does not sit in CloudWatch forever.

The Dashboard That Matters

I do not build custom dashboards. I pipe everything into CloudWatch and write three saved queries:

Active runs by endpoint: spot regressions in request volume
Cost by user: catch runaway agents or abusive users
Errors in last hour: the one I check before every deploy

That is enough. Fancy dashboards are fun to build and rarely get looked at. Three saved queries you actually run are worth more than a Grafana stack you do not.
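To make the three queries concrete, here is roughly what each one computes, written as plain Python over parsed log records rather than in Logs Insights syntax. It is only an illustration: the `endpoint` and `cost_usd` fields and the `"error"` record type are assumptions that do not appear in the sample record above, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Any

Record = dict[str, Any]


def parse_ts(ts: str) -> datetime:
    """Parse timestamps in the log format, e.g. 2025-10-09T14:22:01Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def runs_by_endpoint(records: list[Record]) -> dict[str, int]:
    """Count distinct run IDs per endpoint (assumed field: endpoint)."""
    runs: dict[str, set[str]] = defaultdict(set)
    for r in records:
        if "endpoint" in r:
            runs[r["endpoint"]].add(r["run_id"])
    return {endpoint: len(ids) for endpoint, ids in runs.items()}


def cost_by_user(records: list[Record]) -> list[tuple[str, float]]:
    """Total spend per user, highest first (assumed field: cost_usd)."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["user_id"]] += r.get("cost_usd", 0.0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def errors_in_last_hour(records: list[Record], now: datetime) -> list[Record]:
    """Error records from the hour before `now` (assumed type value: "error")."""
    cutoff = now - timedelta(hours=1)
    return [
        r for r in records
        if r.get("type") == "error" and parse_ts(r["timestamp"]) >= cutoff
    ]
```

Sorting cost by user in descending order puts a runaway agent or an abusive account at the top of the result, which is the whole point of that query.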

The Cost of Skipping Observability

I shipped my first agent without proper logging. When a user reported a weird output I had nothing. I could not reproduce, could not explain, could not fix. That single incident cost me a day of work and some credibility. Observability is insurance. Pay the premium before you need to file the claim.

Read the CloudWatch Logs Insights docs for query patterns.
