Agent Observability: Seeing What Your Agents Actually Do
If you cannot see what your agent is doing, you cannot fix it when it breaks. Observability is not optional for production agents. It is the difference between a system you can run for a year and one you take down after a bad week. Here is the minimum I ship with every agent deployment.
The Four Layers
My agent observability has four layers, roughly in order of importance:
- Run logs: every model call, every tool call, every decision, stamped with a run ID
- Cost tracking: tokens and dollars per run, per user, per endpoint
- Failure alerts: structured errors piped to a channel I actually read
- Replay capability: the ability to re-run a past session from the logs
If you have only the first two, you can investigate incidents after the fact. Add the third and you catch problems while they are small. Add the fourth and you can debug without reproducing the problem in production.
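Replay is the least obvious of the four, so here is a minimal sketch of one way to do it. It assumes logs are stored as JSON lines with `run_id`, `step`, `type`, `tool`, `args`, and `result_summary` fields, and that tools can be called again safely. The function names and the `tools` registry are hypothetical, and this covers tool calls only. Replaying model calls would also need the stored prompts.

```python
import json


def load_run(log_lines, run_id):
    """Return one run's records in step order from JSON-lines logs."""
    records = [json.loads(line) for line in log_lines if line.strip()]
    run = [r for r in records if r.get("run_id") == run_id]
    return sorted(run, key=lambda r: r["step"])


def replay_run(records, tools, summarize):
    """Re-run each logged tool call and report steps whose summary changed."""
    mismatches = []
    for rec in records:
        if rec.get("type") != "tool_call":
            continue  # model calls are skipped in this sketch
        result = tools[rec["tool"]](**rec["args"])
        summary = summarize(result)
        if summary != rec.get("result_summary"):
            mismatches.append((rec["step"], rec.get("result_summary"), summary))
    return mismatches


# Example: a logged run replayed against a stubbed tool (assumed data).
logs = [
    '{"run_id": "r_1", "step": 2, "type": "tool_call", "tool": "search_posts",'
    ' "args": {"query": "serverless"}, "result_summary": "8 matches"}',
    '{"run_id": "r_1", "step": 1, "type": "model_call"}',
    '{"run_id": "r_2", "step": 1, "type": "tool_call", "tool": "search_posts",'
    ' "args": {"query": "x"}, "result_summary": "1 matches"}',
]
tools = {"search_posts": lambda query: ["hit"] * 7}  # stub now returns 7 results
diff = replay_run(load_run(logs, "r_1"), tools, lambda r: f"{len(r)} matches")
```

Pointing `tools` at stubs or a staging copy keeps the replay off production. Any entry in `diff` marks a step where behavior has drifted since the original run.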
What Run Logs Contain
{
  "run_id": "r_2025_10_09_abc123",
  "user_id": "u_456",
  "timestamp": "2025-10-09T14:22:01Z",
  "step": 3,
  "type": "tool_call",
  "tool": "search_posts",
  "args": {"query": "serverless"},
  "duration_ms": 412,
  "result_summary": "8 matches"
}
Every line has the run ID so I can pull a complete session with a single query. I do not log the full prompt or full tool output at this layer. That goes to object storage behind a signed URL so sensitive content does not sit in CloudWatch forever.
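A record like this can be produced by a thin wrapper around each tool call. The sketch below is one possible shape, not a definitive implementation. `RunLogger`, its `sink` parameter, the `summarize` callback, and the run ID scheme are all hypothetical names chosen for illustration. It writes only a summary of the result and leaves the upload of full payloads to object storage out of scope.

```python
import json
import time
import uuid
from datetime import datetime, timezone


class RunLogger:
    """Emits one JSON line per step, each stamped with the same run ID."""

    def __init__(self, user_id, sink=print):
        # Hypothetical ID scheme; any unique string works.
        self.run_id = f"r_{uuid.uuid4().hex[:12]}"
        self.user_id = user_id
        self.sink = sink  # where log lines go; stdout by default
        self.step = 0

    def log_tool_call(self, tool, args, fn, summarize):
        """Run fn(**args), time it, and log a summary instead of the full output."""
        self.step += 1
        started = time.perf_counter()
        result = fn(**args)
        duration_ms = int((time.perf_counter() - started) * 1000)
        record = {
            "run_id": self.run_id,
            "user_id": self.user_id,
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "step": self.step,
            "type": "tool_call",
            "tool": tool,
            "args": args,
            "duration_ms": duration_ms,
            "result_summary": summarize(result),
        }
        self.sink(json.dumps(record))
        return result


def search_posts(query):
    """Stand-in tool for the example; returns eight fake matches."""
    return [f"post about {query}"] * 8


lines = []  # collect log lines in memory instead of printing them
log = RunLogger("u_456", sink=lines.append)
log.log_tool_call(
    "search_posts", {"query": "serverless"}, search_posts,
    lambda r: f"{len(r)} matches",
)
```

Passing the sink in keeps the wrapper testable. In a Lambda function, the default `print` is enough, since stdout already ends up in CloudWatch.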
The Dashboard That Matters
I do not build custom dashboards. I pipe everything into CloudWatch and write three saved queries:
- Active runs by endpoint: spot regressions in request volume
- Cost by user: catch runaway agents or abusive users
- Errors in last hour: the one I check before every deploy
That is enough. Fancy dashboards are fun to build and rarely get looked at. Three saved queries you actually run are worth more than a Grafana stack you do not.
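For reference, the three queries might look roughly like the strings below, kept in code so they can live in version control. This is a sketch under assumptions: the field names `endpoint`, `cost_usd`, and `level` are guesses at a log schema and do not come from the sample record, and `start_saved_query` is a hypothetical helper. It expects a boto3 CloudWatch Logs client (for example `boto3.client("logs")`) to be passed in, and it uses that client's `start_query` call.

```python
import time

# Logs Insights query strings. Field names such as `endpoint`, `cost_usd`,
# and `level` are assumptions about the log schema.
SAVED_QUERIES = {
    "active_runs_by_endpoint": (
        "stats count_distinct(run_id) as runs by endpoint"
        " | sort runs desc"
    ),
    "cost_by_user": (
        "stats sum(cost_usd) as total_usd by user_id"
        " | sort total_usd desc"
        " | limit 20"
    ),
    "errors_last_hour": (
        'filter level = "error"'
        " | fields @timestamp, run_id, @message"
        " | sort @timestamp desc"
        " | limit 50"
    ),
}


def start_saved_query(logs_client, log_group, name, window_seconds=3600):
    """Start a saved query over the trailing time window; returns the query ID."""
    end = int(time.time())
    response = logs_client.start_query(
        logGroupName=log_group,
        startTime=end - window_seconds,
        endTime=end,
        queryString=SAVED_QUERIES[name],
    )
    return response["queryId"]
```

The "last hour" part lives in the time range rather than in the query text, which is also how the console treats it. Results are fetched separately with the client's `get_query_results` call once the query finishes.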
The Cost of Skipping Observability
I shipped my first agent without proper logging. When a user reported a weird output I had nothing. I could not reproduce, could not explain, could not fix. That single incident cost me a day of work and some credibility. Observability is insurance. Pay the premium before you need to file the claim.
Read the CloudWatch Logs Insights docs for query patterns.