Technical
Why I Moved Everything to Haiku 4.5 for Triage
Solo operators cannot afford to run opus on every request. In 2025 I was burning money on model overkill for tasks that did not need it. Over the break I refactored my whole stack to put Haiku 4.5 in front of every agent call and let it decide when to escalate. The bill dropped by 62 percent and quality held. Here is the exact pattern I use now.
The Triage Pattern
Every incoming task hits a Haiku classifier first. It returns one of three labels:
trivial: handled inline by Haikustandard: routed to Sonnethard: escalated to Opus with full context
The classifier runs in roughly 200 milliseconds and costs fractions of a cent. It is basically free on the economics, and the savings on downstream calls are huge.
A Concrete Example
def triage(task: str) -> str:
resp = haiku.complete(
system='classify as trivial, standard, or hard',
prompt=task,
)
return resp.strip().lower()
def run(task: str) -> str:
label = triage(task)
if label == 'trivial':
return haiku.complete(prompt=task)
if label == 'standard':
return sonnet.complete(prompt=task)
return opus.complete(prompt=task)What Surprised Me
Roughly 70 percent of requests came back trivial. I had been paying Opus rates for tasks Haiku handles in one shot. That is the kind of realization you only get when you instrument the traffic. Another 20 percent were standard. Only 10 percent genuinely needed Opus. The old flat-model setup was paying top-tier rates for nine out of ten requests that did not need them.
The Escalation Rule
When Haiku is unsure, it must escalate. Uncertainty is cheap at the triage layer and expensive at the output layer. I wrote the prompt to bias toward escalation on ambiguity. Better to spend a few extra cents on a wrong escalation than to ship a bad output from a too-small model.
The Observability Part
I log every triage decision and sample outputs weekly. If Haiku classifies a class of tasks as trivial but the outputs are drifting, I catch it in the weekly review and tune the classifier. Without that observability loop, quality would erode silently.
When Not to Triage
If your workload is tiny or uniformly hard, the triage layer adds latency without saving money. Measure before you ship this pattern. The Anthropic models guide has current pricing and context windows so you can run your own break-even math.
RELATED READING
The Consulting Shift I Am Making In Year Two
After a year of writing and building, my consulting practice is changing shape. Shorter engagements. Sharper outcomes.
ReadThe Frontend Shift: Shipping Less JavaScript In Year Two
A year ago I reached for Next.js for everything. This year I often reach for nothing.
ReadThe Serverless Lesson I Would Write On A Sticky Note
After a year of shipping serverless projects, one rule explains most of the wins and all of the losses.
Read