Token Budget & Tool-Output Trimming
Token Budget & Tool-Output Trimming
Every token in the context window costs money, adds latency, and competes for the model's attention. A token budget is the discipline of treating the window as a scarce, allocated resource rather than a bottomless bucket. Reliability and cost are downstream of how well you spend it.
Where the tokens go
In agentic workloads, the dominant consumer is usually tool output, not the prompt. A single grep, a curl, or an MCP query can return thousands of tokens of which a handful matter. Because every tool result is appended to history and re-sent on the next turn, raw tool output is paid for repeatedly, multiplied across the loop.
Trim at the source
Strip tool output before it enters the conversation. Good targets:
- Pagination headers, ANSI color codes, progress bars, timestamps.
- Repeated boilerplate (license banners, identical row prefixes).
- Fields the task does not use (return
{ id, status }, not the full 40-field object). - Caps: return the first/most-relevant N rows plus a count, not all 10,000.
Trimming is lossless for the reasoning when you keep what the task needs and drop only noise — strictly better than summarizing later.
Delegate verbose discovery to a subagent
When discovery is inherently noisy (reading 30 files to find one function), run it in a subagent with its own context window. The subagent burns tokens exploring, then returns a compact result; the parent never sees the raw firehose. This isolates verbose work and keeps the parent's window clean — the cleanest way to protect a token budget.
A simple budgeting model
Allocate the window into named regions and enforce ceilings:
When a region exceeds its ceiling, evict (drop oldest tool results, re-trim, or compact) rather than letting it crowd out instructions and recent turns — which would also trigger lost-in-the-middle.
Cache the stable prefix
Tokens you cannot remove, you can often cache. A long stable system prompt and tool definitions can be marked with prompt caching so repeat turns pay ~0.1x for that prefix instead of full price. (Covered in depth in the prompt-caching lesson.) Budgeting and caching are complementary: trim the variable part, cache the stable part.
Measure, don't guess
Use the API usage fields and Claude Code's context indicators to see real consumption: input_tokens, cache_read_input_tokens, output_tokens. Optimize the biggest line item first — usually tool output.
Exam focus
The dominant token cost in agents is tool output, re-sent every turn. Prefer, in order: (1) trim at the source (lossless), (2) delegate verbose discovery to a subagent to isolate the firehose, (3) cache the stable prefix, and only then (4) summarize. Budget the window into regions with ceilings and evict to protect instructions and recent turns from lost-in-the-middle.