Context Management & Reliability

Token Budget & Tool-Output Trimming

11 min de leitura

Token Budget & Tool-Output Trimming

Every token in the context window costs money, adds latency, and competes for the model's attention. A token budget is the discipline of treating the window as a scarce, allocated resource rather than a bottomless bucket. Reliability and cost are downstream of how well you spend it.

Where the tokens go

In agentic workloads, the dominant consumer is usually tool output, not the prompt. A single grep, a curl, or an MCP query can return thousands of tokens of which a handful matter. Because every tool result is appended to history and re-sent on the next turn, raw tool output is paid for repeatedly, multiplied across the loop.

Trim at the source

Strip tool output before it enters the conversation. Good targets:

  • Pagination headers, ANSI color codes, progress bars, timestamps.
  • Repeated boilerplate (license banners, identical row prefixes).
  • Fields the task does not use (return { id, status }, not the full 40-field object).
  • Caps: return the first/most-relevant N rows plus a count, not all 10,000.
text

Trimming is lossless for the reasoning when you keep what the task needs and drop only noise — strictly better than summarizing later.

Delegate verbose discovery to a subagent

When discovery is inherently noisy (reading 30 files to find one function), run it in a subagent with its own context window. The subagent burns tokens exploring, then returns a compact result; the parent never sees the raw firehose. This isolates verbose work and keeps the parent's window clean — the cleanest way to protect a token budget.

A simple budgeting model

Allocate the window into named regions and enforce ceilings:

text

When a region exceeds its ceiling, evict (drop oldest tool results, re-trim, or compact) rather than letting it crowd out instructions and recent turns — which would also trigger lost-in-the-middle.

Cache the stable prefix

Tokens you cannot remove, you can often cache. A long stable system prompt and tool definitions can be marked with prompt caching so repeat turns pay ~0.1x for that prefix instead of full price. (Covered in depth in the prompt-caching lesson.) Budgeting and caching are complementary: trim the variable part, cache the stable part.

Measure, don't guess

Use the API usage fields and Claude Code's context indicators to see real consumption: input_tokens, cache_read_input_tokens, output_tokens. Optimize the biggest line item first — usually tool output.

Exam focus

The dominant token cost in agents is tool output, re-sent every turn. Prefer, in order: (1) trim at the source (lossless), (2) delegate verbose discovery to a subagent to isolate the firehose, (3) cache the stable prefix, and only then (4) summarize. Budget the window into regions with ceilings and evict to protect instructions and recent turns from lost-in-the-middle.