Claude Platform Foundations

Streaming & Prompt Caching

12 min de leitura

Two features that every production Claude app eventually needs: streaming (so users see output as it's generated) and prompt caching (so you stop paying full price to re-send the same prefix). They solve different problems but both show up on the exam.

Streaming

By default, /v1/messages returns the whole message at once. With stream: true the API emits Server-Sent Events (SSE) as tokens are produced.

Why it matters beyond UX: a non-streaming request with a large max_tokens can exceed the SDK's HTTP timeout and fail. Stream any request with max_tokens above ~16K. Opus models can emit up to 128K output tokens, but only via streaming.

The key event sequence:

You accumulate content_block_delta events to build the output. The SDKs expose a helper — stream.get_final_message() / stream.finalMessage() — that reassembles the complete message for you, so you get streaming UX and the full object to inspect stop_reason and usage.

Prompt caching

Caching lets you reuse an expensive prefix across requests. The one rule that explains all behavior: caching is a prefix match. The cache key is the exact bytes of the rendered prompt up to each breakpoint, in render order tools -> system -> messages. Any byte change anywhere in the prefix invalidates everything after it.

Mark a breakpoint with cache_control:

json
  • Default TTL is 5 minutes; {"type": "ephemeral", "ttl": "1h"} extends it to 1 hour.
  • Max 4 breakpoints per request.
  • There is a minimum cacheable prefix (4096 tokens on Opus/Haiku 4.5, fewer on some Sonnet). Below it, nothing caches — silently.

Economics

Cache reads cost ~0.1x of base input price; cache writes cost ~1.25x (5-min TTL) or ~2x (1-hour). So a cached prefix pays off after just a couple of reuses.

Verifying it works

Check usage on the response:

FieldMeaning
cache_creation_input_tokenstokens written to cache (paid ~1.25x)
cache_read_input_tokenstokens served from cache (paid ~0.1x)
input_tokensuncached remainder (full price)

If cache_read_input_tokens stays zero across identical-prefix requests, a silent invalidator is at work — a datetime.now() or UUID in the system prompt, non-deterministic JSON key ordering, or a tool list that varies per request. Keep volatile content (timestamps, the per-request question) after the last breakpoint.

Exam focus

Stream for max_tokens > ~16K to avoid timeouts; know the event names (content_block_delta carries the text) and the get_final_message() helper. For caching, recite the prefix-match invariant and render order (tools -> system -> messages). Know cache_control: {type: "ephemeral"}, the 5-min/1-hour TTLs, the 4-breakpoint cap, and that a zero cache_read_input_tokens signals a silent invalidator — keep stable content first, volatile content last.