도메인 5 · 시험 비중 15%

Domain 5 Study Guide: Context Management & Reliability

Domain 5: Context Management & Reliability

Domain 5 is 15% of the CCA-Foundations exam. It is the domain that separates a demo from a production system. A single prompt that works once tells you nothing about whether a multi-step agent stays correct after fifty tool calls, three summarization passes, and a context reset. This domain tests how you engineer the context window, propagate errors honestly, annotate provenance and coverage, persist state across boundaries, and control token cost while keeping a human in the loop where it matters.

Blueprint at a glance

The objectives cluster into three areas, mirroring the course modules:

Context engineering — fighting lost-in-the-middle, summarizing without losing facts, and budgeting the window.
Reliability & provenance — structured error propagation, coverage/citation annotations, and durable scratchpads.
Quality assurance & cost — stratified human review, blocking vs post-hoc gates, and prompt caching.

Context engineering

Lost-in-the-middle

Recall across a long context is U-shaped: a model attends most strongly to the beginning and end of its input and weakest to the middle. A 200K-token window does not mean 200K tokens of reliable recall. The architectural responses are:

Place the most critical instructions and highest-relevance context at the ends of the prompt.
Restate key constraints near the end, just before the model must act.
Reduce middle bloat by trimming and by delegating verbose discovery to subagents — do not rely on a big window to "remember everything."

A bigger context window is a capacity, not a guarantee. The exam consistently rewards answers that engineer placement over answers that just add more tokens.

Progressive summarization

When a conversation outgrows the window, you compact older turns into a summary. This is necessary but lossy — every pass discards detail and errors compound across repeated summarization. Best practice:

Keep durable facts on disk (a scratchpad or JSON snapshot), and only summarize prose.
Trim removable tool noise first, then summarize what remains.
Treat summarization as a last resort behind lossless trimming and subagent delegation, never as a way to recover lost detail.

Token budget

In an agent, the dominant cost is usually tool output re-sent every turn, not the system prompt. Budget the window into regions (instructions, recent turns, retrieved context, tool output) with ceilings, and evict to protect instructions and recent turns. The optimization order is:

Trim at the source (lossless) — strip null fields, truncate logs, keep only needed columns.
Delegate verbose discovery to a subagent to isolate the firehose from the parent window.
Cache the stable prefix (see below).
Summarize only as a final step.

Reliability & provenance

Structured error propagation

The single most important reliability rule in this domain: never conflate an access failure with a valid empty result. "I could not read the file" and "the file is empty" are different facts and must be different outcomes. In the Messages API, a tool returns a failure with is_error: true and a structured, typed payload:

json

Classify errors (transient / validation / permission / not_found) so the model can react correctly — retry a transient timeout, escalate a permission error. Propagate failures up the stack; do not swallow them or silently substitute a default. Fire escalation triggers — a non-retryable error, an exhausted retry budget, a high-stakes or irreversible action, or non-convergence — to hand off to a human instead of guessing.

Provenance and coverage annotations

Trustworthy agent output carries provenance (source, tool, line, timestamp — ideally real citations) and coverage (how much was examined vs. the total, what was skipped, and a confidence signal). Require these in the output schema, not as optional prose the model may forget. Coverage is what makes an empty result believable: it lets you distinguish "verified absent" from "not examined," the same honesty principle as the access-failure-vs-valid-empty distinction.

json

Scratchpads and durable state

The context window is ephemeral — compacted, reset, and not shared across subagent boundaries. Durable state belongs in files, not the window:

A scratchpad (markdown TODO / decisions) for human-readable working memory.
A JSON state snapshot (cursor, phase, processed IDs) for resumable machine state.

Write state after each unit of progress and make resume idempotent so a restart re-reads the snapshot instead of redoing work. Files are also the clean subagent handoff channel, isolating verbose discovery from the parent's window.

Quality assurance and cost

Stratified human review

You cannot human-review everything in a high-volume pipeline, and random sampling under-represents rare, high-risk cases. Use stratified sampling to over-sample risky strata. Choose the gate by stakes:

Block (synchronously) on high-stakes / irreversible actions — a human must approve before execution.
Post-hoc sample for quality and drift monitoring on low-stakes output.

Auto-route every escalation and is_error case to review, and run the non-blocking audit through the Message Batches API — it is cheaper and asynchronous. Never use a batch for the blocking gate, because batches are not real-time.

Prompt caching

Prompt caching is the top token-budget lever for long, stable prefixes. The cacheable prefix follows the order tools → system → messages, and the cache matches the longest exact prefix — so anything that changes invalidates that block and everything after it. The rule: stable content first, volatile content (timestamps, IDs, the new turn) last. Add cache_control: { type: "ephemeral" } to the last block of each stable section (up to 4 breakpoints). Reads cost ~0.1x input (a 90% discount); writes ~1.25x (5-min TTL) or ~2x (1-hour TTL). Below the minimum length it silently no-ops — verify with cache_read_input_tokens in the usage object.

How it ties together

Domain 5 is a single coherent idea: a long-running system must stay honest about what it knows and degrade gracefully when it doesn't. Engineer placement against lost-in-the-middle, prefer lossless trimming over lossy summarization, propagate typed errors instead of guessing, annotate provenance and coverage so outputs are auditable, persist durable state in files, sample reviews by risk, and cache the stable prefix to make all of it affordable.

시험 팁

✓When a question describes degraded recall in a long prompt, the answer is to engineer placement — put critical instructions/context at the start and end and restate constraints near the end — not "use a larger context window."
✓Treat "access failure" and "valid empty result" as different outcomes. Any option that returns a default, empty list, or "not found" for a permission/timeout error is wrong; the correct answer uses is_error: true with a typed payload and propagates it.
✓Order optimizations: trim at the source (lossless) first, then delegate verbose discovery to a subagent, then prompt-cache the stable prefix, and only summarize as a last resort. Summarization is lossy and errors compound.
✓Prompt caching prefix order is tools -> system -> messages, matched on the longest exact prefix. Volatile content (timestamps, IDs, the new turn) must go LAST or it invalidates the cache.
✓Reads ~0.1x, writes ~1.25x (5-min) or ~2x (1-hour TTL); below the minimum cacheable length caching silently no-ops. Verify it works by reading cache_read_input_tokens in the usage object.
✓Use blocking (synchronous) human review for high-stakes/irreversible actions and post-hoc stratified sampling for quality/drift monitoring. Run the non-blocking audit via the Message Batches API — never the blocking gate.
✓Provenance (citations/source/timestamp) and coverage (examined vs total, skipped, confidence) belong in the output schema as required fields, not optional prose. Coverage is what makes an empty result trustworthy.
✓Durable state goes in files (scratchpad + JSON snapshot), not the context window, because the window is compacted/reset and not shared across subagent boundaries. Make resume idempotent.

안티패턴

✗"Just use the 200K context window so the model remembers everything." Wrong — recall is U-shaped (lost-in-the-middle); a large window is capacity, not a recall guarantee, and middle content is the least reliable.
✗Swallowing a tool failure and returning a default value or empty result. Wrong — it erases the distinction between "could not access" and "genuinely empty," causing the agent to act confidently on a false premise.
✗Summarizing aggressively to save tokens as the first move. Wrong — summarization is lossy and errors compound across passes; lossless trimming and subagent delegation should come first, with durable facts kept on disk.
✗Putting a timestamp, request ID, or the latest user turn near the top of the prompt to "give the model fresh context." Wrong — volatile content high in the tools->system->messages prefix invalidates the entire prompt cache.
✗Assuming caching is active without checking. Wrong — below the minimum cacheable length it silently no-ops with no error; you must confirm via cache_read_input_tokens. A changing prefix keeps reads at 0.
✗Random-sampling outputs for quality review in a high-volume pipeline. Wrong — random sampling under-represents rare, high-risk cases; use stratified sampling that over-samples risky strata.
✗Using the Message Batches API as a blocking approval gate before a high-stakes action. Wrong — batches are asynchronous and not real-time; high-stakes/irreversible actions need a synchronous human-in-the-loop gate.
✗Storing long-running agent state in the conversation history and expecting it to survive. Wrong — the window is compacted/reset and isolated per subagent; resumable state must be persisted to files (JSON snapshot) with idempotent resume.