Scenario 1 — Customer Support Agent

The problem

A SaaS company wants a Claude-powered agent that resolves Tier-1 support tickets: look up an order, check a subscription, issue a refund within policy, and escalate anything it cannot safely handle. Conversations can run 20+ turns, the knowledge base is large, and a wrong refund costs real money. You must design an architecture that is reliable over long sessions, selects the right tool every time, and escalates instead of guessing.

The right architecture

This is a single-agent agentic loop, not a multi-agent swarm. The model receives a system prompt (role, policy, escalation rules), a set of tools, and the running conversation. Each turn Claude either replies to the user (stop_reason: "end_turn") or requests a tool (stop_reason: "tool_use"). Your harness executes the tool, appends a tool_result block, and calls the Messages API again. The loop continues until end_turn.

Define tools narrowly and idempotently:

json

The description is the selection mechanism — Claude picks tools by reading them, so encode preconditions and limits there. Keep tool_choice: "auto" so the model can also just talk; forcing a tool every turn breaks conversation. Reserve forced tool_choice for narrow sub-tasks (e.g. always classify the ticket first).

Key decisions

Model-driven, not a decision tree. Support intent is open-ended; let the model orchestrate tool calls rather than hard-coding an if/else flow. But guard money-moving actions deterministically: a hook / code check should re-verify the refund amount and window before the API call actually runs. Prompts are probabilistic; hooks are deterministic — put the hard money limit in the hook.
Structured tool errors. When get_order finds nothing, return { "isError": false, "result": "no matching order" } (a valid empty result) — not a thrown error. Distinguish that from an access failure (isError: true, "database unreachable, retry"). The model behaves very differently for "nothing found" vs "I couldn't look": one means ask the customer for another ID, the other means retry or escalate.
Human-in-the-loop escalation. Refunds over the cap, angry customers, or anything outside policy hand off to a human queue. Escalation is part of orchestration, not a failure mode.
Long-session reliability. A 20-turn chat plus full KB will overflow or drift. Trim verbose tool output before appending it (return the 3 relevant KB chunks, not 50). Keep durable facts (customer ID, verified order, decisions taken) in a small JSON state snapshot the agent re-reads, so important context survives summarization and avoids the lost-in-the-middle problem.
Token economics. The policy prompt and tool definitions are stable across every ticket — put a prompt-caching breakpoint on the system block so tools + system are reused. The cache hierarchy is Tools → System → Messages, max 4 breakpoints; cache reads cost ~0.1x input, so this is a large win at support volume.

Common traps

Overlapping tools (lookup_order vs get_order_details) make selection ambiguous — consolidate.
Returning raw stack traces as tool results pollutes context and confuses the model — return structured, summarized errors.
Letting the model decide the refund cap "by reading the prompt" — a jailbroken or confused turn issues a $5,000 refund. Enforce limits in code.
Stuffing the entire knowledge base into every request — costly and triggers lost-in-the-middle. Retrieve and trim.
Treating "no order found" as an exception that aborts the loop instead of a normal, recoverable result.

How it maps to the domains

Domain 1 (Agentic Architecture & Orchestration): the agentic loop, stop_reason handling, model-driven vs hard-coded control, hooks for determinism, and human-in-the-loop escalation placement.
Domain 2 (Tool Design & MCP Integration): descriptions as the selection mechanism, tool_choice usage, idempotent narrow tools, and structured isError results (transient vs validation vs permission). The order/refund backends are natural MCP servers.
Domain 5 (Context Management & Reliability): trimming tool output, JSON state snapshots across context boundaries, distinguishing access failure from valid empty result, and prompt caching for token budget on a stable prompt.

Exam focus: when a question pairs long support conversations with money-moving actions, the correct answer is almost always "let the model orchestrate tool calls, but enforce the hard limit with a deterministic hook, return structured tool errors, and escalate over the cap" — not "force a tool every turn" and not "trust the prompt to hold the limit."