Scenario 8: Agentic AI Tools

The problem

A fintech team wants an internal "ops agent" that engineers invoke in chat to investigate and remediate incidents. A typical request: "Payments are failing in eu-west-1 — find out why and, if it's the rate limiter, raise the limit by 20%." The agent must read logs, query a metrics service, inspect a feature-flag store, and — only with approval — push a config change. The first prototype was unreliable: Claude sometimes guessed instead of querying, sometimes called the wrong tool, and once changed a production flag without asking. The architecture, not the model, was the problem.

This scenario is about turning a pile of capabilities into a trustworthy agentic system — the intersection of how the agent loops (Domain 1) and how each tool is contracted (Domain 2).

The right architecture

Model this as one agentic loop, not a chain of prompts. You give Claude the goal plus a set of tools and let it run the standard cycle: gather context → take action → verify → repeat. Each turn is one Messages API round-trip. Claude emits tool_use blocks and stops with stop_reason: "tool_use"; your application code runs the tool and appends a tool_result block in the next user message; you call the API again until Claude finishes with stop_reason: "end_turn". The single most-tested fact: the model requests tools, it never executes them — execution, the loop, and the termination check are yours.

Tools split into two trust tiers, and the split drives the design:

Read tools (get_logs, query_metrics, read_flag) — safe, idempotent, run with tool_choice: "auto" so the model decides when it has enough evidence.
Write tools (set_flag, scale_service) — irreversible side effects. These get human-in-the-loop (HITL) approval before execution.

The decisive insight: HITL is enforced in your code, not in the prompt. When Claude requests set_flag, the loop pauses before execution and surfaces the proposed call (name + arguments) to the engineer. "Please ask before changing prod" in the system prompt is a probabilistic hope; a code gate is a deterministic guarantee. This is the hooks-vs-prompts distinction from Domain 1: use deterministic control for anything that must always happen.

Tool contracts that the model can use (Domain 2)

The tool description is the selection mechanism — the model picks tools by reading their descriptions, so write them for an LLM, not a human:

json

Note the explicit "does NOT… use X instead" lines: overlapping or vague descriptions are the #1 cause of wrong tool selection. Enums constrain the model to valid inputs and let it self-correct.

Equally important: return structured errors, not prose. When metrics are unavailable, return a tool_result with is_error: true and a typed reason (transient → retry, validation → fix arguments, permission → escalate). Distinguish an access failure from a valid empty result: "no errors found" and "I couldn't reach the metrics API" must look different, or Claude will conclude the system is healthy when it is actually blind.

Common traps

Hard-coding the workflow as a fixed chain. Incidents are open-ended; a rigid pipeline can't adapt. Let the model decide the next read — but keep writes gated.
Trusting the prompt for safety. "Don't touch prod without asking" is not a control. Gate writes in code.
One mega-tool (do_ops_thing) with a free-text command. Ambiguous to select and impossible to validate. Prefer narrow, single-purpose tools.
Letting verbose log output flood the context. Trim tool results and delegate noisy discovery to a subagent so the coordinator's context stays clean (Domain 1 + Domain 5 overlap).
Forcing tools. tool_choice: "any" would stop the agent from ever giving a final answer; reserve forced choice for when a tool call is genuinely mandatory.

How it maps to the domains

Domain 1 (Agentic Architecture & Orchestration): the agentic loop, stop_reason handling, model-driven reads vs hard-coded write gates, deterministic hooks vs probabilistic prompts, and HITL placement.
Domain 2 (Tool Design & MCP Integration): descriptions as the selection mechanism, schema/enum design, tool_choice modes, and structured is_error results. If these tools live behind MCP servers, the same contracts apply — plus transport choice (stdio vs HTTP) and secrets via env vars.