返回情景
情景 6 / 8

结构化数据提取

Scenario 6: Structured Data Extraction

The problem

A fintech team is building an invoice-ingestion pipeline. Vendors email PDFs and scanned images; an upstream OCR step produces messy text. The team needs Claude to turn each document into a typed record — vendor, invoice_number, currency, total_cents, and a line_items array — that flows directly into a database and a payments service. Today an engineer prompts Claude with "Return the data as JSON" and parses the reply with a regex. It works ~85% of the time, then breaks on a markdown fence, a "Here is the JSON you requested:" preamble, a trailing apology, or a number formatted as $1,250.00 instead of an integer. The architect's job is to make extraction machine-parseable by construction, not by luck.

The right architecture

The exam-correct answer is structured output via tool_use with a JSON Schema, not prose "JSON only" prompting.

  1. Define a tool whose input_schema is the JSON Schema for the target record. The schema is the contract: types, enum constraints, and required fields are declared once and reused for generation and validation.
  2. Force the tool with tool_choice: { "type": "tool", "name": "record_invoice" }. The model can now only respond by filling your schema — no preambles, no fences, no commentary.
  3. Read the payload from the tool_use content block's input field. When a tool is forced, stop_reason is tool_use. Because you are using the tool only to shape output, you do not need a tool-result round-trip — read input and stop.
json

Key decisions

  • Schema design encodes business rules. Use integer cents instead of floats to avoid currency rounding bugs. Use enum to restrict currency to supported values. Mark only truly mandatory fields required; leave genuinely optional fields out so the model is not pushed to hallucinate a value.
  • Represent "missing" explicitly. A field that is sometimes absent should allow null (or be omitted from required). This is how you get the model to say "not present" instead of inventing an invoice number. This is the structured-output equivalent of giving Claude an "I don't know" exit.
  • Let the model reason, then force the tool. Structured output composes with chain-of-thought: allow a thinking step in text, then make the forced output tool the final action. The schema constrains only the last turn.
  • Schema guarantees shape, not truth. tool_use makes total_cents an integer; it cannot guarantee the integer is correct. Keep a separate value-validation step (cross-check that line items sum to the total; verify currency is plausible for the vendor) and route low-confidence records to human review.
  • Scale with batching. Thousands of invoices per night that are not latency-sensitive belong on the Message Batches API for roughly half the cost, with prompt caching on the shared schema/system instructions to cut input tokens further.

Common traps

  • Prose "return only JSON". Brittle: markdown fences, preambles, and trailing text break parsers. The exam treats this as the wrong answer whenever a tool-schema approach is available.
  • Confusing tool_choice modes. auto lets the model decide; any forces some tool but lets the model pick; only { "type": "tool", "name": ... } forces your tool. For deterministic extraction you need the named form.
  • Reading the wrong block. With forced tools the answer is in the tool_use block's input, not a text block. Code that scans text will see nothing.
  • Trusting the schema for correctness. Treating a passing schema as "verified data" skips semantic validation and ships wrong numbers to a payments system.
  • Over-constraining. Marking optional fields required forces the model to fabricate values to satisfy the schema.

How it maps to Domain 4

Domain 4 (Prompt Engineering & Structured Output) tests exactly this competency: shaping Claude's output so downstream code consumes it without fragile parsing. The exam-canonical pattern — JSON Schema as a tool input_schema plus tool_choice forcing, reading the result from the tool_use block, and adding a separate value-validation layer because the schema enforces structure rather than semantics — is the core takeaway. The batching/caching cost decision connects to platform foundations, but the extraction architecture itself is squarely Domain 4.