Structured Data Extraction
Scenario 6: Structured Data Extraction
The problem
A fintech team is building an invoice-ingestion pipeline. Vendors email PDFs and scanned images; an upstream OCR step produces messy text. The team needs Claude to turn each document into a typed record — vendor, invoice_number, currency, total_cents, and a line_items array — that flows directly into a database and a payments service. Today an engineer prompts Claude with "Return the data as JSON" and parses the reply with a regex. It works ~85% of the time, then breaks on a markdown fence, a "Here is the JSON you requested:" preamble, a trailing apology, or a number formatted as $1,250.00 instead of an integer. The architect's job is to make extraction machine-parseable by construction, not by luck.
The right architecture
The exam-correct answer is structured output via tool_use with a JSON Schema, not prose "JSON only" prompting.
- Define a tool whose
input_schemais the JSON Schema for the target record. The schema is the contract: types,enumconstraints, andrequiredfields are declared once and reused for generation and validation. - Force the tool with
tool_choice: { "type": "tool", "name": "record_invoice" }. The model can now only respond by filling your schema — no preambles, no fences, no commentary. - Read the payload from the
tool_usecontent block'sinputfield. When a tool is forced,stop_reasonistool_use. Because you are using the tool only to shape output, you do not need a tool-result round-trip — readinputand stop.
Key decisions
- Schema design encodes business rules. Use
integercents instead of floats to avoid currency rounding bugs. Useenumto restrict currency to supported values. Mark only truly mandatory fieldsrequired; leave genuinely optional fields out so the model is not pushed to hallucinate a value. - Represent "missing" explicitly. A field that is sometimes absent should allow
null(or be omitted fromrequired). This is how you get the model to say "not present" instead of inventing an invoice number. This is the structured-output equivalent of giving Claude an "I don't know" exit. - Let the model reason, then force the tool. Structured output composes with chain-of-thought: allow a thinking step in text, then make the forced output tool the final action. The schema constrains only the last turn.
- Schema guarantees shape, not truth.
tool_usemakestotal_centsan integer; it cannot guarantee the integer is correct. Keep a separate value-validation step (cross-check that line items sum to the total; verify currency is plausible for the vendor) and route low-confidence records to human review. - Scale with batching. Thousands of invoices per night that are not latency-sensitive belong on the Message Batches API for roughly half the cost, with prompt caching on the shared schema/system instructions to cut input tokens further.
Common traps
- Prose "return only JSON". Brittle: markdown fences, preambles, and trailing text break parsers. The exam treats this as the wrong answer whenever a tool-schema approach is available.
- Confusing
tool_choicemodes.autolets the model decide;anyforces some tool but lets the model pick; only{ "type": "tool", "name": ... }forces your tool. For deterministic extraction you need the named form. - Reading the wrong block. With forced tools the answer is in the
tool_useblock'sinput, not atextblock. Code that scanstextwill see nothing. - Trusting the schema for correctness. Treating a passing schema as "verified data" skips semantic validation and ships wrong numbers to a payments system.
- Over-constraining. Marking optional fields
requiredforces the model to fabricate values to satisfy the schema.
How it maps to Domain 4
Domain 4 (Prompt Engineering & Structured Output) tests exactly this competency: shaping Claude's output so downstream code consumes it without fragile parsing. The exam-canonical pattern — JSON Schema as a tool input_schema plus tool_choice forcing, reading the result from the tool_use block, and adding a separate value-validation layer because the schema enforces structure rather than semantics — is the core takeaway. The batching/caching cost decision connects to platform foundations, but the extraction architecture itself is squarely Domain 4.