AI Prompt Engineering · Lesson

Schema-Driven Data Extraction

Providing JSON schemas in prompts to get consistently structured output.

Why Schema-Driven Extraction?

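When you tell a model to "extract the important data," the output varies from run to run in field names, nesting, and formatting. When you provide a JSON schema and ask it to "extract data matching this exact schema," the output is far more consistent and machine-readable, with predictable keys and types. A prompt alone does not enforce the schema, so production code should still parse and validate the result.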

Schema-driven extraction is the pattern used in production systems that process invoices, contracts, medical records, meeting notes, and any document where structured data must be reliably extracted from unstructured text.

Providing the Schema in the Prompt

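The schema lives directly in the prompt, and the model treats it as the output contract. The example below uses an informal template (field names mapped to type descriptions) rather than a formal JSON Schema document: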

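import json

import anthropic

# Placeholder key for illustration; in practice, load it from an environment variable instead of hard-coding it.
client = anthropic.Anthropic(api_key='YOUR_API_KEY')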

INVOICE_SCHEMA = '''
{
  "invoice_number": "string",
  "vendor_name": "string",
  "vendor_address": "string or null",
  "invoice_date": "YYYY-MM-DD",
  "due_date": "YYYY-MM-DD or null",
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "total": "number"
    }
  ],
  "subtotal": "number",
  "tax": "number or null",
  "total_amount": "number",
  "currency": "3-letter ISO code e.g. USD"
}
'''

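def extract_invoice(invoice_text):
    # Embed the schema in the prompt and ask for JSON only, so the reply can be parsed directly.
    prompt = (
        'Extract structured data from this invoice.\n'
        'Return only JSON matching this schema exactly, with no extra text:\n'
        f'{INVOICE_SCHEMA}\n\nInvoice:\n{invoice_text}'
    )
    r = client.messages.create(
        model='claude-opus-4-5',
        max_tokens=1024,  # leave room for invoices with many line items
        messages=[{'role': 'user', 'content': prompt}],
    )
    # json.loads raises json.JSONDecodeError if the reply is not valid JSON.
    return json.loads(r.content[0].text)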

print('Invoice schema defined.')
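
Parsing the reply with json.loads only confirms that it is valid JSON, not that it matches the schema. The sketch below shows one way to check the result after extraction. The helper names parse_model_json and validate_invoice are hypothetical and not part of the Anthropic SDK. The rules are assumptions read off the schema template above: which fields are nullable, the YYYY-MM-DD date format, and a 3-letter uppercase currency code. Missing nullable keys are treated the same as null.

```python
import json
import re
from datetime import datetime

FENCE = "`" * 3  # Markdown code-fence marker that models sometimes wrap JSON in


def parse_model_json(text):
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()
        # Drop the opening fence line and, if present, the closing fence line.
        lines = lines[1:-1] if lines[-1].strip() == FENCE else lines[1:]
        cleaned = "\n".join(lines)
    return json.loads(cleaned)


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_date(value):
    if not isinstance(value, str) or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def validate_invoice(data):
    """Return a list of problems; an empty list means the data fits the template."""
    if not isinstance(data, dict):
        return ["top-level value: expected object"]
    problems = []
    for key in ("invoice_number", "vendor_name"):
        if not isinstance(data.get(key), str):
            problems.append(f"{key}: expected string")
    address = data.get("vendor_address")
    if address is not None and not isinstance(address, str):
        problems.append("vendor_address: expected string or null")
    if not _is_date(data.get("invoice_date")):
        problems.append("invoice_date: expected YYYY-MM-DD")
    if data.get("due_date") is not None and not _is_date(data["due_date"]):
        problems.append("due_date: expected YYYY-MM-DD or null")
    for key in ("subtotal", "total_amount"):
        if not _is_number(data.get(key)):
            problems.append(f"{key}: expected number")
    if data.get("tax") is not None and not _is_number(data["tax"]):
        problems.append("tax: expected number or null")
    currency = data.get("currency")
    if not (isinstance(currency, str) and re.fullmatch(r"[A-Z]{3}", currency)):
        problems.append("currency: expected 3-letter code")
    items = data.get("line_items")
    if not isinstance(items, list):
        problems.append("line_items: expected array")
    else:
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                problems.append(f"line_items[{i}]: expected object")
                continue
            if not isinstance(item.get("description"), str):
                problems.append(f"line_items[{i}].description: expected string")
            for key in ("quantity", "unit_price", "total"):
                if not _is_number(item.get(key)):
                    problems.append(f"line_items[{i}].{key}: expected number")
    return problems
```

A caller could run validate_invoice(parse_model_json(reply_text)) on the model's reply and retry the request, or route the document to manual review, whenever the returned list is not empty. For stricter guarantees, a formal JSON Schema with a dedicated validator library is a common alternative to hand-written checks like these.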
