Schema Evolution and Backward Compatibility
Manage breaking schema changes in long-running extraction pipelines by versioning schemas, migrating historical extractions, and running parallel validation during transitions.
Why Schemas Change Over Time
Extraction schemas are not static. Business requirements evolve, new document types appear, and you discover fields you should have captured from the start. Changing a schema in a live pipeline creates a backward compatibility problem: existing extracted records use the old schema, while new records use the new one. Managing this transition safely is what schema evolution is about.
Versioning Your Schemas
Assign a version number to each schema and store it alongside every extracted record. When you change the schema, increment the version. This lets you query records by schema version, run migrations on old records, and maintain separate validation logic for each version. A simple string field schema_version in every output model is sufficient.
from typing import Literal

from pydantic import BaseModel


class InvoiceV1(BaseModel):
    schema_version: Literal['1.0'] = '1.0'
    vendor: str
    total_amount: float


class InvoiceV2(BaseModel):
    schema_version: Literal['2.0'] = '2.0'
    vendor: str
    vendor_tax_id: str | None = None  # new field, optional so old data still validates
    total_amount: float
    currency: str = 'USD'  # new field with default
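The stored schema_version is what makes migrating old records mechanical. The following is a minimal sketch of that idea, not a definitive implementation. It assumes historical records are kept as plain dicts (for example, decoded JSON), and the names migrate_v1_to_v2, MIGRATIONS, LATEST_VERSION, and upgrade_to_latest are hypothetical helpers introduced only for illustration. Defaulting currency to 'USD' mirrors the InvoiceV2 default, but whether that holds for your historical invoices is an assumption you would need to verify.

```python
from typing import Callable

# Assumption: stored records are plain dicts, e.g. decoded from JSON.
Record = dict[str, object]


def migrate_v1_to_v2(record: Record) -> Record:
    """Upgrade a 1.0 invoice record to 2.0 without mutating the input."""
    upgraded = dict(record)
    upgraded["schema_version"] = "2.0"
    # The tax ID was never extracted for historical records, so it stays unknown.
    upgraded.setdefault("vendor_tax_id", None)
    # Assumption: invoices extracted under 1.0 were denominated in USD.
    upgraded.setdefault("currency", "USD")
    return upgraded


# Hypothetical registry: each entry upgrades a record by exactly one version.
MIGRATIONS: dict[str, Callable[[Record], Record]] = {"1.0": migrate_v1_to_v2}
LATEST_VERSION = "2.0"


def upgrade_to_latest(record: Record) -> Record:
    """Apply migrations step by step until the record is at LATEST_VERSION."""
    version = record.get("schema_version")
    while version != LATEST_VERSION:
        step = MIGRATIONS.get(version)
        if step is None:
            raise ValueError(f"no migration path from schema_version {version!r}")
        record = step(record)
        version = record["schema_version"]
    return record


old = {"schema_version": "1.0", "vendor": "Acme", "total_amount": 120.5}
new = upgrade_to_latest(old)
```

Chaining one-step migrations means that adding a future version only requires one new function and one new registry entry. The upgraded dict can then be validated against the InvoiceV2 model before it replaces the old record.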