Handling Partial and Missing Data
Design schemas with Optional fields and confidence scores, implement fallback extraction strategies for ambiguous documents, and log low-confidence extractions for human review.
The Reality of Incomplete Documents
Real-world documents rarely contain every field your schema expects. An invoice might be missing a PO number, a resume might omit dates, and a news article might not mention a location. Designing your extraction schema to handle partial and missing data gracefully is as important as extracting what is present.
Optional Fields in Pydantic
Mark fields that might not appear in every document as Optional[type] and give them a None default. In Pydantic v2, Optional on its own makes a field nullable but still required, so the None default is what allows a document to omit it. A field description such as 'Minimum salary if stated' then tells the model to return None when the information is absent rather than invent a value. Prefer None over an empty string for missing data: it is easier to filter downstream.
from typing import Optional

from pydantic import BaseModel, Field


class JobPosting(BaseModel):
    title: str
    company: str
    salary_min: Optional[float] = Field(None, description='Minimum salary if stated')
    salary_max: Optional[float] = Field(None, description='Maximum salary if stated')
    remote: Optional[bool] = Field(None, description='True if remote, False if on-site, None if unspecified')
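To make the downstream side concrete, here is a minimal sketch of how a partially filled JobPosting might be validated, inspected for missing fields, and flagged for human review. It assumes Pydantic v2. The `confidence` field, the `REVIEW_THRESHOLD` value of 0.7, and the helpers `missing_fields` and `needs_review` are hypothetical names introduced for illustration; they are not part of the schema shown in this lesson.

```python
import logging
from typing import Optional

from pydantic import BaseModel, Field

logger = logging.getLogger("extraction")

# Assumed cutoff for illustration; tune it against your own review data.
REVIEW_THRESHOLD = 0.7


class JobPosting(BaseModel):
    title: str
    company: str
    salary_min: Optional[float] = Field(None, description="Minimum salary if stated")
    salary_max: Optional[float] = Field(None, description="Maximum salary if stated")
    remote: Optional[bool] = Field(
        None, description="True if remote, False if on-site, None if unspecified"
    )
    # Hypothetical addition: a self-reported confidence score between 0 and 1.
    confidence: float = Field(
        1.0, ge=0.0, le=1.0, description="Confidence in the extraction, 0 to 1"
    )


def missing_fields(posting: JobPosting) -> list[str]:
    """Return the names of fields whose value is None."""
    return [name for name, value in posting.model_dump().items() if value is None]


def needs_review(posting: JobPosting, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Log and flag extractions whose confidence falls below the threshold."""
    if posting.confidence < threshold:
        logger.warning(
            "Low-confidence extraction (%.2f) for %r at %r; missing fields: %s",
            posting.confidence,
            posting.title,
            posting.company,
            missing_fields(posting),
        )
        return True
    return False


# A payload as an extractor might return it: no salary, remote unspecified.
partial = JobPosting.model_validate(
    {"title": "Data Engineer", "company": "Acme", "confidence": 0.55}
)

# None is unambiguous, so filtering is a simple identity check.
postings_with_salary = [p for p in [partial] if p.salary_min is not None]
```

Because absent values are None rather than empty strings, `missing_fields` needs only an identity check, and `remote` keeps its three distinct states: True, False, and unspecified.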