Validating and Retrying Bad Outputs
Implement a validation layer that checks extracted data against business rules, automatically retries with corrective feedback when validation fails, and logs failure patterns.
Why LLM Outputs Need Validation
Even with structured outputs and Pydantic schemas, LLM extraction can produce outputs that are syntactically valid but semantically wrong. Consider a confidence score of 1.5 (outside the 0-1 range), a price of -99.99, a date string that cannot be parsed, or a phone number with letters. All of these pass JSON parsing, and they pass basic type checks too unless the schema declares explicit constraints, yet they fail your business rules.
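To make the gap concrete, here is a minimal sketch using only the standard library. The payload, the field names, and the `find_problems` helper are all hypothetical, chosen to mirror the four examples above; in a real pipeline the same checks could live in Pydantic validators instead.

```python
import json
import re
from datetime import datetime

# Hypothetical LLM output: well-formed JSON with correctly typed values.
raw = '{"confidence": 1.5, "price": -99.99, "date": "2024-13-45", "phone": "555-CALL-NOW"}'

record = json.loads(raw)  # succeeds, because parsing only checks syntax


def find_problems(rec: dict) -> list[str]:
    """Return one message for every business rule the record breaks."""
    problems = []
    if not 0.0 <= rec["confidence"] <= 1.0:
        problems.append("confidence must be between 0 and 1")
    if rec["price"] < 0:
        problems.append("price must not be negative")
    try:
        datetime.strptime(rec["date"], "%Y-%m-%d")
    except ValueError:
        problems.append("date must be a real YYYY-MM-DD date")
    # Assumed phone rule: digits plus common separators only.
    if not re.fullmatch(r"[0-9+() -]{7,20}", rec["phone"]):
        problems.append("phone must contain only digits and separators")
    return problems


problems = find_problems(record)
for message in problems:
    print(message)
```

Every field in this record parses cleanly, and every one of them is rejected by a rule that JSON parsing knows nothing about.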
Validation is a separate concern from extraction. Extraction asks: 'Did we get structured data?' Validation asks: 'Is the structured data correct and usable?' Both layers are necessary for a production-grade pipeline. Think of it as a two-stage filter: the LLM extracts, your validator accepts or rejects.
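The two-stage filter, combined with retries and failure logging, might be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `extract_with_retry`, `validate`, and `fake_extract` are hypothetical names, and `fake_extract` stands in for a real LLM call that would receive the error messages as corrective feedback in its prompt.

```python
from collections import Counter
from typing import Callable

failure_log = Counter()  # tallies which rules fail most often


def validate(record: dict) -> list[str]:
    """Stage two: accept (empty list) or reject (list of error messages)."""
    errors = []
    if not 0.0 <= record.get("confidence", -1) <= 1.0:
        errors.append("confidence must be between 0 and 1")
    if record.get("price", -1) < 0:
        errors.append("price must not be negative")
    return errors


def extract_with_retry(
    text: str,
    extract: Callable[[str, list[str]], dict],
    max_attempts: int = 3,
) -> dict:
    """Call the extractor, validate, and retry with the errors as feedback."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        record = extract(text, feedback)  # stage one: extraction
        errors = validate(record)         # stage two: validation
        if not errors:
            return record
        failure_log.update(errors)        # record the failure pattern
        feedback = errors                 # passed to the next attempt
    raise ValueError(f"still invalid after {max_attempts} attempts: {feedback}")


def fake_extract(text: str, feedback: list[str]) -> dict:
    """Hypothetical stand-in for an LLM call; corrects itself once given feedback."""
    if feedback:
        return {"confidence": 0.9, "price": 19.99}
    return {"confidence": 1.5, "price": 19.99}


result = extract_with_retry("Widget, $19.99", fake_extract)
print(result)
print(failure_log.most_common())
```

Capping the number of attempts keeps a persistently bad input from looping forever, and the counter gives a rough view of which rules the model violates most often.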
Layers of Validation
A robust output validation system operates at multiple levels:
- Schema validation (Pydantic): correct field types, required fields present, enums match allowed values — handled automatically by structured outputs
- Format validation: phone numbers match a regex, emails are valid, dates are parseable, amounts are within realistic ranges
- Business logic validation: invoice total equals sum of line items, end date is after start date, quantity is a positive integer
- Cross-field validation: a field's allowed value depends on another field's value (e.g., a discount amount cannot exceed the subtotal)
- Semantic validation: extracted company name matches a known company in your database
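The layers after schema validation could be arranged roughly like this. It is a sketch that assumes schema validation has already run, so field types are trusted and dates arrive as `date` objects. The invoice fields, the `validate_invoice` function, and the `KNOWN_COMPANIES` set (a stand-in for a database lookup) are hypothetical, and the email regex is deliberately loose.

```python
import re
from datetime import date

KNOWN_COMPANIES = {"Acme Corp", "Globex"}  # stand-in for a database lookup
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # loose shape check only


def validate_invoice(inv: dict) -> dict[str, list[str]]:
    """Run each layer in turn and return error messages grouped by layer."""
    errors = {"format": [], "business": [], "cross_field": [], "semantic": []}

    # Format: does the value have the right shape?
    if not EMAIL_RE.fullmatch(inv["contact_email"]):
        errors["format"].append("contact_email is not a valid address")

    # Business logic: compare in whole cents to avoid float rounding noise.
    subtotal_cents = sum(round(item["amount"] * 100) for item in inv["line_items"])
    if round(inv["total"] * 100) != subtotal_cents:
        errors["business"].append("total does not equal the sum of line items")
    if inv["end_date"] <= inv["start_date"]:
        errors["business"].append("end_date must be after start_date")

    # Cross-field: one field bounds another.
    if round(inv["discount"] * 100) > subtotal_cents:
        errors["cross_field"].append("discount exceeds the subtotal")

    # Semantic: does the value refer to something that actually exists?
    if inv["company"] not in KNOWN_COMPANIES:
        errors["semantic"].append("company is not in the known-company list")

    return {layer: msgs for layer, msgs in errors.items() if msgs}


bad_invoice = {
    "company": "Initech",
    "contact_email": "billing@initech",
    "line_items": [{"amount": 19.99}, {"amount": 5.01}],
    "total": 30.00,
    "discount": 35.00,
    "start_date": date(2024, 3, 1),
    "end_date": date(2024, 2, 1),
}
report = validate_invoice(bad_invoice)
for layer, messages in report.items():
    print(layer, messages)
```

Grouping errors by layer makes it easier to decide what to do next: format and business errors are often worth a retry with feedback, while a semantic miss may call for human review.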