Schema Validation and Assertions
Write assertion checks on column ranges, non-null constraints, and unique keys that run at the start of every pipeline.
Why Schema Validation Matters
A data pipeline processes new data automatically, often without human review. If the schema of the input file changes — a column is renamed, a date format shifts, or a new category appears — the pipeline should fail loudly rather than produce silently wrong output. Schema validation with assertions is the mechanism that enforces contracts between data producers and data consumers, catching problems at the earliest possible moment.
import pandas as pd
import numpy as np
df = pd.read_parquet('sales_treated.parquet')
print('Loaded:', df.shape)
print(df.dtypes)
Required Column Checks
The most fundamental validation is checking that all required columns are present. Store the expected column set in a config and assert that it is a subset of the actual columns. This check catches renames and drops immediately at pipeline start, before any downstream code attempts to access missing columns and raises a confusing KeyError deep in the pipeline.
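# Columns that every downstream step relies on.
REQUIRED_COLUMNS = {'order_id', 'order_date', 'customer_id',
                    'product', 'category', 'quantity', 'unit_price', 'revenue'}
# The set difference leaves only the expected names that are absent from the data.
missing = REQUIRED_COLUMNS - set(df.columns)
assert not missing, f'Missing required columns: {missing}'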
print('All required columns present.')
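The lesson summary also names range checks, non-null constraints, and unique keys. The following is a minimal sketch of how those three checks might look for the columns listed above. The function name `validate_sales`, the choice of which columns must be non-null, the assumption that `order_id` identifies exactly one row, and the numeric bounds are all illustrative assumptions rather than rules taken from the dataset.

```python
import pandas as pd


def validate_sales(df: pd.DataFrame) -> list[str]:
    """Return human-readable validation failures (empty list if clean)."""
    errors = []

    # Non-null constraint: key fields must never be missing (assumed rule).
    for col in ('order_id', 'order_date', 'customer_id'):
        n_null = int(df[col].isna().sum())
        if n_null:
            errors.append(f'{col}: {n_null} null value(s)')

    # Unique key: assumes one row per order_id, which may not hold
    # if the data stores one row per order line.
    n_dupes = int(df['order_id'].duplicated().sum())
    if n_dupes:
        errors.append(f'order_id: {n_dupes} duplicate value(s)')

    # Range checks: the bounds are illustrative. Comparisons with NaN
    # evaluate to False, so missing values also fail these checks.
    if not (df['quantity'] > 0).all():
        errors.append('quantity: values must be > 0')
    if not (df['unit_price'] >= 0).all():
        errors.append('unit_price: values must be >= 0')

    return errors


# Tiny hypothetical sample that breaks each kind of rule.
sample = pd.DataFrame({
    'order_id': [1, 2, 2],
    'order_date': pd.to_datetime(['2024-01-01', None, '2024-01-03']),
    'customer_id': ['a', 'b', 'c'],
    'quantity': [1, 0, 3],
    'unit_price': [9.99, 5.00, -1.00],
})

problems = validate_sales(sample)
for p in problems:
    print(p)
```

In a pipeline, the result could be enforced right after loading with `assert not problems, '\n'.join(problems)`. Collecting every failure before asserting reports all schema problems in one run instead of stopping at the first. Note that Python skips `assert` statements when run with the `-O` flag, so raising `ValueError` explicitly is a reasonable alternative where that flag might be used.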