Data Drift: Feature Distribution Shifts Over Time
Learners will simulate drift by gradually shifting an input feature, compute the Population Stability Index (PSI) and KL divergence, and set threshold-based alerts.
What Is Data Drift?
Data drift (also called covariate shift) occurs when the statistical distribution of input features changes after deployment compared to the distribution seen during training. A fraud detection model trained on 2022 transaction patterns may encounter very different transaction amounts and merchant categories by 2024. The model's learned decision boundaries no longer match the new data distribution, causing silent performance degradation that only becomes visible through monitoring.
Simulating Drift: Gradual Feature Shift
To study drift, we can simulate it by gradually shifting a feature's mean over time. In production, this might represent seasonal changes in user behaviour, economic shifts affecting purchasing power, or evolving fraud patterns. Plotting the feature distribution for each week makes the shift visible, and comparing a drift metric such as PSI against a threshold identifies the week in which the shift is large enough to trigger a retraining alert.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)

# Training distribution: income ~ Normal(50000, 10000)
train_income = np.random.normal(50000, 10000, 5000)

# Production weeks 1-12: mean gradually shifts from 50k to 65k
prod_weeks = []
for week in range(1, 13):
    shifted_mean = 50000 + week * 1250  # +1250 per week
    week_data = np.random.normal(shifted_mean, 10000, 500)
    prod_weeks.append({'week': week, 'income': week_data})

print('Training mean:', train_income.mean().round(0))
for pw in [prod_weeks[0], prod_weeks[5], prod_weeks[-1]]:
    print(f'Week {pw["week"]} mean: {pw["income"].mean().round(0)}')
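The lesson objectives mention PSI, KL divergence, and threshold-based alerts, so here is a minimal sketch of how those could be computed on the simulated weeks. It is one possible approach, not a definitive implementation. The helper names (binned_fractions, psi, kl_divergence, drift_status), the choice of 10 quantile bins taken from the training sample, and the 0.10 / 0.25 cut-offs (a commonly quoted PSI rule of thumb) are all assumptions introduced for illustration. The block re-creates the same simulated data so it can run on its own.

```python
import numpy as np

def binned_fractions(reference, sample, n_bins=10, eps=1e-6):
    """Share of each sample falling into quantile bins of the reference."""
    # Interior bin edges come from the reference (training) quantiles, so
    # values outside the training range land in the first or last bin.
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]

    def fractions(values):
        idx = np.searchsorted(inner_edges, values, side="right")
        counts = np.bincount(idx, minlength=n_bins)
        # Clip so empty bins do not produce log(0) or division by zero.
        return np.clip(counts / len(values), eps, None)

    return fractions(reference), fractions(sample)

def psi(reference, sample, n_bins=10):
    """Population Stability Index: sum((cur - ref) * ln(cur / ref))."""
    ref, cur = binned_fractions(reference, sample, n_bins)
    return float(np.sum((cur - ref) * np.log(cur / ref)))

def kl_divergence(reference, sample, n_bins=10):
    """Binned KL(sample || reference): sum(cur * ln(cur / ref))."""
    ref, cur = binned_fractions(reference, sample, n_bins)
    return float(np.sum(cur * np.log(cur / ref)))

def drift_status(psi_value, warn=0.10, alert=0.25):
    """Map a PSI value to a status. Thresholds are assumed rules of thumb."""
    if psi_value >= alert:
        return "ALERT"
    if psi_value >= warn:
        return "WARN"
    return "OK"

# Re-create the simulated data (same seed and parameters as the lesson code).
np.random.seed(42)
train_income = np.random.normal(50000, 10000, 5000)

report = []
for week in range(1, 13):
    week_income = np.random.normal(50000 + week * 1250, 10000, 500)
    p = psi(train_income, week_income)
    k = kl_divergence(train_income, week_income)
    status = drift_status(p)
    report.append((week, p, k, status))
    print(f"Week {week:2d}  PSI={p:.3f}  KL={k:.3f}  {status}")
```

PSI is the symmetric version of KL divergence: it equals KL(sample || reference) plus KL(reference || sample), which is why it is always at least as large as the one-directional KL value printed beside it. Both metrics depend on the binning, so the thresholds should be tuned per feature rather than treated as fixed.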