Stratified and Time-Series Cross-Validation
Learners will apply StratifiedKFold to preserve class ratios across folds and TimeSeriesSplit to keep future observations out of the training data in time-ordered datasets.
When Standard K-Fold Fails
Standard KFold assigns rows to folds without regard for class distribution or temporal order: by default it cuts the data into consecutive blocks, and with shuffle=True it assigns rows at random. This creates two serious problems: (1) for imbalanced datasets, some folds may contain very few or no minority-class examples, which makes scores vary sharply from fold to fold; (2) for time-ordered data, training on later observations to predict earlier ones leaks information and inflates CV scores well beyond real-world performance. Specialised CV strategies solve both problems and keep the evaluation honest.
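The temporal problem can be made visible by inspecting fold indices directly. The following is a minimal sketch, assuming twelve toy observations whose row index stands in for time; the helper leaks_future is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Twelve hypothetical observations, indexed in time order (0 = oldest).
X = np.arange(12).reshape(-1, 1)

def leaks_future(train_idx, test_idx):
    """True if any training row comes after the earliest test row."""
    return bool(train_idx.max() > test_idx.min())

# Shuffled KFold mixes past and future rows in every training set.
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
kfold_leaks = [leaks_future(tr, te) for tr, te in kfold.split(X)]

# TimeSeriesSplit trains on an expanding window that ends before the test block.
tscv = TimeSeriesSplit(n_splits=3)
tscv_leaks = [leaks_future(tr, te) for tr, te in tscv.split(X)]

for fold, (tr, te) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train={tr.tolist()} test={te.tolist()}")
print("KFold folds with leakage:", sum(kfold_leaks))
print("TimeSeriesSplit folds with leakage:", sum(tscv_leaks))
```

In every TimeSeriesSplit fold the training indices all precede the test indices, so the model is never fitted on rows that come after the ones it is asked to predict.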
Stratified K-Fold: Preserving Class Ratios
StratifiedKFold ensures that each fold contains approximately the same proportion of each class as the full dataset. For example, if 10% of your data is fraudulent, each fold will have approximately 10% fraud cases. This prevents the scenario where a test fold contains no fraud cases, which leaves metrics such as recall undefined for that round, or where a training split sees too few fraud cases to learn from. Use StratifiedKFold for classification problems in general, and especially when the classes are significantly imbalanced.
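To see the effect on class ratios, the share of positive labels in each test fold can be compared between the two splitters. This is a rough sketch, assuming 100 made-up labels with 10% positives sorted to the end of the array; the helper positive_share_per_fold is a hypothetical name used only here.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 100 samples, 10% positive ("fraud"), sorted so the
# positives sit at the end, as can happen when data is grouped by outcome.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features do not affect how the folds are drawn

def positive_share_per_fold(splitter):
    """Fraction of positive labels in each test fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]

plain = positive_share_per_fold(KFold(n_splits=5))
stratified = positive_share_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With these sorted labels, four of the five plain KFold test folds contain no positives at all and the last one is half positives, whereas every stratified test fold holds 10% positives.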
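The example below scores a random forest on the breast cancer dataset with stratified 5-fold cross-validation:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np
# Binary classification data with unequal class sizes
X, y = load_breast_cancer(return_X_y=True)
# shuffle=True mixes rows before splitting; random_state makes the folds reproducible
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# One accuracy score per fold (accuracy is the classifier's default scorer)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=skf)
print('Stratified 5-fold:', np.round(scores, 4))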
print('Mean:', round(scores.mean(), 4), 'Std:', round(scores.std(), 4))