Detecting Imbalance: Class Distribution and Baseline Pitfalls
Learners will compute class frequencies, expose the dummy-classifier trap on an imbalanced dataset, and confirm that accuracy is a misleading metric here.
What Is Class Imbalance?
Class imbalance occurs when one class dominates the dataset. In fraud detection, fraudulent transactions may be 0.1% of all records; in medical diagnosis, a rare disease may affect 1 in 1,000 patients. In either case, a model that always predicts the majority class achieves 99.9% accuracy while never identifying a single instance of the rare, critical class, which makes accuracy a dangerously misleading metric in these scenarios.
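The trap described above can be reproduced in a few lines. The following is a minimal sketch using only NumPy; the label array is hypothetical (999 legitimate records and 1 fraudulent record, matching the 0.1% figure), and the "model" is simply a constant prediction of the most frequent class.

```python
import numpy as np

# Hypothetical labels: 999 legitimate records (0) and 1 fraudulent record (1)
y_true = np.array([0] * 999 + [1] * 1)

# Baseline "model": ignore the inputs and always predict the most frequent class
majority_class = np.bincount(y_true).argmax()
y_pred = np.full_like(y_true, majority_class)

# Accuracy: fraction of all predictions that match the true label
accuracy = (y_pred == y_true).mean()

# Minority-class recall: fraction of true positives the baseline actually catches
recall_minority = (y_pred[y_true == 1] == 1).mean()

print(f"Accuracy: {accuracy:.3f}")                      # Accuracy: 0.999
print(f"Minority-class recall: {recall_minority:.3f}")  # Minority-class recall: 0.000
```

Accuracy looks excellent, yet recall on the rare class is zero: the baseline never flags a single fraudulent record. scikit-learn's DummyClassifier with strategy="most_frequent" provides the same kind of baseline if that library is available.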
Measuring Class Distribution
Before training any model, inspect the class distribution with pd.Series(y).value_counts() or, for non-negative integer labels, np.bincount(y). A useful summary statistic is the imbalance ratio: the majority class count divided by the minority class count. As a rough rule of thumb, ratios above 10:1 are usually treated as imbalanced, and ratios above 100:1 as severely imbalanced, where specialised techniques are typically needed.
import numpy as np
import pandas as pd
# Simulate imbalanced binary classification dataset
np.random.seed(0)
y = np.array([0] * 950 + [1] * 50) # 95% negative, 5% positive
counts = pd.Series(y).value_counts()
print('Class counts:\n', counts)
print()
print('Class proportions:\n', counts / len(y))
print()
print('Imbalance ratio:', counts[0] / counts[1])
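np.bincount, mentioned earlier as an alternative to value_counts(), yields the same counts without pandas. The sketch below wraps it in a small helper; the function name imbalance_ratio is hypothetical, and it assumes the labels are non-negative integers, as np.bincount requires. Taking the largest count over the smallest also covers the multi-class case.

```python
import numpy as np

def imbalance_ratio(y):
    """Return (counts, ratio), where ratio = largest class count / smallest class count.

    Assumes y holds non-negative integer labels, as np.bincount requires.
    Label values that never occur are ignored when computing the ratio.
    """
    counts = np.bincount(np.asarray(y))
    present = counts[counts > 0]  # drop label values with zero occurrences
    return counts, present.max() / present.min()

# Same simulated labels as in the lesson: 950 negatives, 50 positives
y = np.array([0] * 950 + [1] * 50)
counts, ratio = imbalance_ratio(y)

print("Counts per class:", counts)
print("Imbalance ratio:", ratio)
```

For this dataset the ratio is 950 / 50 = 19, that is 19:1, which sits above the 10:1 rule of thumb but well below 100:1.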