Machine Learning Academy · Lesson

Detecting Imbalance: Class Distribution and Baseline Pitfalls

Learners will compute class frequencies, expose the dummy-classifier trap on an imbalanced dataset, and confirm that accuracy is a misleading metric here.

What Is Class Imbalance?

Class imbalance occurs when one class dominates the dataset. In fraud detection, fraudulent transactions may be 0.1% of all records; in medical diagnosis, a rare disease may affect 1 in 1,000 patients. At that 0.1% rate, a model that always predicts the majority class reaches 99.9% accuracy while never identifying a single instance of the rare, critical class, which makes accuracy a dangerously misleading metric in these scenarios.
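The trap described above can be reproduced in a few lines. The following is a minimal sketch, not a definitive implementation: the label vector is a hypothetical toy example (1 positive in 1,000 records, matching the 0.1% fraud rate), and the "model" is simply a constant prediction of the most frequent label.

```python
import numpy as np

# Hypothetical toy labels (an assumption for illustration):
# 1 positive in 1,000 records, i.e. a 0.1% positive rate.
y_true = np.zeros(1000, dtype=int)
y_true[0] = 1

# A "model" that ignores its input and always predicts the majority class.
majority_label = np.bincount(y_true).argmax()
y_pred = np.full_like(y_true, majority_label)

# Accuracy: fraction of all predictions that match the true label.
accuracy = (y_pred == y_true).mean()

# Recall on class 1: fraction of true positives the model actually found.
positives = y_true == 1
recall = (y_pred[positives] == 1).mean()

print(f"Accuracy: {accuracy:.3f}")        # Accuracy: 0.999
print(f"Recall (class 1): {recall:.3f}")  # Recall (class 1): 0.000
```

Accuracy looks excellent while recall on the rare class is zero, which is the gap that accuracy alone hides. scikit-learn's DummyClassifier with strategy="most_frequent" produces the same kind of constant baseline if that library is available.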

Measuring Class Distribution

Before training any model, inspect the class distribution with pd.Series(y).value_counts() or, for non-negative integer labels, np.bincount(y). A useful summary statistic is the imbalance ratio: the majority class count divided by the minority class count. As a rule of thumb, ratios above 10:1 are treated as imbalanced, and ratios above 100:1 as severely imbalanced, at which point specialised techniques are usually needed.

import numpy as np
import pandas as pd

# Build an imbalanced binary label vector.
# The labels are constructed deterministically, so no random seed is needed.
y = np.array([0] * 950 + [1] * 50)  # 95% negative, 5% positive

counts = pd.Series(y).value_counts()
print('Class counts:\n', counts)
print()
print('Class proportions:\n', counts / len(y))
print()
# Use max/min so the ratio does not depend on which label is the majority
print('Imbalance ratio:', counts.max() / counts.min())  # 950 / 50 = 19.0
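The same ratio can be computed with np.bincount and mapped onto the rule-of-thumb bands mentioned earlier. This is a sketch under stated assumptions: the helper names imbalance_ratio and describe_imbalance are hypothetical, labels are assumed to be non-negative integers, and the "roughly balanced" label for ratios of 10:1 or below is an illustrative choice rather than a standard term.

```python
import numpy as np

def imbalance_ratio(labels):
    """Return majority count divided by minority count (hypothetical helper)."""
    counts = np.bincount(labels)   # requires non-negative integer labels
    counts = counts[counts > 0]    # ignore label values that never occur
    return counts.max() / counts.min()

def describe_imbalance(ratio):
    """Map a ratio onto the rule-of-thumb bands quoted in this lesson."""
    if ratio > 100:
        return "severely imbalanced"
    if ratio > 10:
        return "imbalanced"
    return "roughly balanced"      # assumed label for ratios of 10:1 or below

y = np.array([0] * 950 + [1] * 50)  # same 95% / 5% split as above
ratio = imbalance_ratio(y)
print(ratio, describe_imbalance(ratio))  # 19.0 imbalanced
```

Taking the maximum and minimum counts keeps the helper independent of which label happens to be the majority, and it extends to more than two classes without changes.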
