Machine Learning Academy · Lesson

Random Undersampling and Cluster Centroids

Learners will undersample the majority class at random and with ClusterCentroids, comparing the information loss and resulting model performance.

Undersampling: Reducing the Majority Class

While oversampling grows the minority class, undersampling shrinks the majority class to balance the dataset. The key advantage is faster training, because there is less data overall. The key risk is that you discard real majority-class information: the model sees less of the majority class's variation and may label more majority examples as minority (more false positives, when the minority class is treated as the positive class). Undersampling is most useful when the majority class is so large that oversampling would make training prohibitively slow.
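To get a feel for how much information is at stake, the arithmetic can be sketched in a few lines of plain Python. The helper name `undersampling_loss`, its `ratio` parameter, and the 950/50 class counts are all illustrative assumptions, not part of any library.

```python
def undersampling_loss(n_majority, n_minority, ratio=1.0):
    """Return (kept_majority, discarded, fraction_of_dataset_discarded).

    ratio is the target minority/majority ratio after undersampling
    (an illustrative parameter; 1.0 means fully balanced).
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    # Majority samples to keep so that n_minority / kept is about ratio
    kept = min(n_majority, int(n_minority / ratio))
    discarded = n_majority - kept
    total = n_majority + n_minority
    return kept, discarded, discarded / total


# Hypothetical counts: 950 majority and 50 minority samples
kept, discarded, frac = undersampling_loss(950, 50, ratio=1.0)
print(f"Majority kept: {kept}, discarded: {discarded} ({frac:.0%} of all data)")
# Majority kept: 50, discarded: 900 (90% of all data)
```

With a 95/5 split, balancing to 1:1 throws away roughly nine out of every ten rows, which is why a less aggressive ratio such as 0.5 is sometimes worth trying.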

Random Undersampling: Delete Majority Examples at Random

RandomUnderSampler randomly selects and removes majority-class samples until the desired class ratio is reached. It is the simplest and usually the fastest undersampling method, but because the selection ignores where samples lie in feature space, it throws away potentially useful information. With small datasets, random undersampling can make the model significantly worse by discarding majority samples near the decision boundary.

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
import numpy as np

# Synthetic binary dataset: roughly 95% majority (class 0), 5% minority (class 1)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                            n_features=10, random_state=42)

# np.bincount returns the number of samples per class label
print('Before undersampling:', np.bincount(y))

# sampling_strategy=1.0 targets a minority/majority ratio of 1.0 (equal class sizes)
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print('After undersampling:', np.bincount(y_res))
print('New total samples:', len(y_res))
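To make the mechanism concrete, here is a NumPy-only sketch of the same idea: keep a random subset of each larger class, sized to match the smallest class. The helper `random_undersample` and the synthetic 950/50 data are assumptions made for illustration; this is not imbalanced-learn's actual implementation.

```python
import numpy as np


def random_undersample(X, y, random_state=None):
    """Balance classes by keeping a random subset of every larger class.

    Illustrative helper, not imbalanced-learn's code: each class is
    cut down to the size of the smallest class, without replacement.
    """
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()

    kept_idx = []
    for cls in classes:
        cls_idx = np.flatnonzero(y == cls)
        # Sample without replacement so no row is duplicated
        kept_idx.append(rng.choice(cls_idx, size=n_keep, replace=False))

    # Sort the indices to preserve the original row order
    kept_idx = np.sort(np.concatenate(kept_idx))
    return X[kept_idx], y[kept_idx]


# Synthetic data with hypothetical counts: 950 majority (0), 50 minority (1)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 10))
y_demo = np.array([0] * 950 + [1] * 50)

X_small, y_small = random_undersample(X_demo, y_demo, random_state=42)
print('Class counts after:', np.bincount(y_small))  # [50 50]
```

Nothing in the sampling step looks at the feature values, so majority samples near the decision boundary are as likely to be dropped as any others.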
