Random Undersampling and Cluster Centroids
Learners will undersample the majority class at random and with ClusterCentroids, comparing the information loss and resulting model performance.
Undersampling: Reducing the Majority Class
While oversampling grows the minority class, undersampling shrinks the majority class to balance the dataset. The key advantage: training is faster because there is less data overall. The key risk: you discard real majority-class information, which can push the model to over-predict the minority class and produce more false positives (assuming the minority class is the positive class). Undersampling is most useful when the majority class is so large that oversampling would make training prohibitively slow.
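To make the information loss concrete, the short sketch below computes how many majority samples survive for a given target ratio. The helper `undersampling_cost` and the class counts (950 majority, 50 minority) are hypothetical, chosen only for illustration; `ratio` is assumed to mean minority count divided by majority count after resampling.

```python
def undersampling_cost(n_majority, n_minority, ratio=1.0):
    """Return (majority samples kept, majority samples discarded).

    ratio is the desired minority / majority proportion after resampling
    (hypothetical helper, not part of any library).
    """
    # Never keep more majority samples than actually exist
    n_keep = min(n_majority, int(n_minority / ratio))
    return n_keep, n_majority - n_keep


# Hypothetical class counts: 950 majority samples, 50 minority samples
kept, dropped = undersampling_cost(950, 50, ratio=1.0)
print(f"kept {kept} majority samples, discarded {dropped}")
print(f"fraction of majority class discarded: {dropped / 950:.1%}")
```

Under these assumed counts, a 1:1 balance keeps only 50 of the 950 majority samples, which is why the loss of majority-class information is the main cost to weigh against faster training.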
Random Undersampling: Delete Majority Examples at Random
RandomUnderSampler randomly selects and removes majority-class samples until the desired ratio is achieved. It is the fastest undersampling method but throws away potentially useful information. With small datasets, random undersampling can make the model significantly worse by discarding majority samples near the decision boundary.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
import numpy as np

# Synthetic binary dataset with roughly 95% majority and 5% minority samples
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           n_features=10, random_state=42)
print('Before undersampling:', np.bincount(y))

# sampling_strategy=1.0 requests a 1:1 minority-to-majority ratio after resampling
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print('After undersampling:', np.bincount(y_res))
print('New total samples:', len(y_res))