Bootstrap Aggregation (Bagging) Explained
Learners will implement bootstrap sampling by hand, train classifiers on each sample, and understand why averaging diverse models reduces variance.
What Is Bootstrap Aggregation?
Bootstrap Aggregation, commonly called Bagging, is an ensemble technique that trains multiple models on different bootstrap samples of the training data and combines their predictions. The word bootstrap comes from statistics, where it refers to resampling your data with replacement. By averaging (for regression) or voting (for classification) across many diverse models, bagging reduces the variance of the final prediction, usually with little or no increase in bias. The benefit is largest for unstable, high-variance learners such as deep decision trees.
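The train-on-resamples-then-vote idea can be sketched by hand in a few lines. This is a minimal illustration, not a production implementation: the toy 1-D dataset, the `fit_stump` helper, and the choice of 25 models are all assumptions made for this example, and a one-threshold "stump" is used only to keep the code short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data (hypothetical): the true label is 1 when x > 0,
# then 20 of the 200 labels are flipped to simulate noise.
X = rng.normal(size=200)
y = (X > 0).astype(int)
flip_idx = rng.choice(200, size=20, replace=False)
y[flip_idx] = 1 - y[flip_idx]

def fit_stump(x, labels):
    """Return the threshold t with the lowest training error
    for the rule 'predict 1 if x > t' (hypothetical helper)."""
    candidates = np.sort(x)
    errors = [np.mean((x > t).astype(int) != labels) for t in candidates]
    return candidates[int(np.argmin(errors))]

n_models = 25  # odd, so a majority vote cannot tie
thresholds = []
for _ in range(n_models):
    # Bootstrap sample: N row indices drawn with replacement
    idx = rng.integers(0, len(X), size=len(X))
    thresholds.append(fit_stump(X[idx], y[idx]))
thresholds = np.array(thresholds)

# Every model votes on every row; the majority class wins.
votes = (X[:, None] > thresholds[None, :]).astype(int)  # shape (200, 25)
bagged_pred = (votes.mean(axis=1) > 0.5).astype(int)
accuracy = np.mean(bagged_pred == y)

print("Thresholds learned (min, max):", thresholds.min(), thresholds.max())
print("Bagged training accuracy:", accuracy)
```

The spread between the smallest and largest threshold shows how much a single model depends on the particular sample it saw; the vote smooths that dependence out. With a more flexible base learner, such as a deep decision tree, the individual models would disagree more and the gain from combining them would typically be larger.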
Sampling With Replacement Explained
Sampling with replacement means every draw is made from the full dataset, so the same row can appear multiple times in a single sample while other rows are left out entirely. For a dataset of N examples, each bootstrap sample also contains N rows. A given row is missed by all N draws with probability (1 - 1/N)^N, which approaches 1/e ≈ 0.368 as N grows, so on average about 63.2% of the distinct examples appear in any given bootstrap sample. The remaining ~36.8% form the out-of-bag set, which can be used for validation.
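import numpy as np

np.random.seed(42)  # fix the seed so the sample is reproducible
data = np.arange(10)  # [0, 1, 2, ..., 9]
# Draw len(data) values with replacement, so duplicates are expected
bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
print('Original:', data)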
print('Bootstrap sample:', bootstrap_sample)
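The roughly 63% in-bag figure for sampling with replacement can be checked empirically. The sketch below is an illustration under assumed settings: the dataset size `N = 1000` and the 200 repetitions are arbitrary choices, and the variable names are made up for this example. It also shows one way to recover the out-of-bag rows for a sample.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000         # dataset size (arbitrary choice for this sketch)
n_samples = 200  # number of bootstrap samples to average over

unique_fractions = []
for _ in range(n_samples):
    idx = rng.integers(0, N, size=N)  # N row indices drawn with replacement
    in_bag = np.zeros(N, dtype=bool)
    in_bag[idx] = True                # rows that were drawn at least once
    unique_fractions.append(in_bag.mean())

mean_fraction = float(np.mean(unique_fractions))
theoretical = 1 - (1 - 1 / N) ** N    # tends to 1 - 1/e as N grows

# Out-of-bag rows for the last sample: every row that was never drawn
oob_idx = np.flatnonzero(~in_bag)

print(f"mean fraction in bag: {mean_fraction:.3f}")
print(f"theoretical value:    {theoretical:.3f}")
print(f"out-of-bag rows in last sample: {len(oob_idx)}")
```

Because a model never sees its own out-of-bag rows during training, those rows can serve as a held-out set for that model without a separate validation split.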