Machine Learning Academy · Lesson

Random Feature Selection: The Random Forest Trick

Learners will configure max_features in RandomForestClassifier, observe how feature sub-sampling decorrelates trees, and compare cross-validated accuracy across settings.

From Bagging to Random Forests

A plain bagging ensemble trains each tree on a different bootstrap sample, but every tree can still use every feature when choosing a split. If one feature is much more predictive than the rest, it tends to be chosen near the top of most trees, so the trees end up highly correlated. Averaging correlated models reduces variance far less than averaging independent ones. Random Forests add one extra trick: at every split, only a random subset of features is considered. This decorrelates the trees and often improves the ensemble's generalisation, most noticeably when a few strong features would otherwise dominate.
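
The effect of correlation on averaging can be made concrete with a standard result: if B trees each have prediction variance sigma^2 and every pair of trees has correlation rho, the variance of their average is rho * sigma^2 + (1 - rho) * sigma^2 / B. The sketch below simply evaluates that expression. The function name ensemble_variance and the example values are illustrative assumptions, not part of scikit-learn.

```python
def ensemble_variance(rho: float, sigma2: float, n_trees: int) -> float:
    """Variance of the mean of n_trees identically distributed predictions
    with per-tree variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


# Illustrative values: unit per-tree variance, several correlation levels.
for rho in (0.0, 0.3, 0.6, 0.9):
    row = ", ".join(
        f"B={b}: {ensemble_variance(rho, 1.0, b):.3f}" for b in (1, 10, 100)
    )
    print(f"rho={rho:.1f} -> {row}")
```

As B grows, the second term shrinks towards zero but the first does not, so rho * sigma^2 acts as a floor that only lower correlation between trees can reduce.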

The max_features Parameter

In RandomForestClassifier, the max_features parameter controls how many features are candidates at each split. Common choices are 'sqrt' (the square root of the total number of features, the default for classification), 'log2' (the base-2 logarithm of that number), an integer (an absolute count of features), or a float (a fraction of the total). For regression (RandomForestRegressor), the default is 1.0 (all features), which amounts to plain bagging; 'sqrt' or 0.33 are popular alternatives. Smaller values create more diverse trees at the cost of each tree being slightly weaker individually.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

# 569 samples, 30 numeric features, binary target
X, y = load_breast_cancer(return_X_y=True)

# Compare several max_features settings with 5-fold cross-validation
for mf in ['sqrt', 'log2', 0.5, 1.0]:
    rf = RandomForestClassifier(n_estimators=100, max_features=mf, random_state=42)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f'max_features={str(mf):6s}: CV accuracy={score:.4f}')
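
To see what the settings tried above mean in terms of actual feature counts, one option is to fit a small forest for each value and read back the integer that the first tree derived from it. This is a minimal sketch: it assumes the max_features_ attribute that fitted scikit-learn decision trees expose, and the resolved dictionary is just an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

resolved = {}
for mf in ['sqrt', 'log2', 0.5, 1.0]:
    # A handful of trees is enough here; only the resolved setting matters.
    rf = RandomForestClassifier(n_estimators=5, max_features=mf, random_state=42)
    rf.fit(X, y)
    # Each fitted tree stores the integer it derived from max_features.
    resolved[mf] = rf.estimators_[0].max_features_
    print(f'max_features={str(mf):6s}: {resolved[mf]:2d} of {n_features} features per split')
```

On this 30-feature dataset the four settings should resolve to 5, 4, 15, and 30 candidate features respectively, since the square root, logarithm, or fraction is truncated to an integer.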
