Feature Selection: Variance Threshold and SelectKBest
Learners will remove near-zero-variance features, then rank remaining features by univariate statistical tests and keep only the top K most informative.
Why Remove Features?
Adding more features is not always better. Irrelevant features add noise without signal, increasing the risk of overfitting, slowing training, and wasting memory. Redundant features duplicate information and can destabilise some models, for example by making linear-model coefficients unstable under multicollinearity. Feature selection identifies and removes these unhelpful inputs, leaving a compact set of informative features. The result is faster training, lower memory usage, reduced overfitting, and often better generalisation, especially for models without built-in regularisation such as KNN or Naive Bayes.
Variance Threshold: Remove Near-Constant Features
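VarianceThreshold removes features whose variance does not exceed a specified threshold. It is unsupervised: it looks only at the feature matrix X, never at the target y. A feature with zero variance is constant: it has the same value for every example and therefore provides no discriminative information. Near-constant features (very low variance) are almost as uninformative. The default, threshold=0, removes only perfectly constant features. For a binary (0/1) feature the variance is p(1 - p), so threshold=0.01 removes binary features where roughly 99% or more of the values are the same. For other feature types the variance depends on the feature's scale, so choose the threshold with the units in mind and apply it before standardising (standardised features all have variance 1). This is a fast, model-agnostic first pass at cleaning the feature set.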
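To make the link between a threshold and the data concrete, here is a minimal sketch using NumPy only. The arrays are made-up illustration data, not part of the lesson's dataset. It computes the variance of a 0/1 feature that is 99% ones, then shows that the same measurements expressed in different units have very different variances.

```python
import numpy as np

# Binary feature: 99 ones and a single zero, so p = 0.99
binary = np.array([1.0] * 99 + [0.0])
p = binary.mean()
# For a 0/1 feature the (population) variance equals p * (1 - p), about 0.0099
print("binary variance:", binary.var(), "| p * (1 - p):", p * (1 - p))

# Variance is scale-dependent: identical heights in two different units
metres = np.array([1.70, 1.75, 1.80, 1.85, 1.90])
centimetres = metres * 100
print("variance in metres:", metres.var())
print("variance in centimetres:", centimetres.var())  # 100**2 times larger
```

With threshold=0.01 the height column would be dropped when stored in metres but kept when stored in centimetres, which is why a single threshold only makes sense for features on comparable scales.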
from sklearn.feature_selection import VarianceThreshold
import numpy as np
# Create a feature matrix whose third column (index 2) is near-constant
X = np.array([
    [1, 2, 1, 5],   # third column is mostly 1
    [2, 4, 1, 3],
    [3, 6, 1, 4],
    [4, 8, 2, 7],   # the only row where the third column differs
    [5, 10, 1, 2]
], dtype=float)
# Population variance per column: 2.0, 8.0, 0.16, 2.96
print('Column variances:', X.var(axis=0).round(2))
vt = VarianceThreshold(threshold=0.5)  # drop columns with variance <= 0.5
X_filtered = vt.fit_transform(X)
print('Features kept:', vt.get_support())
print('X shape before:', X.shape, '-> after:', X_filtered.shape)
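The lesson objective also mentions ranking the remaining features with a univariate statistical test and keeping the top K. The following is a minimal sketch of that second step, not the course's own example. It assumes a synthetic classification dataset from make_classification, the ANOVA F-test (f_classif) as the scoring function, and K = 3; all of these are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration): 8 features, 3 informative
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)
# Append a constant column so the variance filter has something to remove
X = np.hstack([X, np.ones((X.shape[0], 1))])

selector = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),      # drop constant columns
    ("kbest", SelectKBest(score_func=f_classif, k=3)),   # keep the 3 highest F-scores
])
X_selected = selector.fit_transform(X, y)

print("X shape before:", X.shape, "-> after:", X_selected.shape)
print("F-scores of surviving features:",
      selector.named_steps["kbest"].scores_.round(2))
```

Unlike VarianceThreshold, SelectKBest is supervised: it needs the target y to score each feature. Running the variance filter first means the F-test is never computed on a constant column.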