Bias-Variance Trade-off: Underfitting vs Overfitting
Learners will plot training vs validation error curves, identify under- and overfitting regions, and understand the bias-variance decomposition conceptually.
The Central Tension in ML
Every machine learning model faces a tension between two competing goals: fitting the training data closely (flexibility) and generalising to new examples. Too much flexibility leads to overfitting, where the model memorises noise in the training set; too little leads to underfitting, where it misses real structure in the data. Finding the sweet spot is the core challenge of model selection and hyperparameter tuning.
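To make the tension concrete, here is a minimal sketch that fits polynomials of increasing degree and compares training error with validation error. The synthetic sin(x) data, the 50/50 split, and the degrees 1, 5 and 15 are assumptions chosen for illustration, not values prescribed by this lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic data (an assumption for illustration): y = sin(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 60).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.2, 60)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0
)

train_errors, val_errors = {}, {}
for degree in (1, 5, 15):
    # Scale x to [-1, 1] first so high-degree powers stay numerically stable
    model = make_pipeline(
        MinMaxScaler(feature_range=(-1, 1)),
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_errors[degree] = mean_squared_error(y_train, model.predict(X_train))
    val_errors[degree] = mean_squared_error(y_val, model.predict(X_val))
    print(f'degree={degree:>2}  train MSE: {train_errors[degree]:.3f}  '
          f'val MSE: {val_errors[degree]:.3f}')
```

Training error can only fall as the degree rises, because each polynomial family contains the lower-degree ones. Validation error carries no such guarantee, so the gap between the two columns is the quantity to watch: a large gap suggests overfitting, while two similarly high values suggest underfitting.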
The bias-variance trade-off gives this tension a mathematical name and framework. Understanding it is essential for diagnosing what is wrong with a model and choosing an appropriate fix.
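For squared-error loss, the framework is usually written as the decomposition below. The notation is an assumption introduced here, not taken from this lesson: the data follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the model fitted on a randomly drawn training set, and the expectation runs over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The first term measures how far the average fitted model sits from the truth, the second how much the fitted model changes from one training set to another, and the third is noise that no model can remove.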
Bias: Systematic Error from Wrong Assumptions
Bias is the error that comes from wrong assumptions in the learning algorithm. A high-bias model is too simple to capture the true relationship between features and the target — it makes systematically wrong predictions regardless of how much training data you give it.
The classic example: trying to fit a straight line to data that has a clear non-linear (curved) pattern. No matter how much data you have, the line will miss the curve. The model is underfitting — it has high bias because its linearity assumption is wrong for this problem.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Non-linear data: y = sin(x) + noise
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(X.ravel()) + np.random.randn(200) * 0.2
# High-bias model: straight line on non-linear data
linear_model = LinearRegression()
linear_model.fit(X, y)
y_pred_linear = linear_model.predict(X)
print(f'Linear model train MSE: {mean_squared_error(y, y_pred_linear):.3f}')
# High error even on training data -- high bias
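The claim that more data does not cure high bias can be checked directly. The sketch below reuses the sin(x) setup from the example above and refits the straight line on progressively larger samples; the sample sizes 50, 500 and 5000 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
noise_std = 0.2  # same noise level as the example above

train_mse = {}
for n in (50, 500, 5000):
    # Draw a fresh sample of size n from y = sin(x) + noise
    X = rng.uniform(0, 10, n).reshape(-1, 1)
    y = np.sin(X.ravel()) + rng.normal(0, noise_std, n)

    # Same high-bias model each time: a straight line
    model = LinearRegression().fit(X, y)
    train_mse[n] = mean_squared_error(y, model.predict(X))
    print(f'n={n:>5}  train MSE: {train_mse[n]:.3f}  '
          f'(noise variance: {noise_std ** 2:.3f})')
```

If the model were flexible enough, its error could approach the noise variance of 0.04. The straight line stays well above that floor at every sample size, because the remaining error comes from the linearity assumption rather than from a shortage of data.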