Cost Functions and Least Squares
Learners will calculate mean squared error, visualise the cost surface, and understand why minimising error leads to the best-fit parameters.
What Makes a Line 'Best'?
Given a scatter plot of data points, infinitely many lines could pass through or near the data. The question is: which line is the best? We need a formal mathematical definition of 'best' that we can optimise algorithmically.
The answer is a cost function (also called a loss function or objective function): a single number that measures how wrong the model's predictions are across all training examples. The best line is the one whose parameters (slope and intercept) give the lowest cost. This turns model training into a mathematical optimisation problem: search over the parameters for the values that minimise the cost.
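To make the idea concrete, here is a minimal sketch that scores a few candidate lines against a tiny made-up data set and picks the one with the lowest cost. The data values, the candidate (slope, intercept) pairs, and the helper name `cost` are illustrative assumptions, not part of the lesson's data. The cost used here is the mean of the squared errors, which the next section defines.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x + 1 with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

def cost(slope, intercept):
    """Mean of squared differences between actual y and the line's predictions."""
    predictions = slope * x + intercept
    return np.mean((y - predictions) ** 2)

# Hypothetical candidate lines as (slope, intercept) pairs
candidates = [(1.0, 2.0), (2.0, 1.0), (3.0, 0.0)]
costs = {c: cost(*c) for c in candidates}

for (slope, intercept), value in costs.items():
    print(f'slope={slope}, intercept={intercept}: cost={value:.3f}')

# The "best" candidate is simply the one with the smallest cost
best = min(costs, key=costs.get)
print('Lowest-cost candidate:', best)
```

Only three candidates are compared here. A real training procedure searches over all possible slopes and intercepts, but the selection rule is the same: lower cost means a better line.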
Mean Squared Error: The Standard Cost Function
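The most common cost function for regression is Mean Squared Error (MSE). For each training example, you compute the residual (actual minus predicted), square it, then average the squared residuals across all examples. In the formula below, yᵢ is the actual value for example i, ŷᵢ is the model's prediction for it, and n is the number of training examples: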
MSE = (1/n) × Σ(yᵢ - ŷᵢ)²
Squaring has two important effects. First, it makes every error non-negative, so positive and negative errors cannot cancel out. Second, it penalises large errors disproportionately: a residual of 10 contributes 100, not 10, to the sum before averaging, so a residual ten times larger contributes a hundred times more. As a result, MSE pushes the model to avoid large mistakes, and it is also sensitive to outliers.
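Both effects can be seen with a small, made-up pair of residual sets. In this sketch the arrays `balanced` and `lopsided` are hypothetical: they have the same total absolute error, but one spreads it evenly and the other concentrates it in a single large mistake.

```python
import numpy as np

# Hypothetical residuals with the same total absolute error (20)
balanced = np.array([10.0, -10.0])   # two medium errors of opposite sign
lopsided = np.array([1.0, -19.0])    # one small error, one large error

# Effect 1: signed errors cancel, squared errors do not
mean_signed = balanced.mean()              # 0.0, which hides both mistakes
mse_balanced = np.mean(balanced ** 2)      # (100 + 100) / 2 = 100.0

# Effect 2: the single large error dominates the squared cost
mse_lopsided = np.mean(lopsided ** 2)      # (1 + 361) / 2 = 181.0

print('Mean signed error (balanced):', mean_signed)
print('MSE (balanced):', mse_balanced)
print('MSE (lopsided):', mse_lopsided)
```

The mean absolute error is 10 in both cases, yet the MSE of the lopsided set is much higher, because the residual of 19 alone contributes 361 to the sum.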
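import numpy as np
# Actual and predicted target values, in dollars
y_actual = np.array([250000, 300000, 350000, 200000, 400000])
y_pred = np.array([240000, 320000, 330000, 210000, 380000])
# Compute MSE manually
residuals = y_actual - y_pred          # actual minus predicted
squared_residuals = residuals ** 2     # all values are now non-negative
mse = squared_residuals.mean()         # average over the n examples
print('Residuals:', residuals)
print('Squared residuals:', squared_residuals)
print(f'MSE: {mse:,.0f}')
# MSE is in squared dollars; its square root (RMSE) is back in dollars
print(f'RMSE (interpretable): ${np.sqrt(mse):,.0f}')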
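The same calculation can be wrapped in small reusable functions. In the sketch below, `mse` and `rmse` are hypothetical helper names introduced for illustration, not functions from NumPy. The inputs are converted to floats as a precaution, so that squaring large values cannot overflow an integer type.

```python
import numpy as np

def mse(y_true, y_hat):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y_true - y_hat) ** 2)

def rmse(y_true, y_hat):
    """Root mean squared error: MSE converted back to the target's units."""
    return np.sqrt(mse(y_true, y_hat))

# Same values as the manual computation above
y_actual = [250000, 300000, 350000, 200000, 400000]
y_pred = [240000, 320000, 330000, 210000, 380000]

mse_value = mse(y_actual, y_pred)
rmse_value = rmse(y_actual, y_pred)
print(f'MSE: {mse_value:,.0f} (dollars squared)')
print(f'RMSE: ${rmse_value:,.0f}')
```

For these values the MSE is 280,000,000 and the RMSE is about $16,733. The RMSE is the easier figure to interpret, since it is in the same units as the target.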