Machine Learning Academy · Lesson

Multiple Linear Regression and Feature Importance

Learners will extend the model to multiple input features, interpret each coefficient as feature importance, and evaluate with R-squared.

Beyond a Single Feature

Real prediction problems almost always have multiple relevant inputs. House price depends on square footage, number of bedrooms, location, age, and dozens of other factors. Multiple linear regression extends the single-feature line to use all available features simultaneously:

ŷ = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

Each feature xᵢ gets its own weight wᵢ, and b is the intercept. The model learns all the weights and the intercept together by minimising the mean squared error (MSE) on the training data. Adding features usually improves prediction accuracy, provided those features carry information about the target that the existing features do not already supply.
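To make the equation concrete, here is a minimal NumPy sketch of fitting the weights by least squares. The five houses and their values are made up for illustration, and the target is generated without noise so that the recovered weights can be compared with the ones used to build it. Solving with `np.linalg.lstsq` is one way to minimise squared error, not necessarily how any given library does it internally.

```python
import numpy as np

# Hypothetical toy data: 5 houses, 2 features (sqft, age). Values are illustrative only.
X = np.array([
    [1000.0, 10.0],
    [1500.0, 5.0],
    [2000.0, 20.0],
    [2500.0, 1.0],
    [3000.0, 30.0],
])
true_w = np.array([150.0, -300.0])  # assumed weights used to generate the target
true_b = 40000.0                    # assumed intercept
y = X @ true_w + true_b             # noise-free target: y = w1*x1 + w2*x2 + b

# Append a column of ones so the intercept b is learned as one extra weight
X_aug = np.column_stack([X, np.ones(len(X))])

# The least squares solution minimises the sum of squared errors (and therefore MSE)
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = solution[:-1], solution[-1]

# Predictions follow the lesson's equation in matrix form
y_hat = X @ w + b
mse = np.mean((y - y_hat) ** 2)

print('weights:', w)
print('intercept:', b)
print('MSE:', mse)
```

Because the target here contains no noise, the fitted weights should match the generating ones up to floating-point error. With noisy data they would only be close.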

import pandas as pd
import numpy as np

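# Synthetic multi-feature housing dataset (seeded for reproducibility)
np.random.seed(42)
n = 300
df = pd.DataFrame({
    'sqft': np.random.randint(600, 4000, n),       # 600 to 3999 (upper bound is exclusive)
    'bedrooms': np.random.randint(1, 7, n),        # 1 to 6
    'bathrooms': np.random.randint(1, 5, n),       # 1 to 4
    'age': np.random.randint(1, 60, n),            # 1 to 59
    'distance_to_center': np.random.uniform(1, 30, n)
})
# Price is a linear combination of the features plus Gaussian noise.
# 'bathrooms' does not appear in the formula, so its true weight is zero.
df['price'] = (150 * df['sqft'] + 8000 * df['bedrooms']
               - 300 * df['age'] - 5000 * df['distance_to_center']
               + 40000 + np.random.normal(0, 20000, n))
print(df.head())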

Training with Multiple Features

The scikit-learn API is identical for one feature or one hundred features — you simply pass the full feature matrix. The model automatically fits one coefficient per column in X.

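With multiple features, R² typically increases compared to the single-feature model, because each additional informative feature gives the model more signal to work with. On the training set, R² never decreases when a feature is added to an ordinary least squares model, so the real question is whether R² on held-out data improves too. Irrelevant or noisy features can lower test-set R² because the model starts fitting noise (overfitting), especially on small datasets.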
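The caveat about irrelevant features can be illustrated with a small synthetic experiment. This is a sketch under assumptions: the data, the sample size of 60, and the 30 pure-noise columns are all invented for the demonstration and are not part of the housing dataset. Training R² cannot go down when columns are added; test R² has no such guarantee, and how much it moves depends on the random draw.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 60  # deliberately small so that overfitting has room to show up

# One informative feature and a noisy linear target (illustrative values)
x_signal = rng.uniform(0, 10, size=(n, 1))
y = 3.0 * x_signal[:, 0] + rng.normal(0, 2.0, size=n)

# 30 columns of pure noise, unrelated to y
x_noise = rng.normal(size=(n, 30))
X_small = x_signal
X_big = np.hstack([x_signal, x_noise])

# Use the same train/test rows for both feature sets
idx_train, idx_test = train_test_split(np.arange(n), test_size=0.3, random_state=0)

def fit_and_score(X):
    """Fit OLS on the training rows and return (train R2, test R2)."""
    model = LinearRegression().fit(X[idx_train], y[idx_train])
    r2_train = r2_score(y[idx_train], model.predict(X[idx_train]))
    r2_test = r2_score(y[idx_test], model.predict(X[idx_test]))
    return r2_train, r2_test

train_small, test_small = fit_and_score(X_small)
train_big, test_big = fit_and_score(X_big)

print(f'1 feature    train R2: {train_small:.3f}  test R2: {test_small:.3f}')
print(f'31 features  train R2: {train_big:.3f}  test R2: {test_big:.3f}')
```

Comparing the two printed rows shows why R² should be judged on held-out data: the larger model always looks at least as good on the rows it was trained on.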

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# df is the housing DataFrame built in the previous code block
features = ['sqft', 'bedrooms', 'bathrooms', 'age', 'distance_to_center']
X = df[features]
y = df['price']

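# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit one coefficient per feature column, plus an intercept
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out rows
y_pred = model.predict(X_test)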
print(f'Multi-feature R2: {r2_score(y_test, y_pred):.3f}')
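Once the model is fitted, its coefficients can be inspected to see how each feature contributes. The sketch below rebuilds the lesson's synthetic dataset so it runs on its own, then compares raw coefficients with coefficients fitted on standardised features. Standardising with `StandardScaler` is an assumption added here, not a step from the lesson: raw coefficients are expressed per unit of each feature, so their magnitudes are not directly comparable across features measured on different scales.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Rebuild the lesson's synthetic housing data so this sketch is self-contained
np.random.seed(42)
n = 300
df = pd.DataFrame({
    'sqft': np.random.randint(600, 4000, n),
    'bedrooms': np.random.randint(1, 7, n),
    'bathrooms': np.random.randint(1, 5, n),
    'age': np.random.randint(1, 60, n),
    'distance_to_center': np.random.uniform(1, 30, n)
})
df['price'] = (150 * df['sqft'] + 8000 * df['bedrooms']
               - 300 * df['age'] - 5000 * df['distance_to_center']
               + 40000 + np.random.normal(0, 20000, n))

features = ['sqft', 'bedrooms', 'bathrooms', 'age', 'distance_to_center']
X, y = df[features], df['price']

# Raw coefficients: change in price per one-unit change in each feature
raw_model = LinearRegression().fit(X, y)
raw_coefs = pd.Series(raw_model.coef_, index=features)

# Standardised coefficients: change in price per one standard deviation of each feature
X_scaled = StandardScaler().fit_transform(X)
std_model = LinearRegression().fit(X_scaled, y)
std_coefs = pd.Series(std_model.coef_, index=features)

# Rank features by the absolute size of their standardised coefficient
summary = pd.DataFrame({'raw': raw_coefs, 'standardised': std_coefs})
order = summary['standardised'].abs().sort_values(ascending=False).index
summary = summary.reindex(order)
print(summary.round(1))
```

In this dataset the raw coefficient for `distance_to_center` is far larger in magnitude than the one for `sqft`, yet `sqft` varies over thousands of units while distance varies over tens. Ranking by standardised coefficients accounts for that difference in spread. Either way, coefficients are only a rough guide to importance when features are correlated with each other.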
