Machine Learning Academy · Lesson

Learning Rate: The Most Important Hyperparameter

Learners will run a learning-rate (LR) range test, plot loss against LR, identify a workable LR range, and apply a CosineAnnealingLR schedule to avoid plateaus.

Why Learning Rate Matters Most

The learning rate (LR) is usually the hyperparameter that most affects whether a neural network trains successfully. It sets the size of the step the optimizer takes along the negative gradient. Too large, and the updates overshoot and the loss diverges; too small, and training becomes impractically slow or stalls. Unlike many architecture choices, the LR usually needs re-tuning whenever you change the dataset, model size, or batch size.
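In symbols, and assuming plain SGD with no momentum or weight decay (a simplification made here, not stated in the lesson), one update with learning rate η looks like the first line below. The second and third lines apply it to a hypothetical one-dimensional quadratic loss with curvature a > 0, the simplest case in which "too large" can be made precise.

```latex
\begin{align}
\theta_{t+1} &= \theta_t - \eta \, \nabla_\theta L(\theta_t) \\
L(\theta) = \tfrac{a}{2}\theta^2 \;\Rightarrow\; \theta_{t+1} &= (1 - \eta a)\,\theta_t \\
|1 - \eta a| < 1 \;&\Longleftrightarrow\; 0 < \eta < \tfrac{2}{a}
\end{align}
```

On this toy loss the iterates shrink toward zero only when η stays below 2/a. A real network has no single curvature, so the usable range shifts with the model, the data, and the batch size, which is one way to see why the LR needs re-tuning.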

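import torch.nn as nn
import torch.optim as optim

# Three optimizers for the same toy model, differing only in learning rate.
# Creating an optimizer does not train anything; the labels describe what
# typically happens on a small, unit-scale regression problem like this one.
model = nn.Linear(1, 1)

# Too large: updates overshoot and the loss diverges
optimizer_big = optim.SGD(model.parameters(), lr=10.0)

# Too small: each update barely moves the weights
optimizer_small = optim.SGD(model.parameters(), lr=1e-6)

# Reasonable: the loss decreases steadily
optimizer_good = optim.SGD(model.parameters(), lr=0.01)

# Confirm the learning rate each optimizer was configured with
for name, opt in [('big', optimizer_big), ('small', optimizer_small), ('good', optimizer_good)]:
    print(f"{name}: lr={opt.param_groups[0]['lr']}")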
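The three labels above can be checked numerically on a problem small enough to follow by hand. The sketch below is framework-free and uses a hypothetical helper, descend, that runs gradient descent on f(w) = w**2; neither the helper nor the objective comes from the lesson. On this objective each step multiplies w by (1 - 2*lr), so the same three learning rates fall into three different regimes.

```python
def descend(lr, steps=200, w0=1.0):
    """Run plain gradient descent on f(w) = w**2 and return every visited w.

    Hypothetical helper for illustration only; not part of the lesson's code.
    """
    w = w0
    path = [w]
    for _ in range(steps):
        grad = 2 * w       # derivative of w**2
        w = w - lr * grad  # step against the gradient, scaled by the LR
        path.append(w)
    return path


# Same three learning rates as in the lesson's optimizer comparison
for lr in (10.0, 1e-6, 0.01):
    path = descend(lr)
    print(f'lr={lr:g}: |w| after {len(path) - 1} steps = {abs(path[-1]):.3e}')
```

The regimes are specific to this objective, where any learning rate below 1 converges. A value such as 0.01 is not "good" in general; it only happens to sit inside the stable range here.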

Effect of LR on Loss Curves

Different learning rates produce recognisable patterns in the loss curve. Too high: the loss oscillates wildly or starts increasing after a few steps. Too low: the loss decreases so slowly that the curve looks nearly flat. Just right: the loss decreases smoothly and consistently. It is standard practice to plot loss against training step or epoch for a few representative LR values (e.g., 1e-4, 1e-2, 1e-1, 1.0) before committing to a long training run.

import torch
import torch.nn as nn
import torch.optim as optim

def train_one_lr(lr, steps=50, seed=0):
    # Same seed for every LR, so runs differ only in the learning rate
    torch.manual_seed(seed)
    model = nn.Linear(1, 1)
    opt = optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    X = torch.randn(100, 1)
    y = 2 * X + 1  # target: slope 2, intercept 1
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses  # full loss curve, ready for plotting

for lr in [1e-4, 1e-2, 0.1, 1.0]:
    losses = train_one_lr(lr)
    print(f'LR={lr:.4f}: first_loss={losses[0]:.4f}, final_loss={losses[-1]:.4f}')
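Comparing a handful of fixed values generalises to the LR range test and the CosineAnnealingLR schedule named in the lesson objective. The sketch below is one possible version on the same toy regression, not a definitive recipe. The helpers make_problem, lr_range_test, and train_with_cosine, the blowup stopping threshold, the eta_min_ratio parameter, and the "lowest-loss LR divided by 10" rule of thumb are all assumptions introduced here; only torch.optim.SGD and torch.optim.lr_scheduler.CosineAnnealingLR are PyTorch's own.

```python
import math
import torch
import torch.nn as nn
import torch.optim as optim


def make_problem(seed=0):
    """Toy regression y = 2x + 1 with a fresh one-layer model (hypothetical helper)."""
    torch.manual_seed(seed)
    X = torch.randn(256, 1)
    y = 2 * X + 1
    return nn.Linear(1, 1), X, y


def lr_range_test(lr_min=1e-5, lr_max=10.0, steps=100, blowup=4.0):
    """Raise the LR exponentially, one full-batch step per value, logging the loss."""
    model, X, y = make_problem()
    loss_fn = nn.MSELoss()
    opt = optim.SGD(model.parameters(), lr=lr_min)
    growth = (lr_max / lr_min) ** (1 / (steps - 1))  # per-step LR multiplier
    lrs, losses, lr, best = [], [], lr_min, float('inf')
    for _ in range(steps):
        for group in opt.param_groups:
            group['lr'] = lr
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        value = loss.item()
        # Stop when the loss is non-finite or far above the best value seen so far
        if not math.isfinite(value) or value > blowup * best:
            break
        lrs.append(lr)
        losses.append(value)
        best = min(best, value)
        loss.backward()
        opt.step()
        lr *= growth
    return lrs, losses


def train_with_cosine(base_lr, steps=50, eta_min_ratio=0.01):
    """Train with SGD while CosineAnnealingLR decays the LR from base_lr toward eta_min."""
    model, X, y = make_problem()
    loss_fn = nn.MSELoss()
    opt = optim.SGD(model.parameters(), lr=base_lr)
    sched = optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=steps, eta_min=base_lr * eta_min_ratio)
    schedule, losses = [], []
    for _ in range(steps):
        schedule.append(sched.get_last_lr()[0])  # LR used for this step
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        sched.step()  # advance the schedule after the optimizer step
        losses.append(loss.item())
    return schedule, losses, sched.get_last_lr()[0]


lrs, losses = lr_range_test()
lr_at_min = lrs[losses.index(min(losses))]
suggested_lr = lr_at_min / 10  # rule-of-thumb assumption, not from the lesson
schedule, cos_losses, final_lr = train_with_cosine(suggested_lr)
print(f'range test: lowest loss at lr={lr_at_min:.3g}; suggested lr={suggested_lr:.3g}')
print(f'cosine run: lr {schedule[0]:.3g} -> {final_lr:.3g}, '
      f'loss {cos_losses[0]:.4f} -> {cos_losses[-1]:.4f}')
```

To get the loss-vs-LR plot, draw losses against lrs with a logarithmic x-axis. The flat left part marks learning rates that are too small, the falling part marks the usable range, and the point where the sweep stops marks the onset of divergence. The sweep here uses full-batch steps on synthetic data, so on a real model the same loop would normally run over mini-batches and produce a noisier curve.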
