Learning Rate: The Most Important Hyperparameter
Learners will run a learning-rate range test, plot loss vs LR, identify the optimal range, and apply a CosineAnnealingLR schedule to avoid plateaus.
Why Learning Rate Matters Most
The learning rate (LR) is the single hyperparameter that most affects whether a neural network trains successfully. It controls how large a step the optimizer takes in the direction of the negative gradient. If it is too large, the loss diverges; if it is too small, training is impractically slow or stalls on a plateau. Unlike most architecture choices, the LR usually needs retuning whenever you change the dataset, model size, or batch size.
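To make the step-size intuition concrete, here is a minimal sketch in plain Python (no PyTorch needed). It assumes a toy objective f(w) = w**2; the function name gradient_descent and the three LR values are illustrative choices, not part of the lesson's own code.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise the toy objective f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # step against the gradient, scaled by lr
    return w

# Illustrative values: one too large, one too small, one reasonable
for lr in (1.5, 1e-6, 0.1):
    print(f"lr={lr:g}: w after 20 steps = {gradient_descent(lr):.6g}")
```

On this particular objective each update multiplies w by (1 - 2 * lr), so any lr above 1 makes |w| grow, a tiny lr leaves w almost where it started, and a moderate lr shrinks it towards the minimum at 0. Real networks have no such clean threshold, which is why the LR has to be found empirically.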
import torch
import torch.nn as nn
import torch.optim as optim
model = nn.Linear(1, 1)
# Too large: diverges
optimizer_big = optim.SGD(model.parameters(), lr=10.0)
# Too small: barely moves
optimizer_small = optim.SGD(model.parameters(), lr=1e-6)
# Good: converges steadily
optimizer_good = optim.SGD(model.parameters(), lr=0.01)
print('LR comparison: 10.0, 1e-6, 0.01')
Effect of LR on Loss Curves
Different learning rates produce recognisable patterns in the loss curve. Too high: loss oscillates wildly or increases after a few steps. Too low: loss decreases extremely slowly and the curve looks nearly flat. Just right: loss decreases smoothly and consistently. Before committing to a long training run, it is standard practice to plot loss against batch or epoch for a few representative LR values (e.g., 1e-4, 1e-3, 1e-2, 1e-1).
import torch
import torch.nn as nn
import torch.optim as optim

def train_one_lr(lr, steps=50):
    """Fit y = 2x + 1 with SGD at the given LR and return the final loss."""
    model = nn.Linear(1, 1)
    opt = optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    X = torch.randn(100, 1)
    y = 2 * X + 1
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())  # loss measured before this step's update
    return losses[-1]

# Compare the final loss across a spread of learning rates
for lr in [1e-4, 1e-2, 0.1, 1.0]:
    final = train_one_lr(lr)
    print(f'LR={lr:.4f}: final_loss={final:.4f}')
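The three loss-curve patterns described above can also be checked numerically rather than by eye. The sketch below is a rough illustration in plain Python, assuming a one-weight regression on five hand-picked points; loss_curve, describe_curve, and the 0.9 flatness threshold are hypothetical helpers and values chosen for this example, not an established rule.

```python
def loss_curve(lr, steps=50):
    """Fit y = 2x with a single weight by gradient descent; return every loss."""
    xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
    ys = [2.0 * x for x in xs]
    w = 0.0
    losses = []
    for _ in range(steps):
        errors = [w * x - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / len(xs))  # mean squared error
        grad = sum(2.0 * e * x for e, x in zip(errors, xs)) / len(xs)
        w -= lr * grad
    return losses

def describe_curve(losses, flat_ratio=0.9):
    """Label a curve by comparing its last loss to its first (hypothetical heuristic)."""
    ratio = losses[-1] / losses[0]
    if ratio > 1.0:
        return "too high (loss grew)"
    if ratio > flat_ratio:
        return "too low (nearly flat)"
    return "reasonable (steady decrease)"

for lr in (1e-4, 1e-2, 0.1, 1.0):
    curve = loss_curve(lr)
    print(f"lr={lr:g}: first={curve[0]:.3g}, last={curve[-1]:.3g} -> {describe_curve(curve)}")
```

Keeping the whole list of losses, rather than only the final value, is what makes the shape of the curve visible; the same list could be passed to a plotting library to draw loss against step for each LR.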