Weight Decay vs L2 Regularization
The subtle difference that matters.
Keep Weights Small
Large weights are often a sign of overfitting. Both weight decay and L2 regularization push weights toward zero, which keeps the network simpler.
L2 Adds to the Loss
L2 regularization adds a penalty term, the sum of squared weights scaled by a coefficient lam, directly to the loss. Minimizing the loss then also means shrinking the weights.
loss = data_loss + lam * (w ** 2).sum()
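To see how the penalty shrinks the weights, here is a minimal sketch in plain Python. It assumes a toy weight vector, plain gradient descent, and a data gradient of zero so that only the penalty acts. The helper names (l2_penalty, l2_penalty_grad, sgd_step) and the values of lam and lr are made up for illustration and are not from any library.

```python
# Illustrative sketch only: helper names and values are hypothetical.

def l2_penalty(w, lam):
    """Penalty added to the loss: lam * sum of squared weights."""
    return lam * sum(wi ** 2 for wi in w)

def l2_penalty_grad(w, lam):
    """Gradient of the penalty with respect to each weight: 2 * lam * w_i."""
    return [2.0 * lam * wi for wi in w]

def sgd_step(w, data_grad, lam, lr):
    """One plain gradient-descent step on data_loss + L2 penalty."""
    pen_grad = l2_penalty_grad(w, lam)
    return [wi - lr * (g + pg) for wi, g, pg in zip(w, data_grad, pen_grad)]

w = [3.0, -2.0]          # toy weights (assumed)
lam, lr = 0.1, 0.5       # penalty strength and learning rate (assumed)
zero_grad = [0.0, 0.0]   # pretend the data loss is flat here

w_next = sgd_step(w, zero_grad, lam, lr)
print("penalty:", l2_penalty(w, lam))
print("weights after one step:", w_next)
```

With the data gradient set to zero, each weight is multiplied by (1 - 2 * lr * lam), which is 0.9 for these values, so every step pulls the weights a little closer to zero.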