Adam & AdamW Explained
Adaptive rates plus decoupled weight decay.
One Rate Per Weight
Plain SGD applies one global learning rate to every parameter. Adam instead scales the step for each weight individually, using that weight's own gradient history.
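To make the contrast concrete, here is a minimal NumPy sketch. It assumes two hypothetical weights whose gradients differ by five orders of magnitude, and it assumes each gradient has been the same at every step, so the root of the average squared gradient is simply the gradient's magnitude. The values of `lr`, `eps`, and `grads` are illustrative choices, not recommendations.

```python
import numpy as np

lr = 0.01     # illustrative global learning rate
eps = 1e-8    # small constant that avoids division by zero

# Hypothetical gradients for two weights: one very large, one very small.
grads = np.array([100.0, 0.001])

# SGD: the step is the gradient scaled by the single global rate,
# so the two weights move by wildly different amounts.
sgd_step = lr * grads

# Adam-style step under the constant-gradient assumption: each gradient is
# divided by the root of its own average squared gradient, here |g|.
adam_style_step = lr * grads / (np.sqrt(grads ** 2) + eps)

print("SGD steps: ", sgd_step)
print("Adam steps:", adam_style_step)
```

Under these assumptions the SGD steps differ by the same factor as the gradients, while both Adam-style steps come out close to `lr`. Real gradient histories fluctuate, so actual Adam steps will vary more than this idealized case suggests.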
Two Moving Averages
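Adam tracks two exponentially decaying running averages for every weight: one of the gradients and one of the squared gradients. These serve as estimates of the gradient's first moment (its mean) and second raw moment (its uncentered variance).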
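The two averages can be sketched as a single update function. This is a minimal NumPy illustration rather than a reference implementation: the function name `adam_step` and the example values are hypothetical, the default hyperparameters are the commonly used ones, and the bias-correction step (standard in Adam, though not covered in the text above) is included so that the zero-initialized averages are not underestimated early in training.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights w given gradient g at step t (t starts at 1)."""
    # First moment: running average of the gradients.
    m = beta1 * m + (1 - beta1) * g
    # Second moment: running average of the squared gradients.
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-weight step: the averaged gradient divided by its own scale.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Hypothetical weights and gradient, for illustration only.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
g = np.array([0.5, -0.25])

w, m, v = adam_step(w, g, m, v, t=1)
print(w, m, v)
```

On the first step the bias-corrected averages equal the gradient and its square, so each weight moves by roughly `lr` in the direction opposite its gradient, regardless of the gradient's size.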