Deep Learning Academy · Lesson

Adam & AdamW Explained

Adaptive rates plus decoupled weight decay.

One Rate Per Weight

Plain SGD applies the same learning rate to every parameter. Adam adapts the step size for each weight individually, based on that weight's own gradient history.
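To make the contrast concrete, here is a small sketch comparing a plain SGD step with a simplified per-weight step. This is not full Adam: the helper `per_weight_step` is a hypothetical illustration that only divides each gradient by the root-mean-square of that weight's past gradients. The gradient values and the learning rate of 0.1 are made up for the example.

```python
import numpy as np

def sgd_step(grad, lr=0.1):
    # Same learning rate for every weight: step size scales with the gradient.
    return -lr * grad

def per_weight_step(grad_history, lr=0.1, eps=1e-8):
    # Hypothetical simplified rule (not full Adam): scale each weight's latest
    # gradient by the root-mean-square of that weight's gradient history.
    grad_history = np.asarray(grad_history, dtype=float)
    rms = np.sqrt(np.mean(grad_history ** 2, axis=0))
    return -lr * grad_history[-1] / (rms + eps)

# Two weights whose gradients differ by four orders of magnitude.
history = np.array([[100.0, 0.01],
                    [100.0, 0.01],
                    [100.0, 0.01]])

sgd = sgd_step(history[-1])          # steps of very different size
adaptive = per_weight_step(history)  # steps of roughly equal size (about -lr each)
print(sgd, adaptive)
```

With a single shared rate, the first weight moves about ten thousand times farther than the second. With per-weight scaling, both weights move by roughly the learning rate.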

Two Moving Averages

Adam tracks two exponentially decaying running averages for each weight: one of the gradients and one of their squares. These serve as estimates of the gradient's first moment (its mean) and second moment (its uncentered variance).
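The two averages can be sketched as a single update step in NumPy. This is a minimal illustration, not a reference implementation: the function name `adam_step` and its argument names are chosen here for the example, the default hyperparameters follow the values commonly used for Adam (`lr=1e-3`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`), and the bias-correction step is part of standard Adam even though the text above does not cover it.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `t` is the 1-based step count."""
    # First moment: running average of the gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: running average of the squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both averages start at zero, so early values are
    # scaled up to compensate (standard Adam, assumed here).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-weight step: each weight is divided by its own sqrt(v_hat).
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: two weights with very different gradient scales (made-up values).
w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
grad = np.array([100.0, 0.01])
w, m, v = adam_step(w, grad, m, v, t=1)
print(w)
```

On the very first step the corrected averages reduce to the gradient and its square, so each weight moves by about `lr` in the direction opposite its gradient, regardless of the gradient's magnitude.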
