Vanishing & Exploding Gradients
What breaks deep nets and early fixes.
Gradients Are a Product
Through many layers, backprop multiplies many local derivatives together. The size of that long product decides whether learning works or breaks.
Multiplying Small Numbers
If each link's derivative is below one, the product shrinks fast. After many layers the gradient becomes tiny, almost zero by the time it reaches early layers.
All lessons in this course
- The Chain Rule, Layer by Layer
- Forward Caches, Backward Reuses
- Backprop a Tiny Net by Hand
- Vanishing & Exploding Gradients