Gradient Accumulation for Big Batches
Simulate large batches on small GPUs.
The Big-Batch Problem
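Large batches give less noisy gradient estimates, so training often runs more smoothly. They also need more GPU memory, because the activations stored for the backward pass grow with batch size. A small card cannot hold a very large batch at once.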
The Core Trick
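Gradient accumulation splits one big batch into small chunks, often called micro-batches. You run the forward and backward pass on each chunk, add up the gradients, and update the weights once at the end. When the loss is a mean over samples, weight each chunk by its share of the full batch (1/k for k equal chunks) so the accumulated gradient matches what the whole batch would have produced in a single pass.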
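The equivalence can be checked numerically. The following is a minimal sketch in plain Python, assuming a hypothetical one-parameter model y = w * x trained with mean squared error. The function names and the toy data are illustrative and do not come from any library. Because the loss is a mean, each chunk's gradient is weighted by the chunk's share of the full batch before it is added.

```python
# Minimal sketch: gradient accumulation on a toy model y = w * x with MSE loss.
# All names and data below are illustrative assumptions, not a library API.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over the given samples."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum micro-batch gradients, each weighted by its share of the full batch."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        chunk_x = xs[start:start + micro_batch_size]
        chunk_y = ys[start:start + micro_batch_size]
        # Weighting by len(chunk) / n also handles an uneven final chunk.
        total += grad_mse(w, chunk_x, chunk_y) * (len(chunk_x) / n)
    return total


# Toy "big batch" of 8 samples.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
w = 0.5
lr = 0.01

# Gradient from the whole batch at once.
full = grad_mse(w, xs, ys)

# Gradient accumulated over 4 chunks of 2 samples each.
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)

# One weight update per big batch in both cases.
w_full = w - lr * full
w_accum = w - lr * accum

print(abs(full - accum) < 1e-9)
```

In a real training loop the chunks are processed one at a time, so only one chunk's activations need to fit in memory, while the single weight update still reflects the whole batch.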