Data vs Model Parallelism
Two ways to split the work.
Why Scale Out at All
One GPU is fine until the model no longer fits in its memory or the dataset makes each epoch take too long. Distributed training spreads the work across many GPUs to cut wall-clock training time.
Two Ways to Split
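There are two core strategies. Data parallelism keeps a full copy of the model on every GPU and splits each batch across them. Model parallelism splits the model itself across GPUs. Each solves a different bottleneck: data parallelism shortens training time, while model parallelism handles a model too large for one GPU's memory.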
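To make the two splits concrete, here is a toy sketch in plain Python. It uses no GPUs and no PyTorch; the "workers" are ordinary loop iterations, and every name in it (grad_mse, data_parallel_grad, model_parallel_forward) is hypothetical and chosen only for illustration. It assumes a one-parameter model y = w * x and a batch that divides evenly among the workers.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_workers):
    """Data parallelism: every worker has the full model (w) but only a slice of the batch."""
    shard = len(xs) // num_workers  # assumption: batch divides evenly
    local_grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    # Averaging the local gradients stands in for the all-reduce step.
    return sum(local_grads) / num_workers


def model_parallel_forward(x, stages):
    """Model parallelism: each worker owns one piece of the model; activations pass between them."""
    for stage in stages:
        x = stage(x)
    return x


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = grad_mse(0.5, xs, ys)                               # one worker, whole batch
sharded = data_parallel_grad(0.5, xs, ys, num_workers=2)   # two workers, half a batch each

# "Layer 1" lives on worker 0, "layer 2" on worker 1.
stages = [lambda v: 3.0 * v, lambda v: v + 1.0]
out = model_parallel_forward(2.0, stages)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why data parallelism can speed training up without changing the update. In the model-parallel half, no single worker ever holds the whole model, which is the point when the model is too large for one device.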