Multi-GPU Training with DataParallel
nn.DataParallel, GPU memory balancing, bandwidth bottlenecks, when DDP is better.
Why Multiple GPUs
Modern models and batches outgrow a single GPU's memory and compute. Using multiple GPUs lets you train larger models or process bigger batches in less wall-clock time. PyTorch offers several ways to do this; the simplest is DataParallel.
Data Parallelism
Data parallelism replicates the model on every GPU and splits each input batch across them. Each GPU runs the forward and backward pass on its own shard; the resulting gradients are then combined so that every replica applies the same update and stays in sync.
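To make the replicate, split, and combine steps concrete, here is a minimal sketch that simulates data parallelism in plain Python, with no GPUs and no PyTorch involved. The toy linear model, the mean squared error loss, and the helper names (mse_grads, split_batch, data_parallel_grads) are all assumptions made for illustration; they are not part of any library. The sketch only shows that combining per-shard gradients, weighted by shard size, reproduces the gradient of the full batch.

```python
# Toy simulation of data parallelism in plain Python (no GPUs, no PyTorch).
# Hypothetical model: y_hat = w * x + b, with mean squared error as the loss.

def mse_grads(w, b, xs, ys):
    """Return (dL/dw, dL/db) of the mean squared error over one batch or shard."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


def split_batch(xs, ys, num_devices):
    """Split a batch into contiguous shards, one per simulated device."""
    shard_size = -(-len(xs) // num_devices)  # ceiling division
    return [
        (xs[i:i + shard_size], ys[i:i + shard_size])
        for i in range(0, len(xs), shard_size)
    ]


def data_parallel_grads(w, b, xs, ys, num_devices):
    """Compute per-shard gradients, then combine them weighted by shard size."""
    total = len(xs)
    grad_w = 0.0
    grad_b = 0.0
    for shard_xs, shard_ys in split_batch(xs, ys, num_devices):
        # Every simulated device works with the same (w, b) replica.
        gw, gb = mse_grads(w, b, shard_xs, shard_ys)
        weight = len(shard_xs) / total
        grad_w += weight * gw
        grad_b += weight * gb
    return grad_w, grad_b


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
w, b = 0.5, 0.0

full = mse_grads(w, b, xs, ys)                               # single-device gradient
sharded = data_parallel_grads(w, b, xs, ys, num_devices=2)   # two simulated devices
print("full batch:", full)
print("sharded:   ", sharded)
```

The two results agree up to floating-point rounding, which is why every replica can apply the same update after the gradients are combined. The weighting by shard size matters when the batch does not divide evenly across devices; with equal shards it reduces to a plain average.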