Mixed Precision: FP16, BF16, TF32

Trading precision for throughput.

Trading Bits for Speed

Tensor Cores get much of their speed from mixed precision: they take inputs in smaller number formats such as FP16, BF16, or TF32, so the hardware can push far more math per cycle. ⚡

Standard FP32 uses 32 bits (4 bytes) per value: 1 sign bit, 8 exponent bits, and 23 mantissa bits. It is accurate but heavy, so moving and multiplying many FP32 numbers costs memory bandwidth and compute time.
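
To put a number on "heavy", here is a minimal sketch in plain Python (standard library only, no GPU needed) that compares the storage cost of a matrix in FP32 with the same matrix in a 16-bit format. The helper `matrix_bytes` and the 4096 x 4096 size are assumptions chosen for illustration, not part of any CUDA API.

```python
import struct

# Bytes per value, as reported by Python's struct module:
# "f" is IEEE 754 single precision (FP32), "e" is half precision (FP16).
FP32_BYTES = struct.calcsize("<f")  # 4 bytes = 32 bits
FP16_BYTES = struct.calcsize("<e")  # 2 bytes = 16 bits


def matrix_bytes(rows: int, cols: int, bytes_per_value: int) -> int:
    """Memory needed to store a dense rows x cols matrix (hypothetical helper)."""
    return rows * cols * bytes_per_value


if __name__ == "__main__":
    n = 4096  # illustrative matrix size, an assumption for this sketch
    fp32 = matrix_bytes(n, n, FP32_BYTES)
    fp16 = matrix_bytes(n, n, FP16_BYTES)
    print(f"FP32: {fp32 / 2**20:.0f} MiB")  # FP32: 64 MiB
    print(f"FP16: {fp16 / 2**20:.0f} MiB")  # FP16: 32 MiB
```

Halving the bits per value halves the bytes that have to travel between memory and the compute units, which is the bandwidth side of the trade described above. The sketch only counts storage; it says nothing about how much faster the math itself runs.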