Quantization for Smaller, Faster Models
Shrink weights with int8 inference.
Smaller Weights, Faster Models
Large models are slow and memory-hungry to serve. Quantization shrinks them by storing each weight with fewer bits, which reduces memory use and typically speeds up inference. 📉
Float32 vs Int8
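Models usually store weights as 32-bit floats (4 bytes each). Quantization converts them to 8-bit integers (1 byte each), cutting weight storage to roughly a quarter of its original size, usually in exchange for a small loss of precision.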
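To make the float32-to-int8 idea concrete, here is a minimal sketch using only the Python standard library. It assumes a simple symmetric, per-tensor scheme (one scale factor for all weights); real frameworks may use different schemes, and the helper names `quantize_int8` and `dequantize` and the sample weights are made up for illustration.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization (an assumed, simplified scheme).

    Maps floats onto the integer range [-127, 127] using a single scale.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]


# Hypothetical weights, stored as 32-bit floats (typecode "f").
weights = array("f", [0.42, -1.27, 0.05, 0.9, -0.33, 1.1, -0.76, 0.0])

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

fp32_bytes = len(weights) * weights.itemsize
int8_bytes = len(q_weights) * q_weights.itemsize
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"float32: {fp32_bytes} bytes, int8: {int8_bytes} bytes")
print(f"scale: {scale:.6f}, max round-trip error: {max_error:.6f}")
```

The int8 copy takes one byte per weight instead of four, and the round-trip error stays within half a quantization step (`scale / 2`). The scale factor itself must be stored alongside the integers, which is one reason the saving is "roughly" rather than exactly four times.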