Deep Learning Academy · Lesson

Quantization for Smaller, Faster Models

Shrink weights with int8 inference.

Smaller Weights, Faster Models

Big models are slow and memory-hungry to serve. Quantization shrinks them by storing each number with fewer bits, which cuts memory use and can speed up inference. 📉

Models usually store weights as 32-bit floats. Quantization converts them to 8-bit integers (int8), cutting weight storage to roughly a quarter of its original size.
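To make the float-to-int8 conversion concrete, here is a minimal NumPy sketch. It assumes one simple scheme, symmetric per-tensor quantization with a single scale factor; the lesson does not say which scheme it has in mind, and real frameworks offer several. The helper names quantize_int8 and dequantize_int8 are made up for illustration and are not part of any library.

```python
import numpy as np

def quantize_int8(weights):
    """Hypothetical helper: symmetric per-tensor quantization, float32 -> int8."""
    max_abs = float(np.max(np.abs(weights)))
    # One scale for the whole tensor maps [-max_abs, max_abs] onto [-127, 127].
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Hypothetical helper: recover approximate float32 weights."""
    return q.astype(np.float32) * scale

# Random stand-in weights (an assumption; a real model's weights would be loaded).
rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

size_ratio = weights.nbytes / q.nbytes                  # 4 bytes vs 1 byte per weight
max_error = float(np.max(np.abs(weights - restored)))   # rounding error, about scale / 2 at most

print(f"size ratio: {size_ratio:.1f}x, max error: {max_error:.5f}")
```

The int8 array takes one byte per weight instead of four, which is where the fourfold saving comes from. The price is a small rounding error on each weight, bounded by about half the scale.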