Quantisation and Speculative Decoding
For self-hosted models: INT8/INT4 quantisation to cut memory, speculative decoding to speed up generation.
Self-Hosted Model Optimisations
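Two techniques give the biggest speed and cost wins for self-hosted open models:
- Quantisation: smaller weights, less RAM/VRAM, and often faster inference
- Speculative decoding: a small draft model proposes several tokens and the large model verifies them in a single step, so each step can yield more than one token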
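To make the speculative decoding idea concrete, here is a minimal sketch of the greedy variant. The "models" are toy functions that map a token list to the next token; `toy_target`, `toy_draft`, and the parameter `k` (tokens proposed per round) are illustrative assumptions, not part of any real library. A real implementation would verify all proposed positions in one batched forward pass of the large model; the sketch loops over them and counts that as a single target step.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target step per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, number of target steps)."""
    out = list(prompt)
    target_steps = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens, one at a time.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target model checks every proposed position.
        #    (One batched forward pass in a real system, so one step here.)
        target_steps += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # fix the first mismatch, drop the rest
                break
        else:
            # Every draft token matched: the same pass also gives a bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[:len(prompt) + max_new], target_steps


# Hypothetical toy models: the target counts upward modulo 10;
# the draft agrees except immediately after a 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, steps = speculative_decode(toy_target, toy_draft, [2], max_new=12)
    print(tokens, "target steps:", steps)
```

With greedy verification the output is identical to what the target model would produce on its own; the gain is that the expensive model runs fewer times. How much you gain depends on how often the draft model agrees with the target.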
Quantisation Basics
Model weights are normally stored in a 16-bit format such as FP16 (16 bits per weight). Quantisation stores each weight in fewer bits:
- FP16: baseline
- INT8: 2x smaller, usually negligible quality loss
- INT4 / Q4_K_M: roughly 4x smaller, small quality loss
- INT2 / 1.58-bit: 8x or more smaller, noticeable quality loss
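The size ratios above follow from simple arithmetic: weight memory is roughly parameters times bits per weight, divided by 8 to get bytes. The sketch below works this out for a hypothetical 7-billion-parameter model (the 7B figure is an assumption chosen for illustration) and then shows a basic symmetric "absmax" INT8 round trip. The function names are made up for this example, and production formats such as Q4_K_M use more elaborate per-block schemes whose effective bits per weight sit somewhat above the nominal value.

```python
from typing import List, Tuple


def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores the KV cache, activations, and any per-block scale overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9


def quantise_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    return [round(w / scale) for w in weights], scale


def dequantise(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
        print(f"{name}: {weight_memory_gb(n_params, bits):.2f} GB")

    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantise_int8(weights)
    restored = dequantise(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("quantised:", q, "worst-case error:", worst)
```

The round trip shows where the quality loss comes from: each weight is snapped to the nearest multiple of `scale`, so the error per weight is at most half a step. Fewer bits mean fewer steps and therefore larger rounding errors.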