Quantisation and Speculative Decoding

For self-hosted models: INT8/INT4 quantisation for memory, speculative decoding for throughput.

Self-Hosted Model Optimisations

For self-hosted open models, two big speed/cost wins:

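Quantisation: lower-precision weights, less RAM/VRAM, and often faster inference
Speculative decoding: a small draft model proposes several tokens and the large model verifies them together, so one step can emit more than one token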
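
To make the speculative decoding idea concrete, here is a minimal sketch of the greedy variant. It assumes two toy "models" (`toy_target` and `toy_draft`, both hypothetical stand-ins written for this example) that map a token context to the next token. A real system would verify all drafted positions in a single batched forward pass of the large model and, when sampling, would use a probabilistic accept/reject rule rather than exact matching; the loop below only simulates the verification.

```python
from typing import Callable, List, Tuple

# A "model" here is any function from a token context to the next token (greedy).
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """Return the tokens accepted in one speculative step."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position.
    #    (Simulated with a loop; in practice this is one batched forward pass.)
    accepted: List[int] = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target's pass yields one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate tokens speculatively; return (tokens, number of target steps)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
        steps += 1
    return out[: len(prompt) + max_new_tokens], steps


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except right after the token 5.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, steps = generate(toy_target, toy_draft, [0], max_new_tokens=12, k=4)
    print(tokens, "in", steps, "target steps")
```

In this greedy form the output is identical to what the target model would produce on its own; the gain comes from needing fewer target steps whenever the draft model guesses correctly.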

Quantisation Basics

Model weights are typically stored in a 16-bit format such as FP16 (16 bits per weight). Quantisation stores each weight in fewer bits:

FP16: baseline
INT8: 2x smaller, near-negligible quality loss
INT4 / Q4_K_M: roughly 4x smaller, small quality loss
INT2 / 1.58-bit: 8x or more smaller, noticeable quality loss
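
The size ratios above follow from simple arithmetic, which the sketch below works through for a hypothetical 7-billion-parameter model. The bit widths in `BITS_PER_WEIGHT` are idealised assumptions: real formats such as Q4_K_M also store scale factors, so actual files come out somewhat larger. The `quantise_int8` helper illustrates one simple scheme (symmetric, per-tensor), not the method any particular library uses.

```python
from typing import Dict, List, Tuple

# Idealised bits per weight (assumption: ignores per-block scales and metadata).
BITS_PER_WEIGHT: Dict[str, int] = {"FP16": 16, "INT8": 8, "INT4": 4, "INT2": 2}


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantise_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantise_int8(quantised: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantised]


if __name__ == "__main__":
    params = 7e9  # hypothetical model size
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: {weight_memory_gb(params, bits):.2f} GB")

    original = [0.8, -1.0, 0.3, 0.0]
    quantised, scale = quantise_int8(original)
    restored = dequantise_int8(quantised, scale)
    # Each restored weight differs from the original by at most scale / 2.
    print(quantised, restored)
```

The rounding error per weight is bounded by half the scale, which is why INT8 usually costs little quality while formats with far fewer levels lose more.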
