Quantisation and Speculative Decoding

For self-hosted models: INT8/INT4 quantisation to cut memory use, speculative decoding for faster token generation.

Self-Hosted Model Optimisations

For self-hosted open models, two big speed/cost wins:

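  1. Quantisation: store weights in fewer bits, so the model needs less RAM/VRAM and, because decoding is usually limited by memory bandwidth, often runs faster
  2. Speculative decoding: a small draft model proposes several tokens and the large model verifies them in a single forward pass, so each step can emit more than one token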
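
To make the second point concrete, here is a minimal sketch of the greedy variant of speculative decoding. It assumes both models are plain functions that map a token prefix to the next token; `toy_target`, `toy_draft`, `speculative_step` and `generate` are hypothetical names for illustration, not any library's API. Real implementations verify all draft positions in one batched forward pass of the large model and, when sampling, use a probabilistic accept/reject rule instead of the exact-match check shown here.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one speculative step and return the accepted tokens (1 to k+1)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position. A real system scores
    #    all positions in one forward pass; this loop is for clarity only.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target's pass yields a bonus token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Generate tokens by repeating speculative steps until the budget is hit."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new_tokens]


# Toy stand-ins for real models (assumptions for this demo only).
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10           # always counts upward


def toy_draft(prefix: List[int]) -> int:
    # Agrees with the target except right after token 4.
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], max_new_tokens=8))
```

With exact-match verification the output is identical to what the target model would produce on its own; the draft model only changes how many target passes are needed, so the gain depends on how often the draft agrees with the target.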

Quantisation Basics

Model weights are typically stored in a 16-bit format such as FP16 (16 bits per weight). Quantisation stores them in fewer bits:

  • FP16: baseline
  • INT8: 2x smaller, usually negligible quality loss
  • INT4 / Q4_K_M: roughly 4x smaller, small quality loss
  • INT2 / 1.58-bit: 8x or more smaller, noticeable quality loss
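
The size ratios above follow from simple arithmetic, sketched below together with a basic symmetric INT8 round trip. The function names and the 7-billion-parameter model size are assumptions chosen for illustration. The estimate covers weights only (no KV cache or activations), and real formats such as Q4_K_M also store per-block scales, so their effective bits per weight are somewhat above the nominal figure.

```python
from typing import List, Tuple


def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor INT8 quantisation: w is approximated by q * scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map INT8 codes back to approximate float weights."""
    return [v * scale for v in q]


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
        print(f"{name}: {weight_memory_gb(n_params, bits):.2f} GB")

    weights = [0.5, -1.0, 0.25, 0.013]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"scale={scale:.6f}, worst round-trip error={worst:.6f}")
```

The round-trip error per weight is bounded by half the scale, which is why a single outlier weight (which inflates the scale) hurts accuracy and why practical schemes quantise in small blocks rather than per tensor.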
