Speculative Decoding: 2–3× Faster LLM Inference Without Quality Loss
Large language models are notoriously slow at inference. Each token requires a full forward pass through billions of parameters, and autoregressive generation means you pay that cost one token at a time. But what if you could generate multiple tokens in a single forward pass while guaranteeing the exact same output as the original model?
That's the promise of speculative decoding, an inference acceleration technique that has moved quickly from research papers into production frameworks like vLLM and Transformers. In this tutorial, we'll break down the algorithm, implement it from scratch, and show you how to deploy it with real models.
How Speculative Decoding Works
The core idea is elegant: use a small, fast draft model to guess the next few tokens, then use the large target model to verify all of them in parallel. If the drafts are accepted, you get multiple tokens for the price of one verification pass. If a draft token is rejected, you discard it along with every draft token after it and sample a corrected token using the target model's probabilities. Either way, the output distribution is mathematically identical to sampling from the target model alone.
The algorithm has two phases:
Phase 1: Draft Generation
The draft model generates γ (gamma) candidate tokens autoregressively from the current context:
x_1, x_2, ..., x_γ ~ P_draft(· | context)
Phase 2: Parallel Verification
The target model evaluates the probability of all draft tokens in a single forward pass:
P_target(x_1 | context), P_target(x_2 | context, x_1), ..., P_target(x_γ | context, x_1, ..., x_{γ-1})
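One pass is enough because a causal language model returns logits for every position in its input, and row t of those logits is the prediction for token t + 1. The sketch below only does the index bookkeeping; the helper name `verification_positions` is ours and purely illustrative, and it assumes 0-indexed positions with the draft tokens appended directly after the context.

```python
def verification_positions(context_len, gamma):
    """Map each draft token x_i to the logits row that scores it (0-indexed)."""
    rows = []
    for i in range(1, gamma + 1):
        token_index = context_len + i - 1  # where x_i sits in context + drafts
        logits_row = token_index - 1       # row t predicts the token at index t + 1
        rows.append((i, token_index, logits_row))
    # The final row is the target's distribution for the token after x_gamma
    bonus_row = context_len + gamma - 1
    return rows, bonus_row


rows, bonus_row = verification_positions(context_len=10, gamma=4)
for i, token_index, logits_row in rows:
    print(f"x_{i}: token index {token_index}, scored by logits row {logits_row}")
print("extra token can be sampled from logits row", bonus_row)
```

So with γ drafts, the last γ + 1 rows of the target's logits are the ones that matter: γ rows to check the drafts and one more that comes for free.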
For each position i, accept the draft token with probability:
min(1, P_target(x_i) / P_draft(x_i))
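If rejected at position i, discard x_i and every draft token after it, then sample a replacement token from the residual distribution: norm(max(0, P_target - P_draft)). If all γ draft tokens are accepted, you can sample one extra token from the target's distribution at the next position, since the same forward pass has already computed it.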
This acceptance rule guarantees the output matches the target model's distribution exactly. No approximation, no quality loss.
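You can check that guarantee on a toy vocabulary. The sketch below computes, with exact fractions, the distribution of the token produced by a single accept/reject step. The function name and the two three-token distributions are made up for illustration.

```python
from fractions import Fraction


def speculative_output_distribution(p_target, p_draft):
    """Exact distribution of the token emitted by one accept/reject step."""
    # P(draft proposes x and it is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x))
    accepted = [min(p, q) for p, q in zip(p_target, p_draft)]
    reject_mass = 1 - sum(accepted)
    # Residual distribution norm(max(0, p - q)), used only after a rejection
    residual = [max(Fraction(0), p - q) for p, q in zip(p_target, p_draft)]
    total = sum(residual)
    if total == 0:  # draft equals target, so nothing is ever rejected
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


p_target = [Fraction(5, 10), Fraction(3, 10), Fraction(2, 10)]
p_draft = [Fraction(2, 10), Fraction(6, 10), Fraction(2, 10)]
out = speculative_output_distribution(p_target, p_draft)
print(out == p_target)  # prints True
```

The draft here badly overweights the second token, yet the emitted distribution still equals the target exactly. The rejected mass (0.3 in this example) is precisely what the residual distribution hands back to the tokens the draft underweighted.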
Building a Speculative Decoder from Scratch
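Let's implement this with PyTorch and Hugging Face Transformers. We'll use Llama 3 8B Instruct as the target and Llama 3.2 1B Instruct as the draft. The two share a tokenizer, which the acceptance rule needs because it compares the two models' probabilities token by token.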
Step 1: Load the Models
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target model (large, slow, high quality)
target_name = "meta-llama/Meta-Llama-3-8B-Instruct"
target_model = AutoModelForCausalLM.from_pretrained(
    target_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Draft model (small, fast, lower quality)
draft_name = "meta-llama/Llama-3.2-1B-Instruct"
draft_model = AutoModelForCausalLM.from_pretrained(
    draft_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Both models share this tokenizer
tokenizer = AutoTokenizer.from_pretrained(target_name)
tokenizer.pad_token = tokenizer.eos_token
Step 2: Core Speculative Decoding Loop
import torch
import torch.nn.functional as F


def _to_probs(logits, temperature, top_p):
    """Turn logits into a sampling distribution (temperature + nucleus filtering)."""
    probs = F.softmax(logits.float() / temperature, dim=-1)
    if top_p < 1.0:
        sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
        # Drop a token once the mass ranked ahead of it already exceeds top_p
        remove = sorted_probs.cumsum(dim=-1) - sorted_probs > top_p
        sorted_probs = sorted_probs.masked_fill(remove, 0.0)
        probs = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)
    return probs


def speculative_decode(target_model, draft_model, input_ids,
                       tokenizer, gamma=4, max_new_tokens=100,
                       temperature=0.8, top_p=0.9):
    """
    Speculative decoding with parallel verification.
    Assumes batch size 1 and both models on the same device.

    Args:
        gamma: Number of draft tokens to generate per verification step
        max_new_tokens: Maximum tokens to generate
        temperature: Sampling temperature (must be > 0)
        top_p: Nucleus sampling threshold
    """
    generated_tokens = input_ids.clone()
    prompt_len = input_ids.shape[1]

    with torch.no_grad():
        while generated_tokens.shape[1] - prompt_len < max_new_tokens:
            # --- PHASE 1: Draft Generation ---
            draft_tokens, draft_probs_cache = [], []
            draft_input = generated_tokens.clone()
            for _ in range(gamma):
                draft_logits = draft_model(draft_input).logits[:, -1, :]
                draft_probs = _to_probs(draft_logits, temperature, top_p)
                draft_token = torch.multinomial(draft_probs, num_samples=1)
                draft_tokens.append(draft_token)
                draft_probs_cache.append(draft_probs)  # reused during verification
                draft_input = torch.cat([draft_input, draft_token], dim=1)

            # --- PHASE 2: Parallel Verification ---
            # One target pass over context + drafts. The last gamma + 1 rows
            # score the gamma draft tokens plus one extra position.
            target_logits = target_model(draft_input).logits[:, -gamma - 1:, :]
            target_probs = _to_probs(target_logits, temperature, top_p)

            new_tokens = []
            for i in range(gamma):
                target_pos_probs = target_probs[:, i]
                draft_pos_probs = draft_probs_cache[i]
                draft_token_id = draft_tokens[i].item()
                p_target = target_pos_probs[0, draft_token_id].item()
                p_draft = draft_pos_probs[0, draft_token_id].item()

                accept_prob = min(1.0, p_target / (p_draft + 1e-10))
                if torch.rand(1).item() < accept_prob:
                    new_tokens.append(draft_tokens[i])
                else:
                    # Sample from residual distribution, then drop later drafts
                    residual = torch.clamp(target_pos_probs - draft_pos_probs, min=0)
                    residual = residual / (residual.sum() + 1e-10)
                    new_tokens.append(torch.multinomial(residual, num_samples=1))
                    break
            else:
                # Every draft was accepted: take one extra token from the target
                new_tokens.append(torch.multinomial(target_probs[:, gamma], num_samples=1))

            # Append this step's tokens, cutting off anything after end-of-sequence
            step = torch.cat(new_tokens, dim=1)
            eos_positions = (step[0] == tokenizer.eos_token_id).nonzero()
            if eos_positions.numel() > 0:
                step = step[:, :eos_positions[0].item() + 1]
            generated_tokens = torch.cat([generated_tokens, step], dim=1)
            if eos_positions.numel() > 0:
                break

    # The final step can overshoot max_new_tokens; trim the excess
    return generated_tokens[:, :prompt_len + max_new_tokens]
Step 3: The Real Implementation — Using vLLM
In production, you shouldn't roll your own. vLLM has native speculative decoding support that handles KV cache management, batched requests, and GPU memory optimization:
from vllm import LLM, SamplingParams

# Note: these speculative-decoding arguments follow older vLLM releases.
# Newer releases take a speculative_config dict instead, so check the docs
# for the version you have installed.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=4,  # gamma
    max_model_len=4096,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=200
)

output = llm.generate(
    ["Explain how attention mechanisms work in transformers"],
    sampling_params
)
print(output[0].outputs[0].text)
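That's it. vLLM handles the draft generation, parallel verification, and residual sampling internally. With a well-matched draft model and small batch sizes, a 2–3× speedup is realistic, and the output distribution is unchanged. At large batch sizes the GPU is already compute-bound, so expect smaller gains.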
When Does Speculative Decoding Shine?
The speedup depends on the acceptance rate — how often the draft model guesses correctly. Higher acceptance rates mean more tokens per verification pass.
| Scenario | Expected Acceptance Rate | Speedup |
|---|---|---|
| Coding tasks (structured syntax) | 70–85% | 2.5–3.0× |
| Technical documentation | 60–75% | 2.0–2.5× |
| Creative writing | 40–60% | 1.5–2.0× |
| Mathematical reasoning | 30–50% | 1.3–1.8× |
Best results come when the draft model is a smaller version of the same architecture family (e.g., Llama 1B drafting for Llama 8B). Cross-family pairs (e.g., Phi drafting for Mistral) see lower acceptance rates.
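To get a feel for how acceptance rate turns into speedup, here is a rough estimator. It is a simplified model rather than a benchmark: it assumes each draft token is accepted independently with the same probability `alpha`, and that one draft forward pass costs a fixed fraction `cost_ratio` of one target pass. Both parameter names and the example values are our assumptions.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per verification pass (accepted drafts + 1)."""
    if alpha == 1.0:
        return gamma + 1
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha, gamma, cost_ratio):
    """Tokens per step divided by the step's cost in units of target passes."""
    step_cost = gamma * cost_ratio + 1  # gamma draft passes + 1 target pass
    return expected_tokens_per_step(alpha, gamma) / step_cost


for alpha in (0.5, 0.7, 0.9):
    speedup = estimated_speedup(alpha, gamma=4, cost_ratio=0.1)
    print(f"alpha={alpha:.1f}  estimated speedup={speedup:.2f}x")
```

The model also shows why a larger γ is not automatically better. Each extra draft token adds a fixed cost, but the chance of reaching it shrinks geometrically with the acceptance rate.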
Advanced: EAGLE-2 — The Next Generation
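Recent research has pushed beyond standalone draft models. EAGLE-2 (EAGLE stands for Extrapolation Algorithm for Greater Language-model Efficiency) replaces the separate draft model with a lightweight draft head attached to the target model that:
- Uses the target model's internal features (not just output tokens) to predict future tokens
- Still drafts autoregressively, but at the feature level through a single small layer, so each draft step is far cheaper than a forward pass through a separate small model
- Builds a dynamic tree of candidate continuations and verifies its branches in one target pass
- Achieves 85–90% acceptance rates on coding benchmarks
- Delivers up to roughly 4× speedup in the authors' reported benchmarks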
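# EAGLE draft head via vLLM. The minimum version, argument names, and
# expected checkpoint format depend on the vLLM release; check its docs.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="yuhuili/EAGLE-LLaMA3-Instruct-8B",
    num_speculative_tokens=6,
    speculative_draft_tensor_parallel_size=1,
)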
EAGLE-2 is particularly effective because its draft head conditions on the target model's hidden features, which carry more information about what the target will produce next than the sampled tokens alone.
Key Takeaways
- Speculative decoding is lossless: the output distribution is mathematically identical to the target model's.
- It works with any decoder-only model: no architecture changes are needed, as long as the draft model shares the target's tokenizer.
- Draft model choice matters: same-family smaller models give the best acceptance rates.
- Production-ready today: vLLM, Transformers, and TGI all support it out of the box.
- EAGLE-2 is the next frontier: moving from separate draft models to lightweight heads that draft from the target's own features.
The era of "slow LLM inference" is ending. Speculative decoding costs nothing in output quality, and a 2–3× speedup translates to roughly 50–67% lower latency. The price is the extra memory and compute for the draft model. If you're running LLMs in production and not using it, you're likely leaving performance on the table.
Want to experiment? Start with vLLM's built-in support, pick a same-family draft model, and measure your own speedup. The code is already there — you just need to flip the switch.
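For the measuring part, a small timing harness is usually enough. The sketch below assumes you wrap each setup in a callable that takes a prompt and returns the number of new tokens it generated. `baseline_generate` and `speculative_generate` are hypothetical stand-ins that just sleep, so swap in calls to your real engine.

```python
import time


def measure_tokens_per_second(generate, prompts, repeats=3):
    """Best tokens/second over several runs; generate(prompt) returns a token count."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        total_tokens = sum(generate(p) for p in prompts)
        elapsed = time.perf_counter() - start
        best = max(best, total_tokens / elapsed)
    return best


# Hypothetical stand-ins: replace the bodies with real generation calls
def baseline_generate(prompt):
    time.sleep(0.02)  # pretend the target model alone takes this long
    return 20


def speculative_generate(prompt):
    time.sleep(0.01)  # pretend speculative decoding is faster
    return 20


prompts = ["prompt one", "prompt two"]
base_rate = measure_tokens_per_second(baseline_generate, prompts)
spec_rate = measure_tokens_per_second(speculative_generate, prompts)
print(f"measured speedup: {spec_rate / base_rate:.2f}x")
```

Use prompts that look like your production traffic, since acceptance rate (and therefore speedup) varies a lot by task.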