Speculative Decoding: Accelerating LLM Inference by 2-3x Without Quality Loss

If you've deployed large language models in production, you know the bottleneck: autoregressive decoding generates one token at a time. Every token requires a full forward pass through the entire model. For a 70B-parameter model, that's expensive — and it gets worse the longer the output.

Speculative decoding changes the game. Instead of generating one token per forward pass, you generate multiple candidate tokens using a small "draft" model, then verify them all in parallel with the large model. The result? 2-3x speedups with zero quality degradation: the output distribution is mathematically identical to the original model's.

This tutorial walks you through the algorithm, implementation details, and how to build a production-ready speculative decoder from scratch.

How Speculative Decoding Works

The core idea is elegantly simple:

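  1. Draft phase: A small, fast model generates γ candidate tokens autoregressively.
  2. Verify phase: The large target model processes all γ tokens in one forward pass, producing its own probability for each candidate position.
  3. Accept or reject: Candidates are checked left to right, and each is accepted with probability min(1, p_target / p_draft). At the first rejection, that token is replaced by one sampled from a corrected distribution, the remaining candidates are discarded, and the cycle restarts. If all γ candidates are accepted, the target's final logits supply one extra token at no additional cost.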

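Because verification is parallel, each cycle costs γ cheap draft passes plus one target pass, instead of one target pass per token. When the draft model is much smaller than the target, the draft passes add little time, so every accepted token is nearly free. With a good draft model, acceptance rates of 60-80% are common.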
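To get a feel for what an acceptance rate buys you, here is a minimal sketch of the expected number of tokens per draft-then-verify cycle. It assumes each draft token is accepted independently with the same probability alpha (a simplification; real acceptance varies by position and context) and that the target always contributes one token per cycle, either the correction or the extra token after a full acceptance. The function name is illustrative.

```python
def expected_tokens_per_cycle(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per cycle under an i.i.d. acceptance assumption.

    alpha: probability that a single draft token is accepted (assumed constant).
    gamma: number of draft tokens proposed per cycle.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        # Every draft token is accepted, plus one extra token from the target.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


for alpha in (0.6, 0.7, 0.8):
    tokens = expected_tokens_per_cycle(alpha, gamma=5)
    print(f"alpha={alpha}: {tokens:.2f} tokens per target pass")
```

Under these assumptions, an 80% acceptance rate with γ=5 yields about 3.7 tokens per target forward pass rather than 5, which is why measured speedups sit well below γ.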

The Math Behind It

The acceptance criterion guarantees the output distribution matches the target model exactly. For each draft token x_t with draft probability q(x_t) and target probability p(x_t):

accept_prob = min(1, p(x_t) / q(x_t))

If accepted, the token stands. If rejected, we sample from the residual distribution:

residual(x) = max(0, p(x) - q(x)) / sum(max(0, p(x) - q(x)))

This is not an approximation — it's an exact sampling algorithm. The speculative output is distributionally identical to running the target model alone.
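The exactness claim can be checked numerically on a toy vocabulary. The sketch below computes the distribution of the token emitted by a single accept/reject step: the mass accepted directly from the draft plus the rejection probability times the residual. The distributions p and q are made-up examples, and the function name is illustrative.

```python
def speculative_output_distribution(p, q):
    """Exact distribution of the token emitted by one accept/reject step.

    p: target probabilities, q: draft probabilities (same vocabulary order).
    """
    # The draft proposes x with prob q(x) and it is accepted with prob
    # min(1, p(x)/q(x)), so the accepted mass on x is min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)

    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:
        # p and q are identical, so nothing is ever rejected.
        return accepted
    return [a + reject_prob * r / total for a, r in zip(accepted, residual)]


p = [0.6, 0.3, 0.1]  # toy target distribution
q = [0.2, 0.5, 0.3]  # toy draft distribution
out = speculative_output_distribution(p, q)
print(out)  # matches p up to floating-point error
```

The draft distribution q drops out of the result entirely: a poor draft model lowers the acceptance rate, not the output quality.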

Implementation with vLLM

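vLLM supports speculative decoding out of the box. The keyword arguments below follow the older speculative_model / num_speculative_tokens interface; newer vLLM releases take a speculative_config dictionary instead, so check the documentation for the version you have installed. Here's how to set it up: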

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",  # draft must share the target's tokenizer
    num_speculative_tokens=5,
    tensor_parallel_size=4,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

outputs = llm.generate(
    "Explain the time complexity of the attention mechanism",
    sampling_params,
)
print(outputs[0].outputs[0].text)

The num_speculative_tokens parameter controls γ. Start with 4-6 and tune based on your draft model's acceptance rate.
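Before tuning on real traffic, a rough model can suggest a starting γ. The sketch below assumes a constant per-token acceptance probability alpha, a draft pass that costs cost_ratio times a target pass, and a verification pass that costs the same as one ordinary target pass (reasonable when decoding is memory-bound, less so at large batch sizes). All names are illustrative, and the result is an estimate to validate by measurement.

```python
def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding under the stated assumptions."""
    if alpha == 1.0:
        tokens = float(gamma + 1)
    else:
        tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Cost of one cycle in units of a single target forward pass:
    # gamma draft passes plus one verification pass.
    cycle_cost = gamma * cost_ratio + 1.0
    return tokens / cycle_cost


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    """Pick the gamma in [1, max_gamma] with the highest estimated speedup."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: estimated_speedup(alpha, g, cost_ratio),
    )


for alpha in (0.6, 0.7, 0.8):
    g = best_gamma(alpha, cost_ratio=0.1)
    print(f"alpha={alpha}: gamma={g}, speedup ~{estimated_speedup(alpha, g, 0.1):.2f}x")
```

The curve is fairly flat near its peak, so being off by one or two on γ usually costs little; a draft model that is too slow or too inaccurate costs far more.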

Custom Speculative Decoding in PyTorch

For educational purposes (and when you need fine-grained control), here's a from-scratch implementation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def speculative_decode(
    draft_model, target_model, tokenizer,
    prompt: str, max_new_tokens: int,
    gamma: int = 5, temperature: float = 0.7,
):
    # Assumes batch size 1, temperature > 0, both models on the same device,
    # and a tokenizer shared by both models. No KV cache is used, so this
    # shows the algorithm rather than the speedup.
    device = next(target_model.parameters()).device
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    prompt_len = input_ids.shape[1]

    while input_ids.shape[1] - prompt_len < max_new_tokens:
        remaining = max_new_tokens - (input_ids.shape[1] - prompt_len)
        k = min(gamma, remaining)

        # Phase 1: Draft model generates k tokens; keep each full distribution q
        draft_ids = input_ids.clone()
        draft_dists = []
        for _ in range(k):
            draft_logits = draft_model(draft_ids).logits[:, -1, :].float() / temperature
            q = torch.softmax(draft_logits, dim=-1)
            next_token = torch.multinomial(q, num_samples=1)
            draft_dists.append(q[0])
            draft_ids = torch.cat([draft_ids, next_token], dim=1)
        candidates = draft_ids[0, -k:].tolist()

        # Phase 2: Target model scores all k tokens in one pass.
        # Logits at position j predict token j+1, so the last k+1 positions give
        # p for each candidate plus one extra distribution for a bonus token.
        target_logits = target_model(draft_ids).logits[:, -k-1:, :].float()
        target_probs = torch.softmax(target_logits / temperature, dim=-1)[0]

        # Phase 3: Accept/reject each token, left to right
        new_tokens = []
        for i in range(k):
            p, q = target_probs[i], draft_dists[i]
            tok = candidates[i]
            if torch.rand(1).item() < min(1.0, (p[tok] / q[tok]).item()):
                new_tokens.append(tok)
            else:
                # Resample from the residual max(0, p - q), renormalized
                residual = torch.clamp(p - q, min=0)
                residual = residual / residual.sum()
                new_tokens.append(torch.multinomial(residual, num_samples=1).item())
                break
        else:
            # All k accepted: sample one bonus token from the target distribution
            new_tokens.append(torch.multinomial(target_probs[k], num_samples=1).item())

        new_ids = torch.tensor([new_tokens], device=device)
        input_ids = torch.cat([input_ids, new_ids], dim=1)

    # The bonus token can overshoot by one; trim to the requested length
    output_ids = input_ids[0, :prompt_len + max_new_tokens]
    return tokenizer.decode(output_ids, skip_special_tokens=True)

Choosing the Right Draft Model

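The draft model is critical: it has to be fast enough that drafting is cheap and accurate enough that most of its tokens are accepted. Here are commonly used pairings: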

Target Model  | Recommended Draft               | Typical γ
Llama-3-70B   | Llama-3-8B                      | 4-6
Llama-3-70B   | Medusa-style heads (same model) | 5-7
Mixtral-8x7B  | Mistral-7B                      | 3-5
Qwen2.5-72B   | Qwen2.5-7B                      | 4-6

Key guidelines:

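  • Same architecture family works best: the token distributions align better, and the draft must share the target's tokenizer so that token IDs mean the same thing to both models.
  • A draft with about 10-15% of the target's parameter count is a good starting point.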
  • Medusa heads (additional LM heads trained on the same model) avoid loading a second model entirely.
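A quick sanity check before pairing two models is to confirm that their vocabularies agree, since the accept/reject rule compares probabilities token ID by token ID. The sketch below is a hypothetical helper that works on plain token-to-ID dictionaries; with Hugging Face tokenizers, such dictionaries can be obtained from tokenizer.get_vocab(). The toy vocabularies are made up for illustration.

```python
def vocab_mismatches(target_vocab: dict, draft_vocab: dict, limit: int = 5) -> list:
    """Return up to `limit` target tokens that are missing from the draft
    vocabulary or mapped to a different ID there."""
    mismatched = []
    for token, token_id in target_vocab.items():
        if draft_vocab.get(token) != token_id:
            mismatched.append(token)
            if len(mismatched) == limit:
                break
    return mismatched


def vocabularies_compatible(target_vocab: dict, draft_vocab: dict) -> bool:
    """True when both vocabularies have the same size and identical mappings."""
    return (
        len(target_vocab) == len(draft_vocab)
        and not vocab_mismatches(target_vocab, draft_vocab, limit=1)
    )


# Toy vocabularies; real ones would come from tokenizer.get_vocab().
target = {"<s>": 0, "hello": 1, "world": 2}
same_family = {"<s>": 0, "hello": 1, "world": 2}
other_family = {"<s>": 0, "world": 1, "hello": 2}

print(vocabularies_compatible(target, same_family))   # True
print(vocabularies_compatible(target, other_family))  # False
print(vocab_mismatches(target, other_family))         # ['hello', 'world']
```

Passing this check is necessary but not sufficient: two models can share a tokenizer and still disagree often enough that the acceptance rate is poor.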

Eagle: Speculative Decoding Without a Separate Draft Model

Eagle (Extrapolation Algorithm for Greater Language-model Efficiency) takes a different approach. Instead of a separate draft model, it trains a lightweight autoregressive head that predicts the target model's next hidden-state features from the previous features and tokens, then reuses the target's own LM head to turn those predicted features into candidate tokens.

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    speculative_model="yuhuili/EAGLE-LLaMA3-Instruct-70B",
    num_speculative_tokens=6,
    tensor_parallel_size=4,  # a 70B model in FP16 does not fit on one 80 GB GPU
)
# Eagle typically achieves 70-85% acceptance rates on Llama-3-70B

Eagle avoids most of the memory overhead of keeping a second full model loaded, since its draft head is small relative to the target, and it often achieves higher acceptance rates than traditional draft-then-verify approaches.

Production Considerations

1. Batching and Throughput

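Speculative decoding shines for single-request and small-batch latency, where decoding is memory-bandwidth-bound and the GPU has spare compute to verify several tokens at once. With many concurrent requests, batching already uses that compute, so the per-request speedup shrinks and speculation can even lower total throughput at large batch sizes. Profile your specific workload.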

2. KV Cache Management

When tokens are rejected, their KV cache entries must be discarded. vLLM handles this automatically, but if implementing manually, ensure your cache management correctly tracks which positions are committed vs. speculative.
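As a rough illustration of the bookkeeping involved, here is a toy cache that tracks a committed prefix and a speculative tail. It stores opaque entries in a Python list rather than real key/value tensors, and the class and method names are hypothetical; real engines do the equivalent per layer and per attention head.

```python
class SpeculativeCache:
    """Toy per-sequence cache separating committed from speculative entries."""

    def __init__(self):
        self.entries = []   # one entry per cached token position
        self.committed = 0  # entries[:committed] are final

    def append_speculative(self, entry):
        """Add an entry for a drafted token that has not been verified yet."""
        self.entries.append(entry)

    def resolve(self, num_accepted: int):
        """Commit the first num_accepted speculative entries and drop the rest."""
        speculative = len(self.entries) - self.committed
        if not 0 <= num_accepted <= speculative:
            raise ValueError("num_accepted exceeds the speculative entries")
        self.committed += num_accepted
        del self.entries[self.committed:]


cache = SpeculativeCache()
for entry in ("prompt_0", "prompt_1"):
    cache.append_speculative(entry)
cache.resolve(2)  # the prompt is always committed

for entry in ("draft_0", "draft_1", "draft_2"):
    cache.append_speculative(entry)
cache.resolve(1)  # only the first draft token was accepted

print(cache.entries)  # ['prompt_0', 'prompt_1', 'draft_0']
```

The corrected token sampled after a rejection has no cache entry yet; in this sketch it would be appended on the next forward pass like any other new token.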

3. Memory Overhead

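Loading a draft model adds VRAM. In FP16, a 70B target needs roughly 140 GB for weights alone and a 7B draft adds about 14 GB, so the pair needs multiple 80 GB GPUs (H100 or A100) before counting any KV cache. With 8-bit weights the pair comes to roughly 77 GB, which leaves almost no room for KV cache on a single 80 GB card. Consider quantization, model offloading, or Medusa/Eagle approaches to avoid a second full model.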
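A back-of-the-envelope estimate of weight memory shows how strongly the footprint depends on precision. The sketch below counts weights only, using 1 GB = 1e9 bytes and nominal parameter counts of 70B and 7B; KV cache, activations, and framework overhead come on top and are not modeled here.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


TARGET_PARAMS = 70e9  # nominal 70B target
DRAFT_PARAMS = 7e9    # nominal 7B draft

for name, bytes_per_param in (("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)):
    target_gb = weight_memory_gb(TARGET_PARAMS, bytes_per_param)
    draft_gb = weight_memory_gb(DRAFT_PARAMS, bytes_per_param)
    print(f"{name}: target {target_gb:.0f} GB + draft {draft_gb:.1f} GB "
          f"= {target_gb + draft_gb:.1f} GB")
```

Only at 8-bit precision or lower does the pair's weight footprint drop to around 80 GB or less, and a production deployment still needs headroom for the KV cache.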

4. Temperature Interaction

At low temperatures (greedy/near-greedy decoding), acceptance rates increase significantly because the draft and target models agree more on high-probability tokens. At very high temperatures, acceptance rates drop.
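The low-temperature effect can be seen on a toy example. For a single position, the probability that a draft token is accepted equals the sum over the vocabulary of min(p(x), q(x)). The sketch below evaluates that quantity for made-up logits in which both models prefer the same token; it only illustrates the low-temperature end and says nothing about real models.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def acceptance_probability(target_logits, draft_logits, temperature):
    """Probability that one draft token is accepted: sum_x min(p(x), q(x))."""
    p = softmax(target_logits, temperature)
    q = softmax(draft_logits, temperature)
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Toy logits: both models rank token 0 first but disagree on the margins.
TARGET_LOGITS = [4.0, 2.0, 1.0, 0.0]
DRAFT_LOGITS = [3.0, 2.5, 0.5, 0.0]

for temperature in (0.1, 0.25, 0.5, 1.0):
    rate = acceptance_probability(TARGET_LOGITS, DRAFT_LOGITS, temperature)
    print(f"temperature={temperature}: acceptance ~{rate:.3f}")
```

As the temperature falls, both distributions concentrate on the shared top token and the overlap between them grows. If the two models disagreed on the top token, lowering the temperature would reduce the overlap instead, which is another reason a well-matched draft matters.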

Performance Benchmarks

On a single H100 with Llama-3-70B (which implies quantized weights, since FP16 weights alone are about 140 GB):

  • Standard decoding: ~15 tokens/sec
  • + Llama-3-8B draft (γ=5): ~35 tokens/sec (2.3x)
  • + Eagle-70B (γ=6): ~42 tokens/sec (2.8x)

These numbers assume typical English text generation with temperature=0.7. Code generation and structured outputs often see higher speedups because they're more predictable.
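Numbers like these depend heavily on hardware, prompts, and settings, so it is worth measuring your own. Here is a minimal timing sketch. It assumes a hypothetical generate(prompt) callable that runs your decoder and returns the number of new tokens produced; the clock is injectable so the example below runs deterministically with a fake one.

```python
import time


def tokens_per_second(generate, prompts, clock=time.perf_counter):
    """Average decoding throughput over a list of prompts.

    generate: callable taking a prompt and returning the number of new
              tokens it produced (assumed interface, not a library API).
    clock:    zero-argument callable returning the current time in seconds.
    """
    start = clock()
    total_tokens = sum(generate(prompt) for prompt in prompts)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed


# Deterministic demonstration: a stand-in generator that always "produces"
# 10 tokens, and a fake clock that reports 2 seconds of elapsed time.
def fake_generate(prompt):
    return 10


fake_clock = iter([0.0, 2.0]).__next__
rate = tokens_per_second(fake_generate, ["a", "b", "c"], clock=fake_clock)
print(rate)  # 15.0
```

Run the same prompts with and without speculation, at the batch size you actually serve, and compare the two rates rather than relying on published figures.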

When Speculative Decoding Falls Short

Be aware of the limitations:

  • Creative/unpredictable outputs — high-temperature sampling reduces acceptance rates.
  • Very short generations — overhead dominates when generating fewer than ~20 tokens.
  • Multi-modal models — speculative decoding for vision-language models is still an active research area.

Conclusion

Speculative decoding is one of the few inference optimizations that speeds up generation without changing the output distribution: no approximation, no quality loss, just parallel verification of smart guesses. The price is some extra memory and compute for drafting. If you're running LLMs in production and not using it, you're likely leaving 2-3x performance on the table.

The technique is mature enough for production: vLLM, TGI, and TensorRT-LLM all support it. Start with a same-family draft model, tune γ for your workload, and measure the acceptance rate. An acceptance rate above roughly 50% usually means the speedup outweighs the drafting overhead, though the break-even point depends on how cheap your draft model is.