Speculative Decoding: 2–3× Faster LLM Inference Without Quality Loss

Large language models are notoriously slow at inference. Each token requires a full forward pass through billions of parameters, and autoregressive generation means you pay that cost one token at a time. But what if the large model could emit several tokens per forward pass while still sampling from exactly the same output distribution?

That's the promise of speculative decoding, an inference acceleration technique that has moved quickly from research papers into production frameworks like vLLM and Transformers. In this tutorial, we'll break down the algorithm, implement it from scratch, and show you how to deploy it with real models.

How Speculative Decoding Works

The core idea is elegant: use a small, fast draft model to guess the next few tokens, then use the large target model to verify all of them in parallel. Every draft token the target accepts is a token you got without a separate target pass. At the first token it rejects, you discard the rest of the draft and resample that position from a corrected distribution, so each verification pass always yields at least one token, and the output distribution is mathematically identical to sampling from the target model alone.
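The greedy (temperature 0) case is the easiest way to see why nothing is lost. The sketch below uses two toy "models" that are just Python functions over token lists; `target_next`, `draft_next`, and `greedy_speculative_decode` are illustrative names, not part of any library. It counts how many target passes are needed, treating one verification of a whole draft as a single pass, as a real batched forward pass would be.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy target model: a deterministic next token computed from the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except when len(ctx) % 4 == 0."""
    tok = target_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 4 == 0 else tok


def greedy_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive greedy decoding: one model call per token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def greedy_speculative_decode(target: NextToken, draft: NextToken,
                              prompt: List[int], n_new: int,
                              gamma: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # Draft proposes gamma tokens autoregressively.
        proposal: List[int] = []
        for _ in range(gamma):
            proposal.append(draft(out + proposal))

        # One target pass checks every drafted position (batched in a real model).
        target_passes += 1
        for i in range(gamma + 1):
            expected = target(out + proposal[:i])
            if i < gamma and proposal[i] == expected:
                continue  # draft token accepted, check the next one
            # First mismatch (or the bonus position): keep the accepted
            # prefix and append the target's own token.
            out.extend(proposal[:i] + [expected])
            break
    return out[: len(prompt) + n_new], target_passes


prompt = [1, 2, 3]
reference = greedy_decode(target_next, prompt, 20)
speculative, passes = greedy_speculative_decode(target_next, draft_next, prompt, 20)
print(speculative == reference, passes)
```

Every token that reaches the output is one the target itself would have chosen for that prefix, so the two sequences match; the draft only changes how many target passes it takes to get there.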

The algorithm has two phases:

Phase 1: Draft Generation

The draft model generates γ (gamma) candidate tokens autoregressively from the current context:

x_i ~ P_draft(· | context, x_1, ..., x_{i-1})   for i = 1, ..., γ

Phase 2: Parallel Verification

The target model evaluates the probability of all draft tokens in a single forward pass:

P_target(x_1 | context), P_target(x_2 | context, x_1), ..., P_target(x_γ | context, x_1, ..., x_{γ-1})

For each position i, accept the draft token with probability:

min(1, P_target(x_i) / P_draft(x_i))

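Verification runs left to right and stops at the first rejection. If the token at position i is rejected, discard it and every draft token after it, then sample a replacement for position i from the residual distribution: norm(max(0, P_target - P_draft)). If all γ tokens are accepted, sample one extra token from the target distribution at position γ+1, which the same forward pass has already computed.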

This acceptance rule guarantees the output matches the target model's distribution exactly. No approximation, no quality loss.
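The guarantee can be checked empirically on a toy vocabulary. The sketch below applies the accept/reject rule for a single position many times and compares the empirical token frequencies with the target distribution. The distributions `p_target` and `p_draft` and the helper `speculative_sample` are made up for illustration.

```python
import random
from collections import Counter
from typing import List


def speculative_sample(p_target: List[float], p_draft: List[float],
                       rng: random.Random) -> int:
    """One speculative sampling step for a single position."""
    vocab = range(len(p_target))
    # Draft proposes a token from its own distribution.
    x = rng.choices(vocab, weights=p_draft)[0]
    # Accept with probability min(1, P_target(x) / P_draft(x)).
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x
    # Rejected: resample from the residual max(0, P_target - P_draft).
    # random.choices normalizes the weights internally.
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    return rng.choices(vocab, weights=residual)[0]


rng = random.Random(0)
p_target = [0.6, 0.3, 0.1]
p_draft = [0.2, 0.5, 0.3]
n = 100_000

counts = Counter(speculative_sample(p_target, p_draft, rng) for _ in range(n))
empirical = [counts[i] / n for i in range(len(p_target))]
print(empirical)  # should land close to p_target, not p_draft
```

The draft here is deliberately poor, so many proposals are rejected, yet the emitted tokens still follow `p_target`. A bad draft costs speed, not correctness.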

Building a Speculative Decoder from Scratch

Let's implement this with PyTorch and Hugging Face Transformers. We'll use Meta-Llama-3-8B-Instruct as the target and Llama-3.2-1B-Instruct as the draft; the two share a tokenizer, which token-level verification requires.

Step 1: Load the Models

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target model (large, slow, high quality)
target_name = "meta-llama/Meta-Llama-3-8B-Instruct"
target_model = AutoModelForCausalLM.from_pretrained(
    target_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Draft model (small, fast, lower quality)
draft_name = "meta-llama/Llama-3.2-1B-Instruct"
draft_model = AutoModelForCausalLM.from_pretrained(
    draft_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(target_name)
tokenizer.pad_token = tokenizer.eos_token

Step 2: Core Speculative Decoding Loop

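import torch
import torch.nn.functional as F


def probs_from_logits(logits, temperature, top_p):
    """Temperature + nucleus (top-p) filtering, returned as float32 probabilities.

    The same transform is applied to draft and target logits, so the
    acceptance test compares the distributions that are actually sampled.
    """
    probs = F.softmax(logits.float() / temperature, dim=-1)
    if top_p < 1.0:
        sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
        # Drop a token once the mass ranked before it already exceeds top_p
        drop = sorted_probs.cumsum(dim=-1) - sorted_probs > top_p
        sorted_probs = sorted_probs.masked_fill(drop, 0.0)
        probs = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)
    return probs


def speculative_decode(target_model, draft_model, input_ids,
                       tokenizer, gamma=4, max_new_tokens=100,
                       temperature=0.8, top_p=0.9):
    """
    Speculative decoding with parallel verification.

    Didactic version: batch size 1, temperature > 0, no KV cache reuse.

    Args:
        gamma: Number of draft tokens to generate per verification step
        max_new_tokens: Maximum tokens to generate
        temperature: Sampling temperature (applied to both models)
        top_p: Nucleus sampling threshold (applied to both models)
    """
    prompt_len = input_ids.shape[1]
    generated_tokens = input_ids.clone()

    with torch.no_grad():
        while generated_tokens.shape[1] - prompt_len < max_new_tokens:

            # --- PHASE 1: Draft Generation ---
            draft_tokens = []
            draft_probs_cache = []  # draft distribution at each drafted position
            draft_input = generated_tokens.to(draft_model.device)

            for _ in range(gamma):
                draft_logits = draft_model(draft_input).logits[:, -1, :]
                draft_probs = probs_from_logits(draft_logits, temperature, top_p)
                draft_token = torch.multinomial(draft_probs, num_samples=1)
                draft_tokens.append(draft_token)
                draft_probs_cache.append(draft_probs)
                draft_input = torch.cat([draft_input, draft_token], dim=1)

            # --- PHASE 2: Parallel Verification ---
            # One target pass scores every drafted position, plus one extra
            # position used for the bonus token when all drafts are accepted.
            target_logits = target_model(draft_input.to(target_model.device)).logits
            target_probs = probs_from_logits(
                target_logits[:, -gamma - 1:, :], temperature, top_p
            ).to(draft_input.device)

            accepted = 0
            next_token = None
            for i in range(gamma):
                target_pos_probs = target_probs[:, i]
                draft_pos_probs = draft_probs_cache[i]

                draft_token_id = draft_tokens[i].item()
                p_target = target_pos_probs[0, draft_token_id].item()
                # p_draft > 0 because the token was sampled from draft_pos_probs
                p_draft = draft_pos_probs[0, draft_token_id].item()

                # Accept with probability min(1, p_target / p_draft)
                if torch.rand(1).item() < min(1.0, p_target / p_draft):
                    accepted += 1
                else:
                    # Rejected: resample this position from the residual distribution
                    residual = torch.clamp(target_pos_probs - draft_pos_probs, min=0)
                    next_token = torch.multinomial(residual / residual.sum(), num_samples=1)
                    break

            if next_token is None:
                # All drafts accepted: sample a bonus token from the target
                next_token = torch.multinomial(target_probs[:, gamma], num_samples=1)

            # Append the accepted prefix plus the replacement or bonus token
            new_tokens = torch.cat(draft_tokens[:accepted] + [next_token], dim=1)
            generated_tokens = torch.cat(
                [generated_tokens, new_tokens.to(generated_tokens.device)], dim=1
            )

            # Stop once the end-of-sequence token has been produced
            if tokenizer.eos_token_id in new_tokens[0].tolist():
                break

    # A step can overshoot the budget by up to gamma tokens; trim the excess
    return generated_tokens[:, :prompt_len + max_new_tokens]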

Step 3: The Real Implementation — Using vLLM

In production, you shouldn't roll your own. vLLM has native speculative decoding support that handles KV cache management, batched requests, and GPU memory optimization:

from vllm import LLM, SamplingParams

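# Keyword names below follow older vLLM releases; newer ones take these
# options through a `speculative_config` dict, so check the docs for your version.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=4,  # gamma
    max_model_len=4096,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9
)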

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=200
)

output = llm.generate(
    ["Explain how attention mechanisms work in transformers"],
    sampling_params
)

print(output[0].outputs[0].text)

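That's it. vLLM handles the draft generation, parallel verification, and residual sampling internally. The output distribution is unchanged; the speedup you see (often in the 2–3× range for a well-matched draft) depends on the acceptance rate, batch size, and hardware, and it tends to shrink under heavy batching, where the GPU is already compute-bound.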

When Does Speculative Decoding Shine?

The speedup depends on the acceptance rate — how often the draft model guesses correctly. Higher acceptance rates mean more tokens per verification pass.

Scenario                           Expected Acceptance Rate   Speedup
Coding tasks (structured syntax)   70–85%                     2.5–3.0×
Technical documentation            60–75%                     2.0–2.5×
Creative writing                   40–60%                     1.5–2.0×
Mathematical reasoning             30–50%                     1.3–1.8×
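To see how acceptance rate turns into speedup, a simple cost model helps. Under the simplifying assumption that each draft token is accepted independently with probability α, a verification pass yields (1 - α^(γ+1)) / (1 - α) tokens on average. Dividing by the cost of the pass (γ draft steps plus one target step) gives an estimated wall-clock speedup. The function names and the `cost_ratio` parameter (draft step time divided by target step time) are illustrative, and the 0.1 used below is an assumed value, not a measurement.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Average tokens produced per verification pass.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    cost_ratio is the time of one draft step divided by the time of one
    target step (an assumed input; measure it on your own hardware).
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1)


print(f"{expected_tokens_per_pass(0.8, 4):.2f}")  # 3.36 tokens per pass
print(f"{estimated_speedup(0.8, 4, 0.1):.2f}")    # 2.40x under this model
```

The model also shows why a larger γ is not automatically better: extra draft steps are always paid for, but they only pay off while α^γ is still large enough for late tokens to be accepted.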

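Best results come when the draft model is a smaller version of the same model family (e.g., Llama 1B drafting for Llama 8B). The draft and target must share a tokenizer and vocabulary for token-level verification to work at all, so cross-family pairs (e.g., Phi drafting for Mistral) are usually not drop-in compatible, and even where they can be made to work they see lower acceptance rates.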
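Verification compares the two models' probabilities at the same token IDs, so it is worth checking that a candidate draft and target agree on their vocabularies before pairing them. The helper below is a hypothetical sketch that works on plain token-to-ID dictionaries, such as those returned by a Hugging Face tokenizer's `get_vocab()`; the toy vocabularies are made up.

```python
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, Optional[int], Optional[int]]


def vocab_mismatches(target_vocab: Dict[str, int],
                     draft_vocab: Dict[str, int],
                     limit: int = 5) -> List[Mismatch]:
    """Return up to `limit` tokens whose IDs differ between two vocabularies.

    Each entry is (token, target_id, draft_id); a missing token has ID None.
    """
    mismatches: List[Mismatch] = []
    for token in sorted(set(target_vocab) | set(draft_vocab)):
        t_id, d_id = target_vocab.get(token), draft_vocab.get(token)
        if t_id != d_id:
            mismatches.append((token, t_id, d_id))
            if len(mismatches) >= limit:
                break
    return mismatches


# Toy vocabularies standing in for tokenizer.get_vocab() output
same_family = {"<s>": 0, "hello": 1, "world": 2}
other_family = {"<s>": 0, "hello": 2, "world": 1, "!": 3}

print(vocab_mismatches(same_family, dict(same_family)))  # []
print(vocab_mismatches(same_family, other_family))
```

With real models, the call would look like `vocab_mismatches(target_tokenizer.get_vocab(), draft_tokenizer.get_vocab())`; an empty list is a necessary condition for pairing the models, not a guarantee of a good acceptance rate.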

Advanced: EAGLE-2 — The Next Generation

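Recent research has pushed beyond standalone draft models. EAGLE-2 (EAGLE stands for Extrapolation Algorithm for Greater Language-model Efficiency) replaces the separate draft model with a lightweight draft head attached to the target model that:

  • Uses the target model's internal features (not just output tokens) to predict future tokens
  • Drafts autoregressively at the feature level and expands the candidates into a tree whose shape adapts to the draft head's confidence, so several continuations are verified in one pass
  • Is reported to reach 85–90% acceptance rates on coding benchmarks
  • Is reported to deliver up to 4× speedup; as with other speculative methods, gains typically shrink as batch size grows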
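
# EAGLE draft head via vLLM. EAGLE support and the expected checkpoint
# format depend on the vLLM version; check the docs for the release you run.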
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="yuhuili/EAGLE-LLaMA3-Instruct-8B",
    num_speculative_tokens=6,
    speculative_draft_tensor_parallel_size=1,
)

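EAGLE-2 is particularly effective because its draft head conditions on the target model's own hidden states, which carry more information about upcoming tokens than the sampled tokens alone, instead of approximating the output distribution with a separate model.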

Key Takeaways

  1. Speculative decoding is lossless: the output distribution is mathematically identical to sampling from the target model.
  2. It needs no changes to the target model: any autoregressive decoder works, as long as the draft shares its vocabulary.
  3. Draft model choice matters: same-family smaller models give the best acceptance rates.
  4. Production-ready today: vLLM, Transformers, and TGI all support it out of the box.
  5. EAGLE-2 is the next frontier: moving from separate draft models to draft heads that reuse the target model's features.

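The era of "slow LLM inference" is ending. Speculative decoding costs nothing in output quality, and a 2–3× speedup corresponds to roughly 50–67% lower per-request latency; the price is the extra memory and compute for the draft model. If you're running LLMs in production and not using it, you may be leaving performance on the table.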
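Speedup factors and latency savings are two views of the same number: a speedup of s removes a fraction 1 - 1/s of the original latency. A small helper (the name `latency_saving` is illustrative) makes the conversion explicit for the speedup range quoted in the table above.

```python
def latency_saving(speedup: float) -> float:
    """Fraction of per-request latency removed by a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


for s in (1.3, 2.0, 3.0):
    print(f"{s:.1f}x faster -> {latency_saving(s):.0%} less latency")
# 1.3x faster -> 23% less latency
# 2.0x faster -> 50% less latency
# 3.0x faster -> 67% less latency
```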

Want to experiment? Start with vLLM's built-in support, pick a same-family draft model, and measure your own speedup. The code is already there — you just need to flip the switch.