Speculative Decoding: How to Speed Up LLM Inference 3x with a Tiny Draft Model

If you are deploying large language models in production, inference latency is probably your biggest bottleneck. Every token generated requires a full forward pass through billions of parameters. But what if you could verify multiple tokens at once instead of generating them one at a time? That is exactly what speculative decoding enables — and it can accelerate generation by 2x to 4x without sacrificing output quality.

In this tutorial, we will break down how speculative decoding works, why it is mathematically sound, and how to implement it with vLLM and Hugging Face Transformers.

The Bottleneck: Autoregressive Decoding

Standard LLM inference is fundamentally sequential. Given a prompt, the model:

  1. Runs a forward pass to compute logits for the next token.
  2. Samples or greedily selects the token.
  3. Appends it to the context and repeats.

Each step requires loading all model weights from memory — a memory-bandwidth-bound operation. On a 70B model, a single token generation can take 50–100ms even on an A100 GPU. The compute units sit idle while waiting for data to arrive.
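A rough back-of-the-envelope check of that latency figure, assuming every weight is read from GPU memory once per decoding step. The bandwidth value (about 2000 GB/s, roughly A100 80 GB class) and the 2-byte weights are assumptions, and the helper name is made up for illustration. The estimate ignores KV-cache reads and the fact that 140 GB of 16-bit weights would have to be sharded or quantized in practice:

```python
def min_decode_latency_ms(n_params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency if all weights are read once per step."""
    weight_bytes = n_params * bytes_per_param
    seconds = weight_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Assumed inputs: 70B parameters, 2 bytes each (fp16/bf16), ~2000 GB/s memory bandwidth.
latency = min_decode_latency_ms(70e9, 2, 2000)
print(f"{latency:.0f} ms per token")
```

With these inputs the bound comes out at about 70 ms, inside the 50–100ms range quoted above.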

The Core Idea: Draft, Then Verify

Speculative decoding exploits a simple observation: for a memory-bound model, scoring γ tokens in one forward pass costs roughly the same as generating a single token. Loading the weights dominates the cost, so whether the pass checks one position or ten, that load is already paid for.

The algorithm works in two phases:

Phase 1: Draft Generation

A small, fast "draft model" (e.g., a 125M–1B parameter model) generates γ candidate tokens autoregressively. Because the draft model is tiny, this step is very fast.

Phase 2: Parallel Verification

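The large "target model" processes all γ draft tokens in a single forward pass. For each position, it computes its own distribution and compares it with the draft's:

  • With greedy decoding, a draft token is accepted if it equals the target's top choice. With sampling, it is accepted with probability min(1, q/p), where q and p are the probabilities the target and the draft assign to that token.
  • At the first rejection, a replacement token is sampled from a corrected target distribution, and all subsequent draft tokens are discarded.

If every draft token is accepted, the same pass also yields one extra token from the target, so each cycle emits between 1 and γ + 1 tokens. This guarantees that the output distribution is identical to what the target model would have produced alone: no approximation, no quality loss.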
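To make the two phases concrete, here is a minimal sketch of the greedy case, where acceptance reduces to an exact-match check. The two "models" are toy deterministic functions invented for this illustration, and the verification loop only imitates what a real target model would do in one batched forward pass:

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    # Toy "target model": a deterministic rule over a 10-token vocabulary.
    return (ctx[-1] * 3 + len(ctx)) % 10


def draft_next(ctx: List[int]) -> int:
    # Toy "draft model": imitates the target except right after token 7.
    return 0 if ctx[-1] == 7 else target_next(ctx)


def speculative_step(ctx: List[int], gamma: int, draft: NextToken, target: NextToken) -> List[int]:
    # Phase 1: the draft proposes gamma tokens autoregressively.
    proposal: List[int] = []
    for _ in range(gamma):
        proposal.append(draft(ctx + proposal))

    # Phase 2: the target checks each proposed token in order.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(ctx + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, drop the rest of the draft.
            return accepted + [expected]
        accepted.append(tok)

    # Every draft token matched, so the pass also yields one bonus token.
    return accepted + [target(ctx + accepted)]


def generate_speculative(prompt: List[int], n_new: int, gamma: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out += speculative_step(out, gamma, draft_next, target_next)
    return out[: len(prompt) + n_new]


def generate_greedy(prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


# The speculative loop reproduces plain greedy decoding token for token.
print(generate_speculative([1], 20) == generate_greedy([1], 20))
```

Every emitted token is either a draft token the target agreed with or the target's own choice, which is why the two outputs coincide.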

The Acceptance-Rejection Algorithm

Here is the precise acceptance mechanism, which is what makes speculative decoding distribution-preserving:

import torch
from torch.distributions import Categorical

def speculative_accept_reject(
    draft_tokens: list[int],
    draft_probs: list[torch.Tensor],
    target_logits: torch.Tensor,
    temperature: float = 1.0,
) -> tuple[int, list[int]]:
    """
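    Returns (accepted_count, final_tokens).
    draft_tokens: the γ tokens proposed by the draft model
    draft_probs: the draft model's full distribution at each drafted position,
        computed at the same temperature as the target
    target_logits: raw target logits, shape (len(draft_tokens) + 1, vocab_size);
        the extra row scores the position after the last draft token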
    """
    accepted = 0
    target_probs = torch.softmax(target_logits / temperature, dim=-1)

    for i, (draft_token, draft_prob) in enumerate(zip(draft_tokens, draft_probs)):
        # Compute acceptance probability
        t_prob = target_probs[i, draft_token].item()
        d_prob = draft_prob[draft_token].item()
        accept_prob = min(1.0, t_prob / max(d_prob, 1e-10))

        if torch.rand(1).item() < accept_prob:
            accepted += 1  # Accept this draft token
        else:
            # Reject: resample from adjusted distribution
            adjusted = target_probs[i] - draft_prob
            adjusted = torch.clamp(adjusted, min=0)
            adjusted = adjusted / adjusted.sum()
            replacement = Categorical(adjusted).sample().item()
            return accepted, draft_tokens[:accepted] + [replacement]

    # All draft tokens accepted — sample one more from target
    final_prob = target_probs[len(draft_tokens)]
    extra = Categorical(final_prob).sample().item()
    return accepted, draft_tokens + [extra]

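The key insight: when a draft token is rejected, the replacement is drawn from the corrected distribution max(q - p, 0) / Σ max(q - p, 0), where q is the target distribution and p is the draft distribution at that position. This correction exactly offsets the bias introduced by accepting with probability min(1, q/p), so the marginal output distribution matches the target model exactly.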
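The distribution-preserving claim can be checked exactly for a single position without any sampling. The sketch below uses a made-up three-token vocabulary and hypothetical names. It adds the probability of proposing and accepting each token to the probability of rejecting and then resampling it:

```python
from typing import List, Tuple


def output_distribution(target: List[float], draft: List[float]) -> Tuple[List[float], float]:
    """Exact distribution of the token emitted at one position, plus the acceptance rate."""
    n = len(target)
    # P(draft proposes x and it is accepted) = draft[x] * min(1, target[x] / draft[x]),
    # which simplifies to min(draft[x], target[x]).
    accepted = [min(draft[x], target[x]) for x in range(n)]
    accept_rate = sum(accepted)

    # Distribution used after a rejection: max(target - draft, 0), normalised.
    residual = [max(target[x] - draft[x], 0.0) for x in range(n)]
    z = sum(residual)
    if z > 0:
        residual = [r / z for r in residual]

    out = [accepted[x] + (1.0 - accept_rate) * residual[x] for x in range(n)]
    return out, accept_rate


target = [0.5, 0.3, 0.2]  # toy target distribution
draft = [0.2, 0.2, 0.6]   # toy draft distribution that disagrees with it
out, rate = output_distribution(target, draft)
print(out, rate)
```

For these toy values the emitted distribution equals the target up to floating-point error even though the draft is accepted only about 60% of the time. The acceptance rate is the overlap Σ min(p, q) between the two distributions, which is why a draft that tracks the target closely is accepted more often.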

Implementing with vLLM

vLLM has first-class support for speculative decoding. Here is how to configure it:

from vllm import LLM, SamplingParams

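llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    # Recent vLLM releases take a speculative_config dict; older releases used
    # speculative_model=... and num_speculative_tokens=... as direct arguments.
    speculative_config={
        # The draft must share the target's vocabulary, hence a Llama 3 family model.
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "num_speculative_tokens": 5,
    },
    max_num_seqs=16,
    tensor_parallel_size=4,
    max_model_len=8192,
)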

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

outputs = llm.generate(
    ["Explain the transformer attention mechanism"],
    sampling_params,
)

print(outputs[0].outputs[0].text)

The num_speculative_tokens=5 setting tells vLLM to have the draft model propose 5 tokens per step, then verify all 5 in one pass through the 70B model.

Hugging Face Transformers Implementation

Transformers also supports speculative decoding, which it calls assisted generation, through the assistant_model argument of generate():

from transformers import AutoModelForCausalLM, AutoTokenizer

target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)

draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

inputs = tokenizer(
    "Write a Python function that implements binary search",
    return_tensors="pt",
).to(target_model.device)

# Use the draft model as an assistant during generation
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    max_new_tokens=256,
    do_sample=True,  # without this, generate() decodes greedily and ignores temperature
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Transformers handles the acceptance-rejection logic internally. The draft model just needs to share the same tokenizer vocabulary as the target model.

Choosing the Right Draft Model

The choice of draft model significantly impacts performance:

Target Model     Good Draft Model    Expected Acceptance (γ=5)
Llama-3.1-70B    Llama-3.2-1B        70–80%
Mixtral-8x7B     Mistral-7B          65–75%
Codex-style      StarCoder-1B        60–70%

Stronger draft models get more of their tokens accepted but cost more to run. The sweet spot is typically a model 20–100x smaller than the target.
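The trade-off between acceptance and draft cost can be estimated with a simple model. The sketch below assumes each draft token is accepted independently with probability alpha, and that one draft step costs cost_ratio times one target pass. Both parameter names and the independence assumption are simplifications made for this illustration, not measurements:

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    # 1 + alpha + ... + alpha**gamma: the k-th draft token only counts
    # if all k tokens up to and including it were accepted.
    return sum(alpha ** k for k in range(gamma + 1))


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    # One cycle costs gamma draft steps plus one target pass,
    # measured in units of a single target pass.
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1)


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    # Draft length that maximises the estimated speedup under this model.
    return max(range(1, max_gamma + 1), key=lambda g: estimated_speedup(alpha, g, cost_ratio))


for alpha in (0.6, 0.8, 0.9):
    g = best_gamma(alpha, cost_ratio=0.05)
    print(alpha, g, round(estimated_speedup(alpha, g, 0.05), 2))
```

Under this model a higher acceptance rate justifies a longer draft, while a more expensive draft model pushes the best γ down.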

N-gram Speculative Decoding (No Draft Model Needed)

If you cannot deploy a second model, you can use n-gram lookup as the draft source. The idea: cache previously generated sequences and reuse them as draft tokens when the context matches:

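from collections import Counter


class NGramDraftGenerator:
    def __init__(self, max_ngram: int = 8):
        # Maps an n-gram (tuple of token ids) to counts of the tokens seen right after it.
        self.cache: dict[tuple[int, ...], Counter] = {}
        self.max_ngram = max_ngram

    def update(self, tokens: list[int]) -> None:
        # Record every n-gram of length 2..max_ngram together with its following token.
        for n in range(2, self.max_ngram + 1):
            for i in range(len(tokens) - n):
                key = tuple(tokens[i:i + n])
                self.cache.setdefault(key, Counter())[tokens[i + n]] += 1

    def draft(self, context: list[int], max_tokens: int = 5) -> list[int]:
        drafts: list[int] = []
        current = list(context)
        for _ in range(max_tokens):
            followers = None
            # Back off from the longest suffix to the shortest until one is cached.
            for n in range(min(self.max_ngram, len(current)), 1, -1):
                followers = self.cache.get(tuple(current[-n:]))
                if followers:
                    break
            if not followers:
                break
            # Propose the token that most often followed this n-gram.
            next_token = followers.most_common(1)[0][0]
            drafts.append(next_token)
            current.append(next_token)
        return drafts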

This approach is model-free but has lower acceptance rates (40–60% typically). It works well for code generation where repetitive patterns are common.

Performance Results

On A100 hardware with Llama-3.1-70B (the 16-bit weights are about 140 GB, so fitting on a single 80 GB card requires quantization):

  • Standard decoding: ~15 tokens/sec
  • Speculative (γ=5, 68M draft): ~40–45 tokens/sec
  • Speculative (γ=8, 1B draft): ~50–55 tokens/sec

The speedup comes from amortizing the expensive target model forward pass across multiple tokens: the more tokens accepted per pass, the higher the throughput.

When Speculative Decoding Shines

  • Code generation: highly predictable token sequences yield 80%+ acceptance rates
  • Translation: deterministic output patterns favor drafting
  • Long-form writing: consistent style and vocabulary improve draft accuracy
  • Small-batch, latency-sensitive serving: decoding is memory-bound at low batch sizes, so verification uses compute that would otherwise sit idle

When It Struggles

  • Creative/temperature-heavy sampling: high temperature reduces draft-target agreement
  • Very short outputs: overhead dominates when generating few tokens
  • Highly diverse domains: if the draft model lacks domain knowledge, acceptance drops
  • Large batches: once batching already saturates the GPU's compute, verifying extra draft tokens competes for the same resources and the gain shrinks

Summary

Speculative decoding is one of the few inference optimizations that provides real speedups with zero quality degradation. By pairing a small draft model with a large target model, you can generate text 2x–4x faster on the same hardware. Both vLLM and Hugging Face Transformers make it accessible with minimal configuration changes.

The technique also changes how we think about decoding: instead of accepting one target forward pass per token as inevitable, it asks whether the large model really needs to generate every token sequentially. The answer, draft cheaply and verify in parallel, opens the door to even more aggressive speculation strategies like lookahead decoding and tree-based speculative sampling.