Top-p Nucleus Sampling
How top-p restricts sampling to the most probable token set.
What Is Top-p Sampling?
Top-p sampling (also called nucleus sampling) is a technique that restricts sampling to a dynamic subset of the vocabulary. Instead of sampling from all tokens, the model considers only the smallest set of tokens whose cumulative probability is at least p.
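To make the "dynamic subset" idea concrete, here is a minimal sketch that counts how many tokens fall inside the nucleus for two made-up distributions. The helper name `nucleus_size` and the probability values are illustrative assumptions, not part of any library.

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Return how many of the most probable tokens are needed to reach mass p."""
    sorted_probs = np.sort(np.asarray(probs, dtype=float))[::-1]  # descending
    cumulative = np.cumsum(sorted_probs)
    # First position where the running total is >= p, plus one to include it
    size = int(np.searchsorted(cumulative, p)) + 1
    return min(size, len(sorted_probs))

# Hypothetical distributions over a 4-token vocabulary
peaked = [0.9375, 0.03125, 0.015625, 0.015625]  # model is confident
flat = [0.25, 0.25, 0.25, 0.25]                 # model is uncertain

print(nucleus_size(peaked, p=0.9))  # confident case: a single token suffices
print(nucleus_size(flat, p=0.9))    # uncertain case: every token is kept
```

With the same p, the nucleus shrinks when the model is confident and grows when it is uncertain, which is what distinguishes top-p from a fixed-size cutoff.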
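Proposed in the paper 'The Curious Case of Neural Text Degeneration' (Holtzman et al., 2019), it truncates the unreliable low-probability tail of the distribution. The authors report that it gives the best overall balance of diversity and coherence among the decoding strategies they compare, including sampling with temperature.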
How Top-p Works Step by Step
Algorithm:
- Compute softmax probabilities over the full vocabulary
- Sort tokens by probability (highest first)
- Walk down the sorted list, accumulating probability, until the cumulative sum reaches p
- This set of tokens is the nucleus
- Sample from the nucleus only (renormalize probabilities to sum to 1)
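The steps above can be traced by hand on a tiny distribution. This is only a sketch: the five probabilities and the threshold p = 0.8 are made-up values chosen so the arithmetic is easy to check.

```python
import numpy as np

# Hypothetical 5-token distribution, already sorted from most to least probable
probs = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
p = 0.8

# Running total: 0.5, 0.75, 0.875, 0.9375, 1.0
cumulative = np.cumsum(probs)

# First position where the running total reaches p, plus one to include it
size = int(np.searchsorted(cumulative, p)) + 1

# The nucleus is the first three tokens, holding 0.875 of the total mass
nucleus = probs[:size]

# Renormalize so the nucleus sums to 1 (roughly 0.571, 0.286, 0.143)
renormalized = nucleus / nucleus.sum()

print(size, renormalized.round(3))
```

The two tokens with probability 0.0625 are excluded and can never be sampled, while the relative ordering of the remaining three is unchanged.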
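import numpy as np

# Standard softmax with temperature; assumed helper used by top_p_sample
def softmax(logits, temperature=1.0):
    scaled = np.asarray(logits, dtype=float) / temperature
    # Subtract the max before exponentiating for numerical stability
    exps = np.exp(scaled - scaled.max())
    return exps / exps.sum()

def top_p_sample(logits, p=0.9):
    probs = softmax(logits, temperature=1.0)
    # Sort by probability descending
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    # Find nucleus: smallest set with cumulative prob >= p
    cumulative = np.cumsum(sorted_probs)
    # Clamp in case rounding keeps the cumulative sum just below p
    nucleus_size = min(np.searchsorted(cumulative, p) + 1, len(sorted_probs))
    nucleus_indices = sorted_indices[:nucleus_size]
    nucleus_probs = sorted_probs[:nucleus_size]
    # Renormalize so the nucleus probabilities sum to 1
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    # Sample one token index from the nucleus
    chosen = np.random.choice(nucleus_indices, p=nucleus_probs)
    return chosen

# Example logits (illustrative values) for the call below
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])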
token = top_p_sample(logits, p=0.9)