AI Engineering Academy · Lesson

What Is a Token?

Use the tiktoken library to tokenize real text and discover how words, punctuation, and whitespace map to token sequences in different models.

LLMs Process Tokens, Not Words

When you send text to an LLM, the model does not see characters or words; it sees tokens. A token is a chunk of text that the tokenizer's vocabulary maps to a single integer ID. Depending on the tokenizer, a token can be a whole word, part of a word, a punctuation mark, or whitespace.

Understanding tokens matters in practice for three reasons: OpenAI prices API usage per token, the context window is measured in tokens, and the maximum response length is capped by the max_tokens parameter. Surprises in cost and behavior often trace back to a misunderstanding of how your text tokenizes.
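The arithmetic behind those three constraints can be sketched in a few lines. The function names, the context window size, and the prices below are hypothetical placeholders chosen for illustration, not real OpenAI figures; check the current pricing and model documentation for actual values.

```python
def fits_in_context(prompt_tokens: int, max_tokens: int, context_window: int) -> bool:
    """Check that the prompt plus the reserved response budget fits the window."""
    return prompt_tokens + max_tokens <= context_window


def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the cost of one request from its token counts."""
    input_cost = prompt_tokens * input_price_per_million / 1_000_000
    output_cost = completion_tokens * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Hypothetical numbers, for illustration only.
CONTEXT_WINDOW = 8_000   # assumed window size in tokens
INPUT_PRICE = 1.00       # assumed price per million input tokens
OUTPUT_PRICE = 2.00      # assumed price per million output tokens

ok = fits_in_context(prompt_tokens=1_200, max_tokens=300, context_window=CONTEXT_WINDOW)
cost = estimate_cost(1_200, 300, INPUT_PRICE, OUTPUT_PRICE)
print(ok, f"{cost:.4f}")
```

In a real application the token counts would come from a tokenizer such as tiktoken or from the usage field of the API response rather than being typed in by hand.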

How Tokenization Works: BPE

GPT models use Byte Pair Encoding (BPE) tokenization. BPE starts with a vocabulary of individual bytes and repeatedly merges the most frequent adjacent pair into a new token until the vocabulary reaches the desired size. That size depends on the encoding: cl100k_base, used by GPT-3.5 and GPT-4, has about 100,000 tokens, while o200k_base, used by GPT-4o, has about 200,000.
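The merge loop can be illustrated with a toy trainer. This is a simplified sketch of the idea, not OpenAI's implementation: production tokenizers first split text with a regular expression and train on far larger corpora. The function name train_bpe and the sample corpus are made up for this example.

```python
from collections import Counter


def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to `num_merges` merges from `corpus`, starting from single bytes."""
    tokens = [bytes([b]) for b in corpus]  # initial vocabulary: individual bytes
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens in the current sequence.
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not help
        merges.append((left, right))

        # Replace each occurrence of the chosen pair with one merged token.
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


corpus = b"low low low lower lowest"
merges, tokens = train_bpe(corpus, num_merges=3)
print(merges)
print(tokens)
```

Each learned merge adds one entry to the vocabulary, so running more merges trades a larger vocabulary for shorter token sequences.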

The result is that common words become single tokens (hello is one token), while rare or made-up words are split into multiple subword tokens (subaqueous might become three tokens: sub, aque, ous). This lets the model handle any text, even words it has never seen, by composing them from smaller familiar pieces.
