§ 5.3 Tokenization Deep Dive
BPE, WordPiece, Unigram LM, byte-level encoding, Tiktoken — how raw text becomes integer IDs and why that step shapes everything downstream.
1. Overview
Before a transformer sees a single character of text, that text has been converted to a sequence of integer token IDs. The tokenizer decides the vocabulary size V, the sequence length T, and which surface forms are treated as atomic units. Every downstream property — generation speed, multilingual fairness, arithmetic reasoning, context-window utilisation — traces back to this invisible pre-processing step. Three subword families dominate: BPE (used by GPT-2 through GPT-4 and LLaMA 1/2), WordPiece (BERT), and Unigram LM (T5, LLaMA 3 via SentencePiece).
The choice of tokenizer is not cosmetic. LLaMA 3 upgraded from a 32 000-token SentencePiece vocabulary (LLaMA 2) to 128 000 tokens specifically to improve efficiency on code and multilingual text — the same document produces roughly 3× fewer tokens, directly enabling longer effective context for the same sequence length budget.
2. Vocab Size vs Sequence Length
Vocabulary size and sequence length are in tension: a small vocabulary keeps the embedding table small but forces long token sequences (more attention FLOPs, worse long-range memory); a large vocabulary compresses sequences but increases embedding-table memory and softmax cost, and every rare token gets fewer gradient updates. The practical sweet spot is 30 000 – 128 000 subword tokens.
| Model | Tokenizer | Vocab size | Notes |
|---|---|---|---|
| GPT-2 | Byte-level BPE | 50 257 | r50k_base; 40 000 merges + 256 bytes + <|endoftext|> |
| GPT-3 / InstructGPT | Byte-level BPE | 50 257 | Same r50k_base vocab as GPT-2 |
| GPT-3.5 / GPT-4 | Byte-level BPE | 100 277 | cl100k_base; 100 000 merges + <|endoftext|> specials |
| GPT-4o / o-series | Byte-level BPE | 200 019 | o200k_base; more efficient for code + multilingual |
| BERT (base/large) | WordPiece | 30 522 | ## prefix for continuation tokens; [UNK] still present |
| T5 | Unigram LM (SP) | 32 000 | SentencePiece; sentence-piece prefix ▁ |
| LLaMA 1 / 2 | Unigram LM (SP) | 32 000 | SentencePiece; separate byte fallback tokens |
| LLaMA 3 / 3.1 | Byte-level BPE (Tiktoken) | 128 000 | 128 000 base + 256 special tokens; tiktoken format |
| Mistral / Mixtral | Unigram LM (SP) | 32 000 | Same as LLaMA 2 tokenizer |
| DeepSeek-V3 | Byte-level BPE | 102 400 | 100 000 BPE merges + extra special tokens |
| Qwen 2.5 | Byte-level BPE (Tiktoken) | 151 936 | Large vocab improves code + multilingual efficiency |
3. BPE — Byte Pair Encoding
Sennrich et al. (2016, arXiv:1508.07909) adapted the data-compression BPE algorithm for NLP: start with a vocabulary of individual characters and iteratively merge the most frequently co-occurring adjacent pair into a new symbol. After N merges the vocabulary has |chars| + N symbols. Encoding a new word applies these merges in order, greedily, left to right.
Core Mechanism — BPE Training
Background: We have a corpus of words with frequencies. Rare words share substrings with common words (e.g., "newest" shares "est" with "widest"). BPE finds these shared structures bottom-up by frequency.
Plan:
- Split every word in the corpus into characters; append an end-of-word marker
</w>. Count word frequencies. - Count every adjacent pair of symbols across all word instances (weighted by word frequency).
- Merge the highest-frequency pair into a new symbol everywhere in the corpus.
- Add the new symbol to the vocabulary. Repeat until target vocabulary size is reached.
- To encode a new word: apply the learned merge rules in order, greedily.
Example walkthrough — initial corpus: {low×5, lower×2, newest×6, widest×3}
| Merge # | Best pair | Freq | New symbol | State of "newest" |
|---|---|---|---|---|
| 0 | — | — | — | n e w e s t </w> |
| 1 | (e, s) | 9 | es | n e w es t </w> |
| 2 | (es, t) | 9 | est | n e w est </w> |
| 3 | (est, </w>) | 9 | est</w> | n e w est</w> |
| 4 | (n, e) | 6 | ne | ne w est</w> |
| 5 | (ne, w) | 6 | new | new est</w> |
| 6 | (new, est</w>) | 6 | newest</w> | newest</w> |
After just 6 merges, "newest" is a single token — the model doesn't need to reconstruct it from parts. But crucially, "est" and "new" are also tokens in the vocabulary, so rare words like "tallest" or "newer" get reasonable segmentations even if they were absent from training data.
BPE Encoding at Inference
# Encoding "lower" given merges from the example above: Initial: l o w e r </w> Merge 1 (e,s)→es: no change (no 'e','s' adjacent in "lower") Merge 2 (es,t)→est: no change Merge 3 (est,</w>)→est</w>: no change Merge 7 (l,o)→lo: lo w e r </w> Merge 8 (lo,w)→low: low e r </w> Merge 9 (e,r)→er: low er </w> Merge 10 (er,</w>)→er</w>: low er</w> Final tokens: ["low", "er</w>"] — 2 tokens
4. WordPiece — BERT's Tokenizer
Schuster & Nakamura (2012) and Wu et al. (2016, arXiv:1609.08144) introduced WordPiece for Google's neural machine translation. The training algorithm differs from BPE in how it scores merge candidates: instead of raw pair frequency, it uses a likelihood gain criterion — merge the pair (A, B) that most increases the language-model log-likelihood of the corpus:
BPE score(A, B) = count(AB) WordPiece score(A, B) = count(AB) / (count(A) × count(B)) Intuition: WordPiece penalises merging two very frequent tokens (e.g., "th" + "e" → "the"), preferring pairs where co-occurrence exceeds the independence baseline — i.e., tokens that are strongly mutually predictive.
At encoding time, WordPiece uses a longest-match-first strategy: find the longest prefix of the remaining string that is in the vocabulary; emit it; repeat. Continuation tokens (not at the start of a word) are prefixed with ## to distinguish from initial tokens.
BERT tokenizes "unbelievability": un ##believ ##abil ##ity ↑ word start ↑ continuations prefixed with ## BERT vocab (30 522 tokens) is fixed — modern practice is to retrain WordPiece from scratch for each domain or language.
5. Unigram LM — SentencePiece
Kudo (2018, arXiv:1804.10959) inverts the BPE/WordPiece approach: instead of starting small and growing, start with a large vocabulary (all possible substrings up to some length, plus the full words) and prune it. At each step, remove the token whose removal causes the smallest increase in total corpus negative log-likelihood under a unigram language model. The EM algorithm alternates between:
- E-step: Compute each word's optimal segmentation using the Viterbi algorithm over the current vocabulary probabilities.
- M-step: Update unigram probabilities from the expected token counts across all segmentations.
- Prune the bottom η% (typically 20%) of tokens that would be missed least if removed. Repeat.
The key advantage: because the model is probabilistic, multiple segmentations of the same word are valid. During training, the model can randomly sample a non-optimal segmentation (subword regularisation), acting as a form of data augmentation that makes the model more robust to how words are split.
▁ (U+2581) to mark word boundaries, making the tokenizer language-agnostic (no whitespace assumptions for Chinese/Japanese).6. Byte-Level BPE — GPT-2 Onward
Radford et al. (GPT-2, 2019) changed the base alphabet from characters (Unicode codepoints) to bytes. Every UTF-8 string can be expressed as a sequence of 256 possible bytes, so the model can always represent any text — no [UNK] token is ever needed. BPE merges are then performed on byte sequences rather than character sequences.
GPT-2 tokenizer (r50k_base): Base vocab: 256 bytes Plus: 40 000 BPE merges Total: 50 256 + 1 special (<|endoftext|>) = 50 257 tokens cl100k_base (GPT-3.5 / GPT-4): Total: 100 277 tokens (100 000 merges) Splits digits individually: "100" → ["1","0","0"] ← intentional o200k_base (GPT-4o / o-series): Total: 200 019 tokens More aggressive merging for code and multilingual text
Tiktoken (OpenAI) is the fast Rust-backed Python library that encodes text using these vocabularies. It is roughly 3–6× faster than the HuggingFace Python tokenizer for the same vocabulary due to the Rust core and regex-based pre-tokenization.
7. BPE vs WordPiece vs Unigram — Side-by-Side
| Property | BPE | WordPiece | Unigram LM |
|---|---|---|---|
| Training start | Characters | Characters + ## prefix | All possible substrings |
| Merge criterion | max count(A,B) | max count(AB)/(count(A)×count(B)) | min ΔNLLcorpus |
| Training direction | Bottom-up (add) | Bottom-up (add) | Top-down (prune) |
| Encoding | Apply merges L→R in order | Longest-match-first | Viterbi / greedy argmax |
| Multiple segs? | No (deterministic) | No (deterministic) | Yes (sample for regularisation) |
| UNK handling | Byte-level variant: none | Still has [UNK] | Byte fallback in SP |
| Word boundary | End-of-word token </w> | ## prefix for continuation | ▁ prefix for word start |
| Example: 'unbelievability' | un bel iev abil ity | un ##believ ##abil ##ity | ▁un believ ability |
8. Tokenization Pitfalls
8.1 Leading-space sensitivity
GPT-2 / GPT-4 tokenizers attach leading whitespace to word tokens. "cat" and " cat" are different token IDs. This means the token stream encodes position within a sentence implicitly, but it also means prompts must match training formatting exactly or the model sees an out-of-distribution token.
tiktoken.encode("cat") → [9246] # no leading space
tiktoken.encode(" cat") → [3797] # leading space: different ID!
tiktoken.encode("Cat") → [35, 265] # capital C splits differently8.2 Digit splitting
cl100k_base (GPT-4) deliberately splits each digit into its own token. "12345" becomes ["1","2","3","4","5"]. This was an intentional design choice to let the model learn positional arithmetic more easily — the model can attend to individual digit positions. However it also means long numbers tokenise very inefficiently.
8.3 Multilingual fairness
A tokenizer trained on English-dominant data produces short tokens for common English words but long character sequences for non-Latin scripts. A study by Ahia et al. (2023) found that Yoruba requires ~7× more tokens than English for the same amount of information — meaning lower-resource languages are penalised in context-window utilisation and inference cost. LLaMA 3's 128K vocabulary was partly motivated by closing this gap.
| Language | Text (same meaning) | cl100k tokens | Relative cost |
|---|---|---|---|
| English | "The cat sat on the mat" | 7 | 1× |
| French | "Le chat était assis sur le tapis" | 9 | 1.3× |
| Chinese | "猫坐在垫子上" | 7 | 1× |
| Arabic | "جلس القط على الحصيرة" | 14 | 2× |
| Yoruba | "Ológìnni jókòó lórí ìtẹ̀" | 48 | ≈7× |
8.4 Code tokenization
Indentation-heavy languages (Python, YAML) tokenise poorly: 4 spaces may not be a single token, meaning the model must count tokens rather than reading structural whitespace. cl100k and o200k improved this by merging common indentation sequences. In CodeLLaMA and DeepSeek-Coder, the tokenizer was specifically retrained with code-heavy corpora.
9. Minimal Demos
Demo A — BPE from scratch: char vs word vs BPE token counts
Trains 20 BPE merges on a tiny English corpus, then tokenizes several test words. Watch how "lowest" collapses from 6 character tokens to 2 BPE tokens.
Demo B — Train BPE with configurable vocab size
Enter a target vocabulary size. The demo shows every merge step, then applies the learned rules to tokenize "newest", "widest", and "lower". Try 15 vs 30 to see how word-complete tokens emerge as merges accumulate.
10. Production / Source Pointers
| Library / File | Key symbol | What it does |
|---|---|---|
| tiktoken/core.py | Encoding.encode() | Fast BPE encoding via Rust core; supports cl100k/o200k/r50k |
| tiktoken/_educational.py | SimpleBytePairEncoding | Pure-Python reference implementation of GPT-2 BPE (great reading) |
| huggingface/tokenizers (Rust) | BpeTrainer, models/BPE | Training and encoding BPE; tokenizers.train_from_iterator() |
| huggingface/tokenizers | models/WordPiece, WordPieceTrainer | BERT-style WordPiece train + encode |
| google/sentencepiece | SentencePieceTrainer.train() | Train Unigram LM or BPE tokenizer from raw text corpus |
| sentencepiece/src/bpe_model.cc | BpeModel::Encode() | C++ implementation of greedy BPE merge at inference time |
| transformers/tokenization_utils_fast.py | PreTrainedTokenizerFast | Wraps HF tokenizers Rust core; used by all modern HF models |
| Karpathy/minbpe (GitHub) | regex.py, basic.py | Pedagogical BPE from scratch; highly readable ~300 lines total |
11. References
Papers
- Sennrich, Haddow, Birch (2016) — Neural Machine Translation of Rare Words with Subword Units (BPE for NLP). arXiv:1508.07909
- Wu et al. (2016) — Google's Neural Machine Translation System (WordPiece). arXiv:1609.08144
- Kudo (2018) — Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Unigram LM). arXiv:1804.10959
- Kudo & Richardson (2018) — SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. arXiv:1808.06226
- Radford et al. (2019) — Language Models are Unsupervised Multitask Learners (GPT-2; introduces byte-level BPE). OpenAI blog.
- Ahia et al. (2023) — Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. arXiv:2305.13707 (multilingual fairness)
- Rust et al. (2021) — How Good is Your Tokenizer? arXiv:2012.15613
Lectures
- Stanford CS224n (Manning) — Lecture 2: Word Vectors; Lecture 12: Subword Models
- Hugging Face NLP Course — Chapter 6: The Tokenizers Library (covers BPE / WordPiece / Unigram training hands-on)
- Karpathy — Let's build the GPT Tokenizer (YouTube, 2024, ~2 hrs; implements tiktoken-compatible BPE from scratch)
- CMU 11-711 (Graham Neubig) — Advanced NLP, Lecture on Tokenization and Subword Models
Textbooks
- Jurafsky & Martin — Speech and Language Processing, 3rd ed. (draft), Ch. 2: Regular Expressions, Text Normalization, Edit Distance
- Eisenstein — Introduction to Natural Language Processing, Ch. 2: Linear Text Classification (covers tokenization, n-grams)
Code / Repos
openai/tiktoken— Production BPE tokenizer (Rust + Python); cl100k / o200k / r50k vocabskarpathy/minbpe— Minimal BPE from scratch in ~300 lines; ideal for learninghuggingface/tokenizers— Rust-backed tokenizer library; BPE / WordPiece / Unigram all supportedgoogle/sentencepiece— C++ library for Unigram LM and BPE; used by T5, LLaMA 2, Mistral
Blog Posts
- Hugging Face — Summary of the Tokenizers (comprehensive overview of all three algorithms with worked examples)
- Lena Voita — NLP Course: Text Classification → Subword Tokenization (interactive visualisations)
- Jay Alammar — The Illustrated GPT-2 (Section on tokenization pipeline)
- Lei Mao — Byte Pair Encoding (step-by-step with implementation)
12. Interview Prep
1. Why does BPE produce a more efficient tokenization than word-level for English?
Word-level vocabularies need millions of entries to cover rare words, conjugations, and compounds — and still produce UNK for anything unseen. BPE shares tokens across morphologically related words ("run", "running", "runner" all use subword "run"), reducing vocabulary size by 10-100× while maintaining near-zero OOV. It also handles code, URLs, and multilingual text without a separate OOV mechanism.
2. What is the exact difference between BPE and WordPiece in how they choose which pair to merge?
BPE merges the pair with the highest raw co-occurrence count: score(A,B) = count(AB). WordPiece normalises by individual token counts: score(A,B) = count(AB) / (count(A) × count(B)), which is a mutual-information-like ratio. WordPiece will not merge two very common tokens (like 'th' + 'e') unless they co-occur far more than would be expected by independence — it prefers linguistically meaningful units.
3. Compute the perplexity of a uniform distribution over a 50 257-token vocabulary.
A uniform LM assigns probability 1/V = 1/50257 to every token. Perplexity = exp(H) where H = log(V) = log(50257) ≈ 10.83 nats = 15.6 bits. So perplexity ≈ exp(10.83) ≈ 50 257. Perplexity of a uniform distribution equals the vocabulary size.
4. Why does cl100k_base (GPT-4) split numbers digit-by-digit, and what problem does that solve?
Splitting "12345" into ["1","2","3","4","5"] gives the model explicit positional access to each digit, making column-aligned arithmetic easier to learn. When numbers were merged (e.g., "100" as one token), addition required the model to memorize digit compositions internally; digit-splitting moves this burden to the attention mechanism, which handles positional reasoning better.
5. What is byte-level BPE and why does it eliminate the UNK token?
Byte-level BPE uses the 256 possible byte values (0x00–0xFF) as the base vocabulary instead of Unicode characters. Any UTF-8 string — including emoji, CJK characters, arbitrary code, and even corrupted text — can be expressed as a sequence of bytes. Since all 256 bytes are in the vocabulary, no input can produce an UNK token. Merges then happen on byte pairs, gradually combining common byte sequences into whole words or subwords.
6. How does Unigram LM tokenization differ architecturally from BPE, and what is subword regularisation?
BPE builds vocabulary bottom-up by greedy frequency merges; encoding is deterministic (apply merges in order). Unigram LM starts with a large over-complete vocabulary and prunes it top-down by EM, assigning probabilities to each token. Encoding uses Viterbi (or sampling) over these probabilities. Subword regularisation exploits that multiple valid segmentations exist: during training, randomly sample a non-MAP segmentation, exposing the model to different splits of the same word — acting as data augmentation that makes the model robust to tokenization choices.
7. Why is tokenization unfair across languages? Name one mitigation.
Tokenizers trained on English-heavy corpora learn long merged tokens for English words but fall back to individual bytes/chars for low-resource scripts — effectively penalising those languages with 3–10× more tokens for the same semantic content. Mitigations: (1) train the tokenizer on a balanced multilingual corpus (LLaMA 3's 128K tokenizer); (2) oversample underrepresented scripts in the merge corpus; (3) use byte-fallback to ensure coverage while adding language-specific high-frequency tokens.
8. What is the SentencePiece ▁ (U+2581) prefix, and how does it differ from BERT's ## marker?
SentencePiece adds ▁ (lower one-eighth block) as a word-start marker: a token ▁cat appears at the start of a word, cat without ▁ is a continuation. This is language-agnostic — no whitespace required, so Chinese/Japanese work without changes. BERT inverts this: tokens without ## are word-initial, ## prefix marks continuation. Both encode the same information (word boundary) but with opposite polarity.