Part V — NLP Foundations

§ 5.3 Tokenization Deep Dive

BPE, WordPiece, Unigram LM, byte-level encoding, Tiktoken — how raw text becomes integer IDs and why that step shapes everything downstream.

1. Overview

Before a transformer sees a single character of text, that text has been converted to a sequence of integer token IDs. The tokenizer decides the vocabulary size V, the sequence length T, and which surface forms are treated as atomic units. Every downstream property — generation speed, multilingual fairness, arithmetic reasoning, context-window utilisation — traces back to this invisible pre-processing step. Three subword families dominate: BPE (used by GPT-2 through GPT-4 and LLaMA 1/2), WordPiece (BERT), and Unigram LM (T5, LLaMA 3 via SentencePiece).

The choice of tokenizer is not cosmetic. LLaMA 3 upgraded from a 32 000-token SentencePiece vocabulary (LLaMA 2) to 128 000 tokens specifically to improve efficiency on code and multilingual text — the same document produces roughly 3× fewer tokens, directly enabling longer effective context for the same sequence length budget.

2. Vocab Size vs Sequence Length

Vocabulary size and sequence length are in tension: a small vocabulary keeps the embedding table small but forces long token sequences (more attention FLOPs, worse long-range memory); a large vocabulary compresses sequences but increases embedding-table memory and softmax cost, and every rare token gets fewer gradient updates. The practical sweet spot is 30 000 – 128 000 subword tokens.

Model	Tokenizer	Vocab size	Notes
GPT-2	Byte-level BPE	50 257	r50k_base; 40 000 merges + 256 bytes + <\|endoftext\|>
GPT-3 / InstructGPT	Byte-level BPE	50 257	Same r50k_base vocab as GPT-2
GPT-3.5 / GPT-4	Byte-level BPE	100 277	cl100k_base; 100 000 merges + <\|endoftext\|> specials
GPT-4o / o-series	Byte-level BPE	200 019	o200k_base; more efficient for code + multilingual
BERT (base/large)	WordPiece	30 522	## prefix for continuation tokens; [UNK] still present
T5	Unigram LM (SP)	32 000	SentencePiece; sentence-piece prefix ▁
LLaMA 1 / 2	Unigram LM (SP)	32 000	SentencePiece; separate byte fallback tokens
LLaMA 3 / 3.1	Byte-level BPE (Tiktoken)	128 000	128 000 base + 256 special tokens; tiktoken format
Mistral / Mixtral	Unigram LM (SP)	32 000	Same as LLaMA 2 tokenizer
DeepSeek-V3	Byte-level BPE	102 400	100 000 BPE merges + extra special tokens
Qwen 2.5	Byte-level BPE (Tiktoken)	151 936	Large vocab improves code + multilingual efficiency

3. BPE — Byte Pair Encoding

Sennrich et al. (2016, arXiv:1508.07909) adapted the data-compression BPE algorithm for NLP: start with a vocabulary of individual characters and iteratively merge the most frequently co-occurring adjacent pair into a new symbol. After N merges the vocabulary has |chars| + N symbols. Encoding a new word applies these merges in order, greedily, left to right.

Core Mechanism — BPE Training

Background: We have a corpus of words with frequencies. Rare words share substrings with common words (e.g., "newest" shares "est" with "widest"). BPE finds these shared structures bottom-up by frequency.

Plan:

Split every word in the corpus into characters; append an end-of-word marker </w>. Count word frequencies.
Count every adjacent pair of symbols across all word instances (weighted by word frequency).
Merge the highest-frequency pair into a new symbol everywhere in the corpus.
Add the new symbol to the vocabulary. Repeat until target vocabulary size is reached.
To encode a new word: apply the learned merge rules in order, greedily.

Example walkthrough — initial corpus: {low×5, lower×2, newest×6, widest×3}

Merge #	Best pair	Freq	New symbol	State of "newest"
0	—	—	—	n e w e s t </w>
1	(e, s)	9	es	n e w es t </w>
2	(es, t)	9	est	n e w est </w>
3	(est, </w>)	9	est</w>	n e w est</w>
4	(n, e)	6	ne	ne w est</w>
5	(ne, w)	6	new	new est</w>
6	(new, est</w>)	6	newest</w>	newest</w>

After just 6 merges, "newest" is a single token — the model doesn't need to reconstruct it from parts. But crucially, "est" and "new" are also tokens in the vocabulary, so rare words like "tallest" or "newer" get reasonable segmentations even if they were absent from training data.

BPE Encoding at Inference

# Encoding "lower" given merges from the example above:
Initial:   l  o  w  e  r  </w>
Merge 1 (e,s)→es:  no change  (no 'e','s' adjacent in "lower")
Merge 2 (es,t)→est:  no change
Merge 3 (est,</w>)→est</w>:  no change
Merge 7 (l,o)→lo:  lo  w  e  r  </w>
Merge 8 (lo,w)→low:  low  e  r  </w>
Merge 9 (e,r)→er:  low  er  </w>
Merge 10 (er,</w>)→er</w>:  low  er</w>

Final tokens:  ["low", "er</w>"]   — 2 tokens

4. WordPiece — BERT's Tokenizer

Schuster & Nakamura (2012) and Wu et al. (2016, arXiv:1609.08144) introduced WordPiece for Google's neural machine translation. The training algorithm differs from BPE in how it scores merge candidates: instead of raw pair frequency, it uses a likelihood gain criterion — merge the pair (A, B) that most increases the language-model log-likelihood of the corpus:

BPE    score(A, B) = count(AB)
WordPiece score(A, B) = count(AB) / (count(A) × count(B))

Intuition: WordPiece penalises merging two very frequent tokens
(e.g., "th" + "e" → "the"), preferring pairs where co-occurrence
exceeds the independence baseline — i.e., tokens that are
strongly mutually predictive.

At encoding time, WordPiece uses a longest-match-first strategy: find the longest prefix of the remaining string that is in the vocabulary; emit it; repeat. Continuation tokens (not at the start of a word) are prefixed with ## to distinguish from initial tokens.

BERT tokenizes "unbelievability":
  un  ##believ  ##abil  ##ity
  ↑ word start   ↑ continuations prefixed with ##

BERT vocab (30 522 tokens) is fixed — modern practice is to
retrain WordPiece from scratch for each domain or language.

5. Unigram LM — SentencePiece

Kudo (2018, arXiv:1804.10959) inverts the BPE/WordPiece approach: instead of starting small and growing, start with a large vocabulary (all possible substrings up to some length, plus the full words) and prune it. At each step, remove the token whose removal causes the smallest increase in total corpus negative log-likelihood under a unigram language model. The EM algorithm alternates between:

E-step: Compute each word's optimal segmentation using the Viterbi algorithm over the current vocabulary probabilities.
M-step: Update unigram probabilities from the expected token counts across all segmentations.
Prune the bottom η% (typically 20%) of tokens that would be missed least if removed. Repeat.

The key advantage: because the model is probabilistic, multiple segmentations of the same word are valid. During training, the model can randomly sample a non-optimal segmentation (subword regularisation), acting as a form of data augmentation that makes the model more robust to how words are split.

SentencePiece (Kudo & Richardson 2018) is the library that implements both BPE and Unigram LM with a unified API. It also handles raw bytes as a fallback and adds a space prefix ▁ (U+2581) to mark word boundaries, making the tokenizer language-agnostic (no whitespace assumptions for Chinese/Japanese).

6. Byte-Level BPE — GPT-2 Onward

Radford et al. (GPT-2, 2019) changed the base alphabet from characters (Unicode codepoints) to bytes. Every UTF-8 string can be expressed as a sequence of 256 possible bytes, so the model can always represent any text — no [UNK] token is ever needed. BPE merges are then performed on byte sequences rather than character sequences.

GPT-2 tokenizer (r50k_base):
  Base vocab: 256 bytes
  Plus:       40 000 BPE merges
  Total:      50 256 + 1 special (<|endoftext|>) = 50 257 tokens

cl100k_base (GPT-3.5 / GPT-4):
  Total:      100 277 tokens (100 000 merges)
  Splits digits individually: "100" → ["1","0","0"]  ← intentional

o200k_base (GPT-4o / o-series):
  Total:      200 019 tokens
  More aggressive merging for code and multilingual text

Tiktoken (OpenAI) is the fast Rust-backed Python library that encodes text using these vocabularies. It is roughly 3–6× faster than the HuggingFace Python tokenizer for the same vocabulary due to the Rust core and regex-based pre-tokenization.

7. BPE vs WordPiece vs Unigram — Side-by-Side

Property	BPE	WordPiece	Unigram LM
Training start	Characters	Characters + ## prefix	All possible substrings
Merge criterion	max count(A,B)	max count(AB)/(count(A)×count(B))	min ΔNLLcorpus
Training direction	Bottom-up (add)	Bottom-up (add)	Top-down (prune)
Encoding	Apply merges L→R in order	Longest-match-first	Viterbi / greedy argmax
Multiple segs?	No (deterministic)	No (deterministic)	Yes (sample for regularisation)
UNK handling	Byte-level variant: none	Still has [UNK]	Byte fallback in SP
Word boundary	End-of-word token </w>	## prefix for continuation	▁ prefix for word start
Example: 'unbelievability'	un bel iev abil ity	un ##believ ##abil ##ity	▁un believ ability

8. Tokenization Pitfalls

8.1 Leading-space sensitivity

GPT-2 / GPT-4 tokenizers attach leading whitespace to word tokens. "cat" and " cat" are different token IDs. This means the token stream encodes position within a sentence implicitly, but it also means prompts must match training formatting exactly or the model sees an out-of-distribution token.

tiktoken.encode("cat")     → [9246]           # no leading space
tiktoken.encode(" cat")    → [3797]           # leading space: different ID!
tiktoken.encode("Cat")     → [35, 265]        # capital C splits differently

8.2 Digit splitting

cl100k_base (GPT-4) deliberately splits each digit into its own token. "12345" becomes ["1","2","3","4","5"]. This was an intentional design choice to let the model learn positional arithmetic more easily — the model can attend to individual digit positions. However it also means long numbers tokenise very inefficiently.

8.3 Multilingual fairness

A tokenizer trained on English-dominant data produces short tokens for common English words but long character sequences for non-Latin scripts. A study by Ahia et al. (2023) found that Yoruba requires ~7× more tokens than English for the same amount of information — meaning lower-resource languages are penalised in context-window utilisation and inference cost. LLaMA 3's 128K vocabulary was partly motivated by closing this gap.

Language	Text (same meaning)	cl100k tokens	Relative cost
English	"The cat sat on the mat"	7	1×
French	"Le chat était assis sur le tapis"	9	1.3×
Chinese	"猫坐在垫子上"	7	1×
Arabic	"جلس القط على الحصيرة"	14	2×
Yoruba	"Ológìnni jókòó lórí ìtẹ̀"	48	≈7×

8.4 Code tokenization

Indentation-heavy languages (Python, YAML) tokenise poorly: 4 spaces may not be a single token, meaning the model must count tokens rather than reading structural whitespace. cl100k and o200k improved this by merging common indentation sequences. In CodeLLaMA and DeepSeek-Coder, the tokenizer was specifically retrained with code-heavy corpora.

9. Minimal Demos

Demo A — BPE from scratch: char vs word vs BPE token counts

Trains 20 BPE merges on a tiny English corpus, then tokenizes several test words. Watch how "lowest" collapses from 6 character tokens to 2 BPE tokens.

BPE Tokenizer from Scratch — C Demo

stdin (optional)

Demo B — Train BPE with configurable vocab size

Enter a target vocabulary size. The demo shows every merge step, then applies the learned rules to tokenize "newest", "widest", and "lower". Try 15 vs 30 to see how word-complete tokens emerge as merges accumulate.

Train Your Own BPE — C Demo

stdin (optional)

10. Production / Source Pointers

Library / File	Key symbol	What it does
tiktoken/core.py	Encoding.encode()	Fast BPE encoding via Rust core; supports cl100k/o200k/r50k
tiktoken/_educational.py	SimpleBytePairEncoding	Pure-Python reference implementation of GPT-2 BPE (great reading)
huggingface/tokenizers (Rust)	BpeTrainer, models/BPE	Training and encoding BPE; tokenizers.train_from_iterator()
huggingface/tokenizers	models/WordPiece, WordPieceTrainer	BERT-style WordPiece train + encode
google/sentencepiece	SentencePieceTrainer.train()	Train Unigram LM or BPE tokenizer from raw text corpus
sentencepiece/src/bpe_model.cc	BpeModel::Encode()	C++ implementation of greedy BPE merge at inference time
transformers/tokenization_utils_fast.py	PreTrainedTokenizerFast	Wraps HF tokenizers Rust core; used by all modern HF models
Karpathy/minbpe (GitHub)	regex.py, basic.py	Pedagogical BPE from scratch; highly readable ~300 lines total

11. References

Papers

Sennrich, Haddow, Birch (2016) — Neural Machine Translation of Rare Words with Subword Units (BPE for NLP). arXiv:1508.07909
Wu et al. (2016) — Google's Neural Machine Translation System (WordPiece). arXiv:1609.08144
Kudo (2018) — Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Unigram LM). arXiv:1804.10959
Kudo & Richardson (2018) — SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. arXiv:1808.06226
Radford et al. (2019) — Language Models are Unsupervised Multitask Learners (GPT-2; introduces byte-level BPE). OpenAI blog.
Ahia et al. (2023) — Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. arXiv:2305.13707 (multilingual fairness)
Rust et al. (2021) — How Good is Your Tokenizer? arXiv:2012.15613

Lectures

Stanford CS224n (Manning) — Lecture 2: Word Vectors; Lecture 12: Subword Models
Hugging Face NLP Course — Chapter 6: The Tokenizers Library (covers BPE / WordPiece / Unigram training hands-on)
Karpathy — Let's build the GPT Tokenizer (YouTube, 2024, ~2 hrs; implements tiktoken-compatible BPE from scratch)
CMU 11-711 (Graham Neubig) — Advanced NLP, Lecture on Tokenization and Subword Models

Textbooks

Jurafsky & Martin — Speech and Language Processing, 3rd ed. (draft), Ch. 2: Regular Expressions, Text Normalization, Edit Distance
Eisenstein — Introduction to Natural Language Processing, Ch. 2: Linear Text Classification (covers tokenization, n-grams)

Code / Repos

openai/tiktoken — Production BPE tokenizer (Rust + Python); cl100k / o200k / r50k vocabs
karpathy/minbpe — Minimal BPE from scratch in ~300 lines; ideal for learning
huggingface/tokenizers — Rust-backed tokenizer library; BPE / WordPiece / Unigram all supported
google/sentencepiece — C++ library for Unigram LM and BPE; used by T5, LLaMA 2, Mistral

Blog Posts

Hugging Face — Summary of the Tokenizers (comprehensive overview of all three algorithms with worked examples)
Lena Voita — NLP Course: Text Classification → Subword Tokenization (interactive visualisations)
Jay Alammar — The Illustrated GPT-2 (Section on tokenization pipeline)
Lei Mao — Byte Pair Encoding (step-by-step with implementation)

12. Interview Prep

1. Why does BPE produce a more efficient tokenization than word-level for English?

Word-level vocabularies need millions of entries to cover rare words, conjugations, and compounds — and still produce UNK for anything unseen. BPE shares tokens across morphologically related words ("run", "running", "runner" all use subword "run"), reducing vocabulary size by 10-100× while maintaining near-zero OOV. It also handles code, URLs, and multilingual text without a separate OOV mechanism.

2. What is the exact difference between BPE and WordPiece in how they choose which pair to merge?

BPE merges the pair with the highest raw co-occurrence count: score(A,B) = count(AB). WordPiece normalises by individual token counts: score(A,B) = count(AB) / (count(A) × count(B)), which is a mutual-information-like ratio. WordPiece will not merge two very common tokens (like 'th' + 'e') unless they co-occur far more than would be expected by independence — it prefers linguistically meaningful units.

3. Compute the perplexity of a uniform distribution over a 50 257-token vocabulary.

A uniform LM assigns probability 1/V = 1/50257 to every token. Perplexity = exp(H) where H = log(V) = log(50257) ≈ 10.83 nats = 15.6 bits. So perplexity ≈ exp(10.83) ≈ 50 257. Perplexity of a uniform distribution equals the vocabulary size.

4. Why does cl100k_base (GPT-4) split numbers digit-by-digit, and what problem does that solve?

Splitting "12345" into ["1","2","3","4","5"] gives the model explicit positional access to each digit, making column-aligned arithmetic easier to learn. When numbers were merged (e.g., "100" as one token), addition required the model to memorize digit compositions internally; digit-splitting moves this burden to the attention mechanism, which handles positional reasoning better.

5. What is byte-level BPE and why does it eliminate the UNK token?

Byte-level BPE uses the 256 possible byte values (0x00–0xFF) as the base vocabulary instead of Unicode characters. Any UTF-8 string — including emoji, CJK characters, arbitrary code, and even corrupted text — can be expressed as a sequence of bytes. Since all 256 bytes are in the vocabulary, no input can produce an UNK token. Merges then happen on byte pairs, gradually combining common byte sequences into whole words or subwords.

6. How does Unigram LM tokenization differ architecturally from BPE, and what is subword regularisation?

BPE builds vocabulary bottom-up by greedy frequency merges; encoding is deterministic (apply merges in order). Unigram LM starts with a large over-complete vocabulary and prunes it top-down by EM, assigning probabilities to each token. Encoding uses Viterbi (or sampling) over these probabilities. Subword regularisation exploits that multiple valid segmentations exist: during training, randomly sample a non-MAP segmentation, exposing the model to different splits of the same word — acting as data augmentation that makes the model robust to tokenization choices.

7. Why is tokenization unfair across languages? Name one mitigation.

Tokenizers trained on English-heavy corpora learn long merged tokens for English words but fall back to individual bytes/chars for low-resource scripts — effectively penalising those languages with 3–10× more tokens for the same semantic content. Mitigations: (1) train the tokenizer on a balanced multilingual corpus (LLaMA 3's 128K tokenizer); (2) oversample underrepresented scripts in the merge corpus; (3) use byte-fallback to ensure coverage while adding language-specific high-frequency tokens.

8. What is the SentencePiece ▁ (U+2581) prefix, and how does it differ from BERT's ## marker?

SentencePiece adds ▁ (lower one-eighth block) as a word-start marker: a token ▁cat appears at the start of a word, cat without ▁ is a continuation. This is language-agnostic — no whitespace required, so Chinese/Japanese work without changes. BERT inverts this: tokens without ## are word-initial, ## prefix marks continuation. Both encode the same information (word boundary) but with opposite polarity.