§ 5.1 Pre-Neural NLP: n-grams, HMM, CRF, TF-IDF
Bag-of-words, n-gram language models, smoothing, perplexity, HMM POS tagging, and CRF sequence labeling — the foundations that transformers quietly replaced but never made irrelevant.
1. Overview
Before neural networks conquered NLP, the field built a rich stack of probabilistic models that remain conceptually foundational. TF-IDF powers production search engines today. N-gram perplexity is still the standard unit for evaluating language models — GPT-4 reports perplexity on benchmarks for exactly the same reason Shannon defined entropy in 1948. HMM / CRF intuitions re-appear inside modern sequence labelers and structured prediction heads. Understanding these tools answers the question "what exactly did the transformer replace, and why?"
The diagram above maps the classic NLP pipeline: raw text flows through preprocessing into one of three independent tracks — retrieval (TF-IDF), generation (n-gram LM), or structured tagging (HMM / CRF). Each track produces a different kind of output and has a different mathematical formulation. Modern LLMs absorb all three into a single forward pass, but the math each track introduced is still alive inside the training objective and the evaluation suite.
2. TF-IDF & Bag-of-Words Retrieval
The bag-of-words (BoW) model discards word order and represents each document as a sparse vector of term frequencies. Cosine similarity between two BoW vectors gives a retrieval score. The problem: common words like "the" dominate and obscure content words. TF-IDF fixes this by down-weighting words that appear in many documents:
TF(t, d) = count(t in d) / |d| IDF(t) = log( N / df(t) ) # N = total docs, df = docs containing t TF-IDF(t, d) = TF(t, d) × IDF(t)
A word unique to a single document gets a high IDF (log(N/1) = log N); "the" which appears in every document gets IDF ≈ 0. The product rewards words that are both frequent in this document and rare across the corpus.
The matrix above shows a 3-document, 3-word toy example. "dog" appears only in Doc 3, so IDF = log(3/1) ≈ 1.099 — the highest weight. "cat" and "sat" each appear in two documents, IDF ≈ 0.405. Retrieval ranks documents by cosine similarity of their TF-IDF vectors against a query vector built the same way. BM25 is a tuned variant of TF-IDF still used in Elasticsearch, Lucene, and hybrid RAG pipelines today.
3. N-gram Language Models
A language model assigns a probability to every sequence of words. The chain rulefactorises this exactly:
P(w_1, ..., w_T) = P(w_1) × P(w_2|w_1) × P(w_3|w_1,w_2) × ...
= ∏_{t=1}^{T} P(w_t | w_1...w_{t-1})The Markov assumption truncates the history to the last n−1 words. A trigram model (n=3) uses:
P(w_t | w_{t-1}, w_{t-2}) ≈ C(w_{t-2}, w_{t-1}, w_t) / C(w_{t-2}, w_{t-1})3.1 Smoothing
Unseen n-grams get probability 0 from raw counts — assigning 0 probability to any test token makes the whole sentence probability 0 and perplexity infinite. Smoothing redistributes mass from observed to unseen n-grams:
| Method | Formula | Notes |
|---|---|---|
| Laplace (add-1) | (C(h,w)+1)/(C(h)+V) | Simple but over-smooths for large V |
| Good-Turing | c* = (c+1)·N_{c+1}/N_c | Adjusts counts based on frequency-of-frequencies |
| Kneser-Ney | P_KN(w|h) = max(C-d,0)/C(h) + λ·P_continuation(w) | Best n-gram smoothing; "continuation probability" captures versatility of a word |
| Interpolation | P = λ₃P₃ + λ₂P₂ + λ₁P₁ | Mix trigram + bigram + unigram; λ tuned on dev set |
3.2 Perplexity
Perplexity (PPL) is the standard intrinsic metric for language models — it is the exponential of the average negative log-probability per word:
PPL(W) = 2^{ -1/N × Σ log_2 P(w_t | context) }
= P(W)^{-1/N} # N = number of tokensIntuition: if a model has perplexity 100, it is as confused as if it had to choose uniformly among 100 equally likely options at every step. Lower is better. A uniform distribution over a 50 257-token vocabulary has PPL = 50 257. GPT-2 (1.5 B) reaches PPL ≈ 18 on Penn Treebank. GPT-4-class models reach single-digit PPL on the same benchmark.
4. HMM POS Tagging & Viterbi Decoding
Part-of-speech (POS) tagging is the task of labelling each word with its grammatical role: noun, verb, determiner, etc. The Hidden Markov Model (HMM) treats POS tags as hidden states and observed words as emissions. It is a generative model — it models the joint probability P(words, tags).
4.1 Core Mechanism: Viterbi
Background: Given a sentence of T words, there are |S|T possible tag sequences. Brute-force enumeration is exponential; Viterbi achieves O(T·|S|²) by dynamic programming over a trellis.
Plan:
- Initialise:
v[t=0][s] = π[s] × B[s][obs₀] - Recurse:
v[t][s] = max_{s'} v[t-1][s'] × A[s'][s] × B[s][obs_t] - Store backpointer:
bp[t][s] = argmax_{s'} ... - Terminate: best final state =
argmax_s v[T-1][s] - Backtrack: follow backpointers from T-1 to 0 to recover the full tag sequence
Example walkthrough — sentence "the cat sat", states DT / NN / VB:
| t | word | v[DT] | v[NN] | v[VB] | winner |
|---|---|---|---|---|---|
| 0 | the | 0.420 | 0.003 | 0.000 | DT |
| 1 | cat | 0.001 | 0.118 | 0.002 | NN |
| 2 | sat | 0.009 | 0.003 | 0.038 | VB |
Backtracking from VB at t=2 → NN at t=1 → DT at t=0 gives the tag sequence DT NN VB, which is correct.
5. CRF — Conditional Random Fields
HMMs are generative and have a strong independence assumption: each word depends only on its tag, not on neighbouring words. CRFs are discriminative — they model P(tags | words) directly — and can incorporate arbitrary features of the entire input sequence (word shape, prefix, suffix, capitalization, context window). This is the critical advantage for NER and other sequence labeling tasks.
Linear-chain CRF:
P(y | x) = (1/Z(x)) × exp( Σ_t Σ_k λ_k f_k(y_t, y_{t-1}, x, t) )
f_k — feature functions (binary or real-valued)
λ_k — learned weights (trained by maximising log-likelihood)
Z(x) — partition function = Σ_{y'} exp(...) — ensures normalisation
Decoding: same Viterbi DP, using log-linear scores instead of P(s'→s)×P(w|s).Example CRF features for NER:
word.lower()— the word itself (lowercase)word.isupper()— is it ALL CAPS?word[:3]— prefix, useful for "un-", "re-"word[-3:]— suffix, useful for "-ing", "-tion"prev_word, next_word— context window
6. Minimal Demos
Demo 1 — Trigram LM on Shakespeare + Perplexity
Build a trigram language model with Laplace smoothing on a small Shakespeare corpus, measure perplexity on in-domain vs out-of-domain text, and generate a short continuation.
Demo 2 — Viterbi POS Tagging Step-by-Step
Hand-trace the Viterbi algorithm on "the cat sat" with a tiny 3-tag HMM. Every DP cell is printed so you can follow the recurrence manually.
7. Statistical MT — IBM Models (teaser)
Statistical machine translation (SMT) peaked around 2012–2014 before being displaced by neural seq2seq. Its core intuition shaped the transformer's cross-attention:
Noisy-channel model:
argmax_e P(e | f) = argmax_e P(f | e) × P(e)
────────── ─────
Translation Language
model model
IBM Model 1: P(f | e) = Π_j Σ_i t(f_j | e_i) / (I+1)^J
— t(f|e) are word-alignment probabilities, EM-estimated
— alignment: which source word generated each target word?
Phrase-based MT (Moses): aligns phrases, not words; uses a log-linear model
with ~5 features (TM, LM, reordering, word-penalty, phrase-penalty)The alignment table t(f|e) is a soft alignment matrix between source and target — exactly what Bahdanau attention in seq2seq models learned to compute from gradient descent instead of EM, and what the transformer's cross-attention generalises further.
8. Production / Source Pointers
| Tool / Concept | Production impl | Key entry point |
|---|---|---|
| TF-IDF / BM25 | Apache Lucene / Elasticsearch | BM25Similarity.java; BM25Scorer |
| N-gram LM + KN smoothing | KenLM (C++) | lm/model.hh :: Model::Score() |
| HMM / Viterbi | NLTK, spaCy | nltk.tag.HiddenMarkovModelTagger; spacy.en_core_web_sm |
| CRF (linear-chain) | sklearn-crfsuite, spaCy (NER head) | sklearn_crfsuite.CRF.fit() |
| CRF on BERT | HuggingFace transformers | BertForTokenClassification + torchcrf.CRF |
| Perplexity eval | lm-evaluation-harness | lm_eval/evaluator.py :: evaluate() |
9. References
Papers
- Shannon, C.E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal.
- Lafferty, J., McCallum, A., Pereira, F. (2001). "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." ICML 2001.
- Chen, S.F. & Goodman, J. (1999). "An empirical study of smoothing techniques for language modeling." Computer Speech and Language, 13(4).
- Kneser, R. & Ney, H. (1995). "Improved backing-off for m-gram language modeling." ICASSP 1995.
- Bahdanau, D., Cho, K., Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv:1409.0473.
- Robertson, S.E. & Spärck Jones, K. (1976). "Relevance weighting of search terms." Journal of ASIS, 27(3).
Lectures
- Stanford CS224n (Manning) — Lectures 1–4: Introduction, Word Vectors, Language Models
- CMU 11-411/611 — NLP: HMM POS tagging, CRF sequence labeling modules
- MIT 6.034 — AI lecture series: Probabilistic inference, Markov models
- Oxford Deep Learning for NLP (Hilary) — N-gram baselines
Textbooks
- Jurafsky, D. & Martin, J.H. — Speech and Language Processing, 3rd ed. (free draft). Chapters 3 (N-grams), 8 (HMMs and POS), 17 (Information Extraction / CRF).
- Eisenstein, J. — Introduction to Natural Language Processing (MIT Press, free PDF). Chapters 2–6.
- Manning, C. & Schütze, H. — Foundations of Statistical Natural Language Processing. MIT Press, 1999.
Code / Repos
kpu/kenlm— fastest n-gram LM in C++ (arpa format)nltk/nltk— HMM tagger, CFG parsersexplosion/spaCy— production NER with CRF headElasticSearch/elasticsearch— BM25 in productionEleutherAI/lm-evaluation-harness— PPL + benchmark eval
Blog Posts
- Hugging Face NLP Course — Chapter 1: Introduction to NLP
- Jurafsky & Martin SLP3 web chapter — "N-gram Language Models" (chapter 3 draft)
- Chris Olah — "Understanding LSTM Networks" (2015)
10. Interview Prep
- What is perplexity and how does it relate to cross-entropy?
PPL = 2H where H = −(1/N)Σ log₂ P(wᵢ | context) is the per-token cross-entropy in bits. Lower PPL = higher average token probability. A random model over V tokens has PPL = V. - Why does Laplace smoothing over-smooth for large vocabularies?
Adding 1 to every count inflates the denominator by V. For V = 100K and a bigram count of 2, Laplace reduces P(w|h) from 2/N to ≈3/(N+100K), cutting it roughly by a factor of N/100K. Kneser-Ney uses a fractional discount d ≈ 0.75 tuned on held-out data, avoiding this. - What is the time complexity of Viterbi decoding and why?
O(T · |S|²): for each of T time steps, for each of |S| current states, we take the max over |S| previous states — |S| multiplications → |S|² per step. Compare: brute force |S|T. - What is the key difference between HMM and CRF for sequence labeling?
HMM is generative: models P(words, tags) with independence assumptions. CRF is discriminative: models P(tags | words) with arbitrary features of the full input. CRF generally achieves better NER accuracy because it can use context window, prefix/suffix, and shape features simultaneously. - Why is TF-IDF still used in modern RAG pipelines alongside dense embeddings?
Dense embeddings capture semantic similarity but can miss exact keyword matches (e.g., product codes, rare names, medical terms). BM25 (a TF-IDF variant) excels at exact lexical matching. Hybrid retrieval scores = α × BM25 + (1-α) × dense consistently outperforms either alone on BEIR benchmark. - What is the Kneser-Ney continuation probability and why does it outperform Laplace?
PKN(w) = |{v : C(v,w) > 0}| / |{(u,v) : C(u,v) > 0}| — the fraction of unique bigrams w completes. "Francisco" appears often but only after "San", so it gets low continuation probability and won't be predicted in novel contexts. Laplace gives it high probability just because C("Francisco") is large. - What problem does the IBM translation model alignment table solve, and how does it relate to transformer attention?
IBM Model 1 learns t(f|e) — the probability that source word e generated target word f — via EM on parallel corpora. This is a soft alignment: each target word attends to all source words with learnable weights. Bahdanau attention (2015) replaced EM with gradient descent over the same alignment intuition. Transformer cross-attention generalises this to multi-head, scaled dot-product form. - Compare PPL of 18 vs PPL of 50,257 — what does each mean intuitively?
PPL = 50,257 ≈ uniform over the full vocab; the model has learned nothing. PPL = 18 means the model's average per-token distribution is as concentrated as a uniform distribution over only 18 tokens — it is usually right about the top few candidates. GPT-2 1.5B reaches PPL ≈ 18 on Penn Treebank; GPT-4-class models reach <5.