Part V — NLP Foundations

§ 5.1 Pre-Neural NLP: n-grams, HMM, CRF, TF-IDF

Bag-of-words, n-gram language models, smoothing, perplexity, HMM POS tagging, and CRF sequence labeling — the foundations that transformers quietly replaced but never made irrelevant.

1. Overview

Before neural networks conquered NLP, the field built a rich stack of probabilistic models that remain conceptually foundational. TF-IDF powers production search engines today. N-gram perplexity is still the standard unit for evaluating language models — GPT-4 reports perplexity on benchmarks for exactly the same reason Shannon defined entropy in 1948. HMM / CRF intuitions re-appear inside modern sequence labelers and structured prediction heads. Understanding these tools answers the question "what exactly did the transformer replace, and why?"

The diagram above maps the classic NLP pipeline: raw text flows through preprocessing into one of three independent tracks — retrieval (TF-IDF), generation (n-gram LM), or structured tagging (HMM / CRF). Each track produces a different kind of output and has a different mathematical formulation. Modern LLMs absorb all three into a single forward pass, but the math each track introduced is still alive inside the training objective and the evaluation suite.

2. TF-IDF & Bag-of-Words Retrieval

The bag-of-words (BoW) model discards word order and represents each document as a sparse vector of term frequencies. Cosine similarity between two BoW vectors gives a retrieval score. The problem: common words like "the" dominate and obscure content words. TF-IDF fixes this by down-weighting words that appear in many documents:

TF(t, d)  = count(t in d) / |d|
IDF(t)    = log( N / df(t) )          # N = total docs, df = docs containing t
TF-IDF(t, d) = TF(t, d) × IDF(t)

A word unique to a single document gets a high IDF (log(N/1) = log N); "the" which appears in every document gets IDF ≈ 0. The product rewards words that are both frequent in this document and rare across the corpus.

The matrix above shows a 3-document, 3-word toy example. "dog" appears only in Doc 3, so IDF = log(3/1) ≈ 1.099 — the highest weight. "cat" and "sat" each appear in two documents, IDF ≈ 0.405. Retrieval ranks documents by cosine similarity of their TF-IDF vectors against a query vector built the same way. BM25 is a tuned variant of TF-IDF still used in Elasticsearch, Lucene, and hybrid RAG pipelines today.

Modern relevance: Hybrid retrieval (BM25 + dense embeddings) outperforms either alone on many benchmarks. TF-IDF gives a sparse, interpretable first-pass that dense vectors then re-rank. Most production RAG pipelines use both.

3. N-gram Language Models

A language model assigns a probability to every sequence of words. The chain rulefactorises this exactly:

P(w_1, ..., w_T) = P(w_1) × P(w_2|w_1) × P(w_3|w_1,w_2) × ...
                 = ∏_{t=1}^{T} P(w_t | w_1...w_{t-1})

The Markov assumption truncates the history to the last n−1 words. A trigram model (n=3) uses:

P(w_t | w_{t-1}, w_{t-2}) ≈ C(w_{t-2}, w_{t-1}, w_t) / C(w_{t-2}, w_{t-1})

3.1 Smoothing

Unseen n-grams get probability 0 from raw counts — assigning 0 probability to any test token makes the whole sentence probability 0 and perplexity infinite. Smoothing redistributes mass from observed to unseen n-grams:

MethodFormulaNotes
Laplace (add-1)(C(h,w)+1)/(C(h)+V)Simple but over-smooths for large V
Good-Turingc* = (c+1)·N_{c+1}/N_cAdjusts counts based on frequency-of-frequencies
Kneser-NeyP_KN(w|h) = max(C-d,0)/C(h) + λ·P_continuation(w)Best n-gram smoothing; "continuation probability" captures versatility of a word
InterpolationP = λ₃P₃ + λ₂P₂ + λ₁P₁Mix trigram + bigram + unigram; λ tuned on dev set

3.2 Perplexity

Perplexity (PPL) is the standard intrinsic metric for language models — it is the exponential of the average negative log-probability per word:

PPL(W) = 2^{ -1/N × Σ log_2 P(w_t | context) }
       = P(W)^{-1/N}                              # N = number of tokens

Intuition: if a model has perplexity 100, it is as confused as if it had to choose uniformly among 100 equally likely options at every step. Lower is better. A uniform distribution over a 50 257-token vocabulary has PPL = 50 257. GPT-2 (1.5 B) reaches PPL ≈ 18 on Penn Treebank. GPT-4-class models reach single-digit PPL on the same benchmark.

Interview trap: PPL is model-specific — comparing PPL across models requires the same tokenizer and the same test split. A model with a larger vocabulary tokenizes the same text into fewer tokens, making its PPL not directly comparable to a smaller-vocab model.

4. HMM POS Tagging & Viterbi Decoding

Part-of-speech (POS) tagging is the task of labelling each word with its grammatical role: noun, verb, determiner, etc. The Hidden Markov Model (HMM) treats POS tags as hidden states and observed words as emissions. It is a generative model — it models the joint probability P(words, tags).

4.1 Core Mechanism: Viterbi

Background: Given a sentence of T words, there are |S|T possible tag sequences. Brute-force enumeration is exponential; Viterbi achieves O(T·|S|²) by dynamic programming over a trellis.

Plan:

  1. Initialise: v[t=0][s] = π[s] × B[s][obs₀]
  2. Recurse: v[t][s] = max_{s'} v[t-1][s'] × A[s'][s] × B[s][obs_t]
  3. Store backpointer: bp[t][s] = argmax_{s'} ...
  4. Terminate: best final state = argmax_s v[T-1][s]
  5. Backtrack: follow backpointers from T-1 to 0 to recover the full tag sequence

Example walkthrough — sentence "the cat sat", states DT / NN / VB:

twordv[DT]v[NN]v[VB]winner
0the0.4200.0030.000DT
1cat0.0010.1180.002NN
2sat0.0090.0030.038VB

Backtracking from VB at t=2 → NN at t=1 → DT at t=0 gives the tag sequence DT NN VB, which is correct.

5. CRF — Conditional Random Fields

HMMs are generative and have a strong independence assumption: each word depends only on its tag, not on neighbouring words. CRFs are discriminative — they model P(tags | words) directly — and can incorporate arbitrary features of the entire input sequence (word shape, prefix, suffix, capitalization, context window). This is the critical advantage for NER and other sequence labeling tasks.

Linear-chain CRF:

P(y | x) = (1/Z(x)) × exp( Σ_t Σ_k λ_k f_k(y_t, y_{t-1}, x, t) )

f_k  — feature functions (binary or real-valued)
λ_k  — learned weights (trained by maximising log-likelihood)
Z(x) — partition function = Σ_{y'} exp(...) — ensures normalisation

Decoding: same Viterbi DP, using log-linear scores instead of P(s'→s)×P(w|s).

Example CRF features for NER:

  • word.lower() — the word itself (lowercase)
  • word.isupper() — is it ALL CAPS?
  • word[:3] — prefix, useful for "un-", "re-"
  • word[-3:] — suffix, useful for "-ing", "-tion"
  • prev_word, next_word — context window
Modern relevance: CRF heads are still appended to BERT-family models for NER tasks. The CRF layer enforces globally valid tag sequences (e.g., I-PER cannot follow B-ORG) using the same Viterbi decoding from 1967.

6. Minimal Demos

Demo 1 — Trigram LM on Shakespeare + Perplexity

Build a trigram language model with Laplace smoothing on a small Shakespeare corpus, measure perplexity on in-domain vs out-of-domain text, and generate a short continuation.

Trigram LM — Shakespeare Perplexity — C Demo
stdin (optional)

Demo 2 — Viterbi POS Tagging Step-by-Step

Hand-trace the Viterbi algorithm on "the cat sat" with a tiny 3-tag HMM. Every DP cell is printed so you can follow the recurrence manually.

Viterbi POS Tagging — C Demo
stdin (optional)

7. Statistical MT — IBM Models (teaser)

Statistical machine translation (SMT) peaked around 2012–2014 before being displaced by neural seq2seq. Its core intuition shaped the transformer's cross-attention:

Noisy-channel model:
  argmax_e P(e | f)  =  argmax_e  P(f | e) × P(e)
                                  ──────────   ─────
                                  Translation  Language
                                  model        model

IBM Model 1:  P(f | e) = Π_j Σ_i t(f_j | e_i) / (I+1)^J
  — t(f|e) are word-alignment probabilities, EM-estimated
  — alignment: which source word generated each target word?

Phrase-based MT (Moses): aligns phrases, not words; uses a log-linear model
  with ~5 features (TM, LM, reordering, word-penalty, phrase-penalty)

The alignment table t(f|e) is a soft alignment matrix between source and target — exactly what Bahdanau attention in seq2seq models learned to compute from gradient descent instead of EM, and what the transformer's cross-attention generalises further.

8. Production / Source Pointers

Tool / ConceptProduction implKey entry point
TF-IDF / BM25Apache Lucene / ElasticsearchBM25Similarity.java; BM25Scorer
N-gram LM + KN smoothingKenLM (C++)lm/model.hh :: Model::Score()
HMM / ViterbiNLTK, spaCynltk.tag.HiddenMarkovModelTagger; spacy.en_core_web_sm
CRF (linear-chain)sklearn-crfsuite, spaCy (NER head)sklearn_crfsuite.CRF.fit()
CRF on BERTHuggingFace transformersBertForTokenClassification + torchcrf.CRF
Perplexity evallm-evaluation-harnesslm_eval/evaluator.py :: evaluate()

9. References

Papers

  • Shannon, C.E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal.
  • Lafferty, J., McCallum, A., Pereira, F. (2001). "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." ICML 2001.
  • Chen, S.F. & Goodman, J. (1999). "An empirical study of smoothing techniques for language modeling." Computer Speech and Language, 13(4).
  • Kneser, R. & Ney, H. (1995). "Improved backing-off for m-gram language modeling." ICASSP 1995.
  • Bahdanau, D., Cho, K., Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv:1409.0473.
  • Robertson, S.E. & Spärck Jones, K. (1976). "Relevance weighting of search terms." Journal of ASIS, 27(3).

Lectures

  • Stanford CS224n (Manning) — Lectures 1–4: Introduction, Word Vectors, Language Models
  • CMU 11-411/611 — NLP: HMM POS tagging, CRF sequence labeling modules
  • MIT 6.034 — AI lecture series: Probabilistic inference, Markov models
  • Oxford Deep Learning for NLP (Hilary) — N-gram baselines

Textbooks

  • Jurafsky, D. & Martin, J.H. — Speech and Language Processing, 3rd ed. (free draft). Chapters 3 (N-grams), 8 (HMMs and POS), 17 (Information Extraction / CRF).
  • Eisenstein, J. — Introduction to Natural Language Processing (MIT Press, free PDF). Chapters 2–6.
  • Manning, C. & Schütze, H. — Foundations of Statistical Natural Language Processing. MIT Press, 1999.

Code / Repos

  • kpu/kenlm — fastest n-gram LM in C++ (arpa format)
  • nltk/nltk — HMM tagger, CFG parsers
  • explosion/spaCy — production NER with CRF head
  • ElasticSearch/elasticsearch — BM25 in production
  • EleutherAI/lm-evaluation-harness — PPL + benchmark eval

Blog Posts

  • Hugging Face NLP Course — Chapter 1: Introduction to NLP
  • Jurafsky & Martin SLP3 web chapter — "N-gram Language Models" (chapter 3 draft)
  • Chris Olah — "Understanding LSTM Networks" (2015)

10. Interview Prep

  1. What is perplexity and how does it relate to cross-entropy?
    PPL = 2H where H = −(1/N)Σ log₂ P(wᵢ | context) is the per-token cross-entropy in bits. Lower PPL = higher average token probability. A random model over V tokens has PPL = V.
  2. Why does Laplace smoothing over-smooth for large vocabularies?
    Adding 1 to every count inflates the denominator by V. For V = 100K and a bigram count of 2, Laplace reduces P(w|h) from 2/N to ≈3/(N+100K), cutting it roughly by a factor of N/100K. Kneser-Ney uses a fractional discount d ≈ 0.75 tuned on held-out data, avoiding this.
  3. What is the time complexity of Viterbi decoding and why?
    O(T · |S|²): for each of T time steps, for each of |S| current states, we take the max over |S| previous states — |S| multiplications → |S|² per step. Compare: brute force |S|T.
  4. What is the key difference between HMM and CRF for sequence labeling?
    HMM is generative: models P(words, tags) with independence assumptions. CRF is discriminative: models P(tags | words) with arbitrary features of the full input. CRF generally achieves better NER accuracy because it can use context window, prefix/suffix, and shape features simultaneously.
  5. Why is TF-IDF still used in modern RAG pipelines alongside dense embeddings?
    Dense embeddings capture semantic similarity but can miss exact keyword matches (e.g., product codes, rare names, medical terms). BM25 (a TF-IDF variant) excels at exact lexical matching. Hybrid retrieval scores = α × BM25 + (1-α) × dense consistently outperforms either alone on BEIR benchmark.
  6. What is the Kneser-Ney continuation probability and why does it outperform Laplace?
    PKN(w) = |{v : C(v,w) > 0}| / |{(u,v) : C(u,v) > 0}| — the fraction of unique bigrams w completes. "Francisco" appears often but only after "San", so it gets low continuation probability and won't be predicted in novel contexts. Laplace gives it high probability just because C("Francisco") is large.
  7. What problem does the IBM translation model alignment table solve, and how does it relate to transformer attention?
    IBM Model 1 learns t(f|e) — the probability that source word e generated target word f — via EM on parallel corpora. This is a soft alignment: each target word attends to all source words with learnable weights. Bahdanau attention (2015) replaced EM with gradient descent over the same alignment intuition. Transformer cross-attention generalises this to multi-head, scaled dot-product form.
  8. Compare PPL of 18 vs PPL of 50,257 — what does each mean intuitively?
    PPL = 50,257 ≈ uniform over the full vocab; the model has learned nothing. PPL = 18 means the model's average per-token distribution is as concentrated as a uniform distribution over only 18 tokens — it is usually right about the top few candidates. GPT-2 1.5B reaches PPL ≈ 18 on Penn Treebank; GPT-4-class models reach <5.