Part V — NLP Foundations

§ 5.2 Word Embeddings: Word2Vec, GloVe, FastText

Distributional hypothesis → dense vector representations → analogy arithmetic. The mathematics that turned NLP from sparse bag-of-words into geometry.

1. Overview

John Rupert Firth wrote in 1957: "you shall know a word by the company it keeps." This distributional hypothesis — that words with similar meanings appear in similar contexts — is the foundation of every embedding method, from Word2Vec to the attention matrix inside GPT-4. Word embeddings translate that intuition into dense real-valued vectors: words with similar contexts end up nearby in ℝ^d, and the direction of offset between vectors encodes semantic relationships.

The lineage from static to contextualised embeddings is a straight line: Word2Vec (2013) proved that a shallow neural net trained on a prediction objective produces geometrically meaningful vectors. GloVe (2014) showed the same can be achieved by factorising a log co-occurrence matrix. FastText (2017) extended both with subword n-grams. ELMo (2018) and BERT (2018) then made embeddings context-dependent — the same word gets a different vector depending on its sentence, solving the polysemy problem.

2. Word2Vec: Skip-gram and CBOW

Mikolov et al. (2013) introduced two architectures. Both train a shallow neural network whose weight matrix becomes the word embeddings — the prediction task is just a vehicle for learning the geometry.

2.1 Skip-gram Architecture

Skip-gram takes a center word and predicts each of its context words independently. The model is a lookup table (the embedding matrix) followed by a dot product and softmax.

Skip-gram objective (full softmax):
  J = -Σ_t Σ_{-k≤j≤k, j≠0} log P(w_{t+j} | w_t)

  P(o | c) = exp(u_o · v_c) / Σ_{w=1}^{V} exp(u_w · v_c)
             ─────────────────────────────────────────────
             v_c = input embedding (row of W_in)
             u_o = output embedding (row of W_out)

Window size k = 2 → 4 predictions per center word

2.2 Negative Sampling — Core Mechanism

Background: The full softmax denominator sums over all V ≈ 50 000 tokens at every gradient step — O(V) per update. At scale this is prohibitive: a 100M-word corpus with a window of 2 generates ~400M (center, context) pairs.

Plan:

Treat each (center, context) pair as a positive example (label 1).
Draw k random noise words from the corpus unigram distribution raised to the ¾ power: P_noise(w) ∝ count(w)^{3/4}. Label them 0.
Binary-cross-entropy on positive + k negatives replaces softmax.
Gradient cost drops from O(V) to O(k), where k = 5–20 in practice.

Negative sampling objective per positive pair (c, o):

  J_NS = log σ(v_c · u_o) + Σ_{i=1}^{k} E_{n_i~P_noise} [log σ(−v_c · u_{n_i})]

  σ(x) = 1/(1+exp(−x))

P_noise(w) ∝ count(w)^{0.75}  — the 3/4-power "smoothes" towards uniform;
   rare words get more training signal as negatives than with raw counts

Example walkthrough — sentence "the cat sat", window=1, k=2:

Step	Center	Context / Noise	Label	Loss term
+	cat	the	1	log σ(v_cat · u_the)
+	cat	sat	1	log σ(v_cat · u_sat)
−	cat	king (sampled)	0	log σ(−v_cat · u_king)
−	cat	walk (sampled)	0	log σ(−v_cat · u_walk)

After this update: v_cat moves closer to u_the and u_sat (context it really appears with), and further from u_king and u_walk (random noise). Repeat over a billion tokens and the geometry converges to encode semantic relationships.

3. GloVe — Global Vectors

Pennington et al. (2014) observed that Word2Vec implicitly factorises a shifted pointwise mutual information (PMI) matrix via stochastic gradient descent — but does so inefficiently because it sees only a tiny local window at a time. GloVe builds the full co-occurrence matrix first (one pass over the corpus) and then fits embeddings to it directly, exploiting global statistics from the start.

GloVe objective:
  J = Σ_{i,j=1}^{V} f(X_ij) (w_i · w_j + b_i + b_j − log X_ij)²

  f(x) = (x / x_max)^α   if x < x_max     x_max=100, α=0.75
          1                otherwise

Key insight — ratio of co-occurrence probabilities:
  log P(solid | ice) / P(solid | steam)  ≈ large positive  (ice is solid)
  log P(gas   | ice) / P(gas   | steam)  ≈ large negative  (steam is gas)
  log P(water | ice) / P(water | steam)  ≈ 0               (both relate to water)

  This ratio is what the vector offset w_ice − w_steam should encode.

4. FastText — Subword Embeddings

Both Word2Vec and GloVe represent each word as a single atomic unit. This breaks for morphologically rich languages and for any out-of-vocabulary (OOV) token. Bojanowski et al. (2017) decompose each word into character n-grams and represent the word as the sum of its n-gram embeddings.

Model	OOV words	Morphology	Training cost	Best for
Word2Vec SG	zero vector / UNK	none	fast (neg. sampling)	English, large corpora
GloVe	zero vector / UNK	none	one-time matrix build	stable, pretrained vecs widely available
FastText	inferred from n-grams	strong	~3× slower than W2V	Morphologically rich languages, code

5. Analogy Arithmetic & Evaluation

The most famous property of word vectors is that vector arithmetic encodes semantic relationships: king − man + woman ≈ queen. This works because directions in embedding space encode relational structure.

5.1 Intrinsic Evaluation

Intrinsic benchmarks measure embedding quality directly, without a downstream task:

Word Analogy (Google dataset): 19 564 analogies, 3COSADD method. Accuracy: W2V=65%, GloVe=75%, FastText=78% on syntactic analogies.
Word Similarity (WordSim-353, SimLex-999): Spearman correlation between cosine similarity and human judgements.
Nearest-neighbour coherence: top-10 neighbours should be semantically related.

5.2 Extrinsic Evaluation

Extrinsic evaluation plugs embeddings into a downstream task and measures that task's performance. Better embeddings → better downstream scores, but the correlation is imperfect: an embedding that wins on analogy may lose on NER. Extrinsic evaluation is more expensive but more reliable.

Task	What is measured	Metric
Sentiment classification (SST-2)	Sentiment signal in embedding space	Accuracy
Named Entity Recognition (CoNLL-2003)	Entity boundary + type geometry	F1
POS tagging (Penn Treebank)	Syntactic cluster quality	Accuracy

5.3 Limitations of Static Embeddings

Static embeddings assign one vector per word type, regardless of context. This creates fundamental limitations:

Polysemy: "bank" (river) and "bank" (financial) get the same vector — a weighted average of both senses.
OOV: Word2Vec and GloVe fail on words not in the training vocabulary. FastText mitigates this but doesn't eliminate it.
Context blindness: "I made her duck" — is "duck" a verb or noun? Static embeddings can't tell.

ELMo (2018) addressed this by using the BiLSTM hidden state at each word's position as its embedding — so the same word gets different vectors depending on its context. BERT (2018) replaced the BiLSTM with a bidirectional transformer, producing still richer context-dependent representations. These remain the same core idea: dense representation derived from prediction, now conditioned on context.

6. Minimal Demos

Demo 1 — Word2Vec Analogy Explorer

Analogy arithmetic on a hand-crafted 5-dimensional embedding space. Demonstrates the king − man + woman ≈ queen relationship via cosine nearest-neighbour search.

Word2Vec Analogy Arithmetic — C Demo

stdin (optional)

Demo 2 — Skip-gram Training from Scratch

Train a skip-gram model with negative sampling entirely in pure Python (no NumPy). After training, nearest-neighbour lookup shows that "cat" is close to "kitten" — a relationship only present because they co-occur in similar contexts.

Skip-gram Training from Scratch — C Demo

stdin (optional)

7. Production / Source Pointers

Tool	Repo / package	Key entry point
Word2Vec (original C)	google-code-archive word2vec	word2vec.c :: TrainModel()
Word2Vec (Gensim Python)	piskvorky/gensim	gensim.models.Word2Vec; .wv.most_similar()
GloVe	stanfordnlp/GloVe	src/glove.c :: GloVe(); pretrained vectors at nlp.stanford.edu
FastText	facebookresearch/fastText	fasttext.train_unsupervised(); cc.en.300.bin pretrained
nn.Embedding (PyTorch)	pytorch/pytorch	torch.nn.Embedding(V, d); from_pretrained(weights)
Evaluation harness	kudkudak/word-embeddings-benchmarks	scripts/evaluate_on_all.py

8. References

Papers

Mikolov, T. et al. (2013). "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781.
Mikolov, T. et al. (2013). "Distributed Representations of Words and Phrases and their Compositionality." arXiv:1310.4546. (Negative sampling paper)
Pennington, J., Socher, R., Manning, C. (2014). "GloVe: Global Vectors for Word Representation." EMNLP 2014.
Bojanowski, P. et al. (2017). "Enriching Word Vectors with Subword Information." arXiv:1607.04606. (FastText)
Levy, O. & Goldberg, Y. (2014). "Neural Word Embedding as Implicit Matrix Factorization." NeurIPS 2014.
Peters, M. et al. (2018). "Deep Contextualized Word Representations." arXiv:1802.05365. (ELMo)
Firth, J. R. (1957). "A synopsis of linguistic theory 1930–55." Studies in Linguistic Analysis.

Lectures

Stanford CS224n (Manning) — Lecture 1: Word Vectors; Lecture 2: Word Vectors 2 & Word Senses
Stanford CS224n — Lecture notes on GloVe, negative sampling derivation
CMU 11-711 — Advanced NLP: Representation Learning module
NYU DS-GA 1008 (LeCun) — Lecture on energy-based learning & word vectors

Textbooks

Jurafsky & Martin — Speech and Language Processing, 3rd ed. Chapter 6: Vector Semantics and Embeddings.
Goldberg, Y. — Neural Network Methods for Natural Language Processing. Morgan & Claypool, 2017.

Code / Repos

piskvorky/gensim — Word2Vec, Doc2Vec, FastText in Python
stanfordnlp/GloVe — official GloVe C implementation + pretrained vectors
facebookresearch/fastText — official FastText; 157 language pretrained models
karpathy/makemore — character-level LM that evolves from n-grams to transformers

Blog Posts

Jay Alammar — "The Illustrated Word2Vec" (2019) — best visual walkthrough
Sebastian Ruder — "On Word Embeddings" parts 1–3
Chris McCormick — "Word2Vec Tutorial — The Skip-gram Model" (2016)

9. Interview Prep

Why does Word2Vec work? What objective is it actually optimising?
Skip-gram with negative sampling approximates maximising the log-probability of (center, context) pairs vs. random noise pairs. Levy & Goldberg (2014) showed it implicitly factorises a shifted PMI matrix: each dot product w_i · c_j converges to PMI(i,j) − log k. Geometry emerges because PMI captures distributional similarity.
What is the 3/4-power in the noise distribution and why?
P_noise(w) ∝ count(w)^{0.75}. Raw count would over-sample very frequent words (the, a, is) as negatives — they provide weak training signal because the model already assigns them low dot product with any content word. The 0.75 exponent smoothes toward uniform, giving rare words more negative-sampling exposure. Mikolov found this empirically; no closed-form justification exists.
What is the difference between GloVe and Word2Vec at a mathematical level?
Word2Vec: SGD on local (center, context) windows — implicitly factorises PMI. GloVe: weighted least-squares on the global log co-occurrence matrix — directly factorises log X. GloVe uses every co-occurrence count once (weighted by f(X_ij)); Word2Vec samples pairs proportionally to co-occurrence frequency. In practice both produce similar quality; GloVe is more reproducible, Word2Vec scales better to 100B+ tokens via streaming.
Why do analogy tasks work and when do they fail?
3COSADD: argmax_b* cos(b*, a* − a + b). Works when the analogy relation is linear in embedding space — a necessary but not sufficient condition for distributional similarity to imply semantic structure. Fails for: low-frequency words (sparse training signal), polysemous anchor words, relations that are non-linear (e.g., antonyms which appear in similar contexts → small distance in embedding space).
How does FastText handle OOV at inference time?
It decomposes the OOV word into its character n-grams (same n range used at training time, e.g., 3-6), looks up each n-gram embedding, and sums them. The OOV word has no dedicated embedding — it is composed on the fly. Words sharing morphological roots automatically share most n-gram embeddings, so the resulting vector is semantically reasonable.
Why don't static embeddings handle polysemy, and how did ELMo fix this?
Word2Vec/GloVe assign one vector per word type, computed as a weighted average over all senses in the training corpus. "bank" ends up between its financial and geographic senses. ELMo (Peters et al. 2018) computes embeddings from the hidden states of a bidirectional LSTM trained on a language modeling objective — so the embedding of "bank" in "river bank" and "bank account" differs because the LSTM hidden state captures left/right context. BERT generalised this to transformers.
If you had to pick one embedding method for a production multilingual NER system today, which would you choose and why?
Neither — you'd use a pretrained multilingual BERT (mBERT) or XLM-R, which are contextualised and cover 100+ languages. But if forced to static: FastText's pretrained cc.* vectors for 157 languages, because they handle morphology and OOV, both common in non-English languages. FastText vectors are still the recommended baseline in the spaCy pipeline for languages without transformer coverage.