§ 5.2 Word Embeddings: Word2Vec, GloVe, FastText
Distributional hypothesis → dense vector representations → analogy arithmetic. The mathematics that turned NLP from sparse bag-of-words into geometry.
1. Overview
John Rupert Firth wrote in 1957: "you shall know a word by the company it keeps." This distributional hypothesis — that words with similar meanings appear in similar contexts — is the foundation of every embedding method, from Word2Vec to the attention matrix inside GPT-4. Word embeddings translate that intuition into dense real-valued vectors: words with similar contexts end up nearby in ℝd, and the direction of offset between vectors encodes semantic relationships.
The lineage from static to contextualised embeddings is a straight line: Word2Vec (2013) proved that a shallow neural net trained on a prediction objective produces geometrically meaningful vectors. GloVe (2014) showed the same can be achieved by factorising a log co-occurrence matrix. FastText (2017) extended both with subword n-grams. ELMo (2018) and BERT (2018) then made embeddings context-dependent — the same word gets a different vector depending on its sentence, solving the polysemy problem.
2. Word2Vec: Skip-gram and CBOW
Mikolov et al. (2013) introduced two architectures. Both train a shallow neural network whose weight matrix becomes the word embeddings — the prediction task is just a vehicle for learning the geometry.
2.1 Skip-gram Architecture
Skip-gram takes a center word and predicts each of its context words independently. The model is a lookup table (the embedding matrix) followed by a dot product and softmax.
Skip-gram objective (full softmax):
J = -Σ_t Σ_{-k≤j≤k, j≠0} log P(w_{t+j} | w_t)
P(o | c) = exp(u_o · v_c) / Σ_{w=1}^{V} exp(u_w · v_c)
─────────────────────────────────────────────
v_c = input embedding (row of W_in)
u_o = output embedding (row of W_out)
Window size k = 2 → 4 predictions per center word2.2 Negative Sampling — Core Mechanism
Background: The full softmax denominator sums over all V ≈ 50 000 tokens at every gradient step — O(V) per update. At scale this is prohibitive: a 100M-word corpus with a window of 2 generates ~400M (center, context) pairs.
Plan:
- Treat each (center, context) pair as a positive example (label 1).
- Draw k random noise words from the corpus unigram distribution raised to the ¾ power: P_noise(w) ∝ count(w)^{3/4}. Label them 0.
- Binary-cross-entropy on positive + k negatives replaces softmax.
- Gradient cost drops from O(V) to O(k), where k = 5–20 in practice.
Negative sampling objective per positive pair (c, o):
J_NS = log σ(v_c · u_o) + Σ_{i=1}^{k} E_{n_i~P_noise} [log σ(−v_c · u_{n_i})]
σ(x) = 1/(1+exp(−x))
P_noise(w) ∝ count(w)^{0.75} — the 3/4-power "smoothes" towards uniform;
rare words get more training signal as negatives than with raw countsExample walkthrough — sentence "the cat sat", window=1, k=2:
| Step | Center | Context / Noise | Label | Loss term |
|---|---|---|---|---|
| + | cat | the | 1 | log σ(v_cat · u_the) |
| + | cat | sat | 1 | log σ(v_cat · u_sat) |
| − | cat | king (sampled) | 0 | log σ(−v_cat · u_king) |
| − | cat | walk (sampled) | 0 | log σ(−v_cat · u_walk) |
After this update: v_cat moves closer to u_the and u_sat (context it really appears with), and further from u_king and u_walk (random noise). Repeat over a billion tokens and the geometry converges to encode semantic relationships.
3. GloVe — Global Vectors
Pennington et al. (2014) observed that Word2Vec implicitly factorises a shifted pointwise mutual information (PMI) matrix via stochastic gradient descent — but does so inefficiently because it sees only a tiny local window at a time. GloVe builds the full co-occurrence matrix first (one pass over the corpus) and then fits embeddings to it directly, exploiting global statistics from the start.
GloVe objective:
J = Σ_{i,j=1}^{V} f(X_ij) (w_i · w_j + b_i + b_j − log X_ij)²
f(x) = (x / x_max)^α if x < x_max x_max=100, α=0.75
1 otherwise
Key insight — ratio of co-occurrence probabilities:
log P(solid | ice) / P(solid | steam) ≈ large positive (ice is solid)
log P(gas | ice) / P(gas | steam) ≈ large negative (steam is gas)
log P(water | ice) / P(water | steam) ≈ 0 (both relate to water)
This ratio is what the vector offset w_ice − w_steam should encode.4. FastText — Subword Embeddings
Both Word2Vec and GloVe represent each word as a single atomic unit. This breaks for morphologically rich languages and for any out-of-vocabulary (OOV) token. Bojanowski et al. (2017) decompose each word into character n-grams and represent the word as the sum of its n-gram embeddings.
| Model | OOV words | Morphology | Training cost | Best for |
|---|---|---|---|---|
| Word2Vec SG | zero vector / UNK | none | fast (neg. sampling) | English, large corpora |
| GloVe | zero vector / UNK | none | one-time matrix build | stable, pretrained vecs widely available |
| FastText | inferred from n-grams | strong | ~3× slower than W2V | Morphologically rich languages, code |
5. Analogy Arithmetic & Evaluation
The most famous property of word vectors is that vector arithmetic encodes semantic relationships: king − man + woman ≈ queen. This works because directions in embedding space encode relational structure.
5.1 Intrinsic Evaluation
Intrinsic benchmarks measure embedding quality directly, without a downstream task:
- Word Analogy (Google dataset): 19 564 analogies, 3COSADD method. Accuracy: W2V=65%, GloVe=75%, FastText=78% on syntactic analogies.
- Word Similarity (WordSim-353, SimLex-999): Spearman correlation between cosine similarity and human judgements.
- Nearest-neighbour coherence: top-10 neighbours should be semantically related.
5.2 Extrinsic Evaluation
Extrinsic evaluation plugs embeddings into a downstream task and measures that task's performance. Better embeddings → better downstream scores, but the correlation is imperfect: an embedding that wins on analogy may lose on NER. Extrinsic evaluation is more expensive but more reliable.
| Task | What is measured | Metric |
|---|---|---|
| Sentiment classification (SST-2) | Sentiment signal in embedding space | Accuracy |
| Named Entity Recognition (CoNLL-2003) | Entity boundary + type geometry | F1 |
| POS tagging (Penn Treebank) | Syntactic cluster quality | Accuracy |
5.3 Limitations of Static Embeddings
Static embeddings assign one vector per word type, regardless of context. This creates fundamental limitations:
- Polysemy: "bank" (river) and "bank" (financial) get the same vector — a weighted average of both senses.
- OOV: Word2Vec and GloVe fail on words not in the training vocabulary. FastText mitigates this but doesn't eliminate it.
- Context blindness: "I made her duck" — is "duck" a verb or noun? Static embeddings can't tell.
ELMo (2018) addressed this by using the BiLSTM hidden state at each word's position as its embedding — so the same word gets different vectors depending on its context. BERT (2018) replaced the BiLSTM with a bidirectional transformer, producing still richer context-dependent representations. These remain the same core idea: dense representation derived from prediction, now conditioned on context.
6. Minimal Demos
Demo 1 — Word2Vec Analogy Explorer
Analogy arithmetic on a hand-crafted 5-dimensional embedding space. Demonstrates the king − man + woman ≈ queen relationship via cosine nearest-neighbour search.
Demo 2 — Skip-gram Training from Scratch
Train a skip-gram model with negative sampling entirely in pure Python (no NumPy). After training, nearest-neighbour lookup shows that "cat" is close to "kitten" — a relationship only present because they co-occur in similar contexts.
7. Production / Source Pointers
| Tool | Repo / package | Key entry point |
|---|---|---|
| Word2Vec (original C) | google-code-archive word2vec | word2vec.c :: TrainModel() |
| Word2Vec (Gensim Python) | piskvorky/gensim | gensim.models.Word2Vec; .wv.most_similar() |
| GloVe | stanfordnlp/GloVe | src/glove.c :: GloVe(); pretrained vectors at nlp.stanford.edu |
| FastText | facebookresearch/fastText | fasttext.train_unsupervised(); cc.en.300.bin pretrained |
| nn.Embedding (PyTorch) | pytorch/pytorch | torch.nn.Embedding(V, d); from_pretrained(weights) |
| Evaluation harness | kudkudak/word-embeddings-benchmarks | scripts/evaluate_on_all.py |
8. References
Papers
- Mikolov, T. et al. (2013). "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781.
- Mikolov, T. et al. (2013). "Distributed Representations of Words and Phrases and their Compositionality." arXiv:1310.4546. (Negative sampling paper)
- Pennington, J., Socher, R., Manning, C. (2014). "GloVe: Global Vectors for Word Representation." EMNLP 2014.
- Bojanowski, P. et al. (2017). "Enriching Word Vectors with Subword Information." arXiv:1607.04606. (FastText)
- Levy, O. & Goldberg, Y. (2014). "Neural Word Embedding as Implicit Matrix Factorization." NeurIPS 2014.
- Peters, M. et al. (2018). "Deep Contextualized Word Representations." arXiv:1802.05365. (ELMo)
- Firth, J. R. (1957). "A synopsis of linguistic theory 1930–55." Studies in Linguistic Analysis.
Lectures
- Stanford CS224n (Manning) — Lecture 1: Word Vectors; Lecture 2: Word Vectors 2 & Word Senses
- Stanford CS224n — Lecture notes on GloVe, negative sampling derivation
- CMU 11-711 — Advanced NLP: Representation Learning module
- NYU DS-GA 1008 (LeCun) — Lecture on energy-based learning & word vectors
Textbooks
- Jurafsky & Martin — Speech and Language Processing, 3rd ed. Chapter 6: Vector Semantics and Embeddings.
- Goldberg, Y. — Neural Network Methods for Natural Language Processing. Morgan & Claypool, 2017.
Code / Repos
piskvorky/gensim— Word2Vec, Doc2Vec, FastText in Pythonstanfordnlp/GloVe— official GloVe C implementation + pretrained vectorsfacebookresearch/fastText— official FastText; 157 language pretrained modelskarpathy/makemore— character-level LM that evolves from n-grams to transformers
Blog Posts
- Jay Alammar — "The Illustrated Word2Vec" (2019) — best visual walkthrough
- Sebastian Ruder — "On Word Embeddings" parts 1–3
- Chris McCormick — "Word2Vec Tutorial — The Skip-gram Model" (2016)
9. Interview Prep
- Why does Word2Vec work? What objective is it actually optimising?
Skip-gram with negative sampling approximates maximising the log-probability of (center, context) pairs vs. random noise pairs. Levy & Goldberg (2014) showed it implicitly factorises a shifted PMI matrix: each dot product w_i · c_j converges to PMI(i,j) − log k. Geometry emerges because PMI captures distributional similarity. - What is the 3/4-power in the noise distribution and why?
P_noise(w) ∝ count(w)^{0.75}. Raw count would over-sample very frequent words (the, a, is) as negatives — they provide weak training signal because the model already assigns them low dot product with any content word. The 0.75 exponent smoothes toward uniform, giving rare words more negative-sampling exposure. Mikolov found this empirically; no closed-form justification exists. - What is the difference between GloVe and Word2Vec at a mathematical level?
Word2Vec: SGD on local (center, context) windows — implicitly factorises PMI. GloVe: weighted least-squares on the global log co-occurrence matrix — directly factorises log X. GloVe uses every co-occurrence count once (weighted by f(X_ij)); Word2Vec samples pairs proportionally to co-occurrence frequency. In practice both produce similar quality; GloVe is more reproducible, Word2Vec scales better to 100B+ tokens via streaming. - Why do analogy tasks work and when do they fail?
3COSADD: argmax_b* cos(b*, a* − a + b). Works when the analogy relation is linear in embedding space — a necessary but not sufficient condition for distributional similarity to imply semantic structure. Fails for: low-frequency words (sparse training signal), polysemous anchor words, relations that are non-linear (e.g., antonyms which appear in similar contexts → small distance in embedding space). - How does FastText handle OOV at inference time?
It decomposes the OOV word into its character n-grams (same n range used at training time, e.g., 3-6), looks up each n-gram embedding, and sums them. The OOV word has no dedicated embedding — it is composed on the fly. Words sharing morphological roots automatically share most n-gram embeddings, so the resulting vector is semantically reasonable. - Why don't static embeddings handle polysemy, and how did ELMo fix this?
Word2Vec/GloVe assign one vector per word type, computed as a weighted average over all senses in the training corpus. "bank" ends up between its financial and geographic senses. ELMo (Peters et al. 2018) computes embeddings from the hidden states of a bidirectional LSTM trained on a language modeling objective — so the embedding of "bank" in "river bank" and "bank account" differs because the LSTM hidden state captures left/right context. BERT generalised this to transformers. - If you had to pick one embedding method for a production multilingual NER system today, which would you choose and why?
Neither — you'd use a pretrained multilingual BERT (mBERT) or XLM-R, which are contextualised and cover 100+ languages. But if forced to static: FastText's pretrained cc.* vectors for 157 languages, because they handle morphology and OOV, both common in non-English languages. FastText vectors are still the recommended baseline in the spaCy pipeline for languages without transformer coverage.