Part V — NLP Foundations

§ 5.2 Word Embeddings: Word2Vec, GloVe, FastText

Distributional hypothesis → dense vector representations → analogy arithmetic. The mathematics that turned NLP from sparse bag-of-words into geometry.

1. Overview

John Rupert Firth wrote in 1957: "you shall know a word by the company it keeps." This distributional hypothesis — that words with similar meanings appear in similar contexts — is the foundation of every embedding method, from Word2Vec to the attention matrix inside GPT-4. Word embeddings translate that intuition into dense real-valued vectors: words with similar contexts end up nearby in ℝd, and the direction of offset between vectors encodes semantic relationships.

The lineage from static to contextualised embeddings is a straight line: Word2Vec (2013) proved that a shallow neural net trained on a prediction objective produces geometrically meaningful vectors. GloVe (2014) showed the same can be achieved by factorising a log co-occurrence matrix. FastText (2017) extended both with subword n-grams. ELMo (2018) and BERT (2018) then made embeddings context-dependent — the same word gets a different vector depending on its sentence, solving the polysemy problem.

2. Word2Vec: Skip-gram and CBOW

Mikolov et al. (2013) introduced two architectures. Both train a shallow neural network whose weight matrix becomes the word embeddings — the prediction task is just a vehicle for learning the geometry.

2.1 Skip-gram Architecture

Skip-gram takes a center word and predicts each of its context words independently. The model is a lookup table (the embedding matrix) followed by a dot product and softmax.

Skip-gram objective (full softmax):
  J = -Σ_t Σ_{-k≤j≤k, j≠0} log P(w_{t+j} | w_t)

  P(o | c) = exp(u_o · v_c) / Σ_{w=1}^{V} exp(u_w · v_c)
             ─────────────────────────────────────────────
             v_c = input embedding (row of W_in)
             u_o = output embedding (row of W_out)

Window size k = 2 → 4 predictions per center word

2.2 Negative Sampling — Core Mechanism

Background: The full softmax denominator sums over all V ≈ 50 000 tokens at every gradient step — O(V) per update. At scale this is prohibitive: a 100M-word corpus with a window of 2 generates ~400M (center, context) pairs.

Plan:

  1. Treat each (center, context) pair as a positive example (label 1).
  2. Draw k random noise words from the corpus unigram distribution raised to the ¾ power: P_noise(w) ∝ count(w)^{3/4}. Label them 0.
  3. Binary-cross-entropy on positive + k negatives replaces softmax.
  4. Gradient cost drops from O(V) to O(k), where k = 5–20 in practice.
Negative sampling objective per positive pair (c, o):

  J_NS = log σ(v_c · u_o) + Σ_{i=1}^{k} E_{n_i~P_noise} [log σ(−v_c · u_{n_i})]

  σ(x) = 1/(1+exp(−x))

P_noise(w) ∝ count(w)^{0.75}  — the 3/4-power "smoothes" towards uniform;
   rare words get more training signal as negatives than with raw counts

Example walkthrough — sentence "the cat sat", window=1, k=2:

StepCenterContext / NoiseLabelLoss term
+catthe1log σ(v_cat · u_the)
+catsat1log σ(v_cat · u_sat)
catking (sampled)0log σ(−v_cat · u_king)
catwalk (sampled)0log σ(−v_cat · u_walk)

After this update: v_cat moves closer to u_the and u_sat (context it really appears with), and further from u_king and u_walk (random noise). Repeat over a billion tokens and the geometry converges to encode semantic relationships.

3. GloVe — Global Vectors

Pennington et al. (2014) observed that Word2Vec implicitly factorises a shifted pointwise mutual information (PMI) matrix via stochastic gradient descent — but does so inefficiently because it sees only a tiny local window at a time. GloVe builds the full co-occurrence matrix first (one pass over the corpus) and then fits embeddings to it directly, exploiting global statistics from the start.

GloVe objective:
  J = Σ_{i,j=1}^{V} f(X_ij) (w_i · w_j + b_i + b_j − log X_ij)²

  f(x) = (x / x_max)^α   if x < x_max     x_max=100, α=0.75
          1                otherwise

Key insight — ratio of co-occurrence probabilities:
  log P(solid | ice) / P(solid | steam)  ≈ large positive  (ice is solid)
  log P(gas   | ice) / P(gas   | steam)  ≈ large negative  (steam is gas)
  log P(water | ice) / P(water | steam)  ≈ 0               (both relate to water)

  This ratio is what the vector offset w_ice − w_steam should encode.

4. FastText — Subword Embeddings

Both Word2Vec and GloVe represent each word as a single atomic unit. This breaks for morphologically rich languages and for any out-of-vocabulary (OOV) token. Bojanowski et al. (2017) decompose each word into character n-grams and represent the word as the sum of its n-gram embeddings.

ModelOOV wordsMorphologyTraining costBest for
Word2Vec SGzero vector / UNKnonefast (neg. sampling)English, large corpora
GloVezero vector / UNKnoneone-time matrix buildstable, pretrained vecs widely available
FastTextinferred from n-gramsstrong~3× slower than W2VMorphologically rich languages, code

5. Analogy Arithmetic & Evaluation

The most famous property of word vectors is that vector arithmetic encodes semantic relationships: king − man + woman ≈ queen. This works because directions in embedding space encode relational structure.

5.1 Intrinsic Evaluation

Intrinsic benchmarks measure embedding quality directly, without a downstream task:

  • Word Analogy (Google dataset): 19 564 analogies, 3COSADD method. Accuracy: W2V=65%, GloVe=75%, FastText=78% on syntactic analogies.
  • Word Similarity (WordSim-353, SimLex-999): Spearman correlation between cosine similarity and human judgements.
  • Nearest-neighbour coherence: top-10 neighbours should be semantically related.

5.2 Extrinsic Evaluation

Extrinsic evaluation plugs embeddings into a downstream task and measures that task's performance. Better embeddings → better downstream scores, but the correlation is imperfect: an embedding that wins on analogy may lose on NER. Extrinsic evaluation is more expensive but more reliable.

TaskWhat is measuredMetric
Sentiment classification (SST-2)Sentiment signal in embedding spaceAccuracy
Named Entity Recognition (CoNLL-2003)Entity boundary + type geometryF1
POS tagging (Penn Treebank)Syntactic cluster qualityAccuracy

5.3 Limitations of Static Embeddings

Static embeddings assign one vector per word type, regardless of context. This creates fundamental limitations:

  • Polysemy: "bank" (river) and "bank" (financial) get the same vector — a weighted average of both senses.
  • OOV: Word2Vec and GloVe fail on words not in the training vocabulary. FastText mitigates this but doesn't eliminate it.
  • Context blindness: "I made her duck" — is "duck" a verb or noun? Static embeddings can't tell.

ELMo (2018) addressed this by using the BiLSTM hidden state at each word's position as its embedding — so the same word gets different vectors depending on its context. BERT (2018) replaced the BiLSTM with a bidirectional transformer, producing still richer context-dependent representations. These remain the same core idea: dense representation derived from prediction, now conditioned on context.

6. Minimal Demos

Demo 1 — Word2Vec Analogy Explorer

Analogy arithmetic on a hand-crafted 5-dimensional embedding space. Demonstrates the king − man + woman ≈ queen relationship via cosine nearest-neighbour search.

Word2Vec Analogy Arithmetic — C Demo
stdin (optional)

Demo 2 — Skip-gram Training from Scratch

Train a skip-gram model with negative sampling entirely in pure Python (no NumPy). After training, nearest-neighbour lookup shows that "cat" is close to "kitten" — a relationship only present because they co-occur in similar contexts.

Skip-gram Training from Scratch — C Demo
stdin (optional)

7. Production / Source Pointers

ToolRepo / packageKey entry point
Word2Vec (original C)google-code-archive word2vecword2vec.c :: TrainModel()
Word2Vec (Gensim Python)piskvorky/gensimgensim.models.Word2Vec; .wv.most_similar()
GloVestanfordnlp/GloVesrc/glove.c :: GloVe(); pretrained vectors at nlp.stanford.edu
FastTextfacebookresearch/fastTextfasttext.train_unsupervised(); cc.en.300.bin pretrained
nn.Embedding (PyTorch)pytorch/pytorchtorch.nn.Embedding(V, d); from_pretrained(weights)
Evaluation harnesskudkudak/word-embeddings-benchmarksscripts/evaluate_on_all.py

8. References

Papers

  • Mikolov, T. et al. (2013). "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781.
  • Mikolov, T. et al. (2013). "Distributed Representations of Words and Phrases and their Compositionality." arXiv:1310.4546. (Negative sampling paper)
  • Pennington, J., Socher, R., Manning, C. (2014). "GloVe: Global Vectors for Word Representation." EMNLP 2014.
  • Bojanowski, P. et al. (2017). "Enriching Word Vectors with Subword Information." arXiv:1607.04606. (FastText)
  • Levy, O. & Goldberg, Y. (2014). "Neural Word Embedding as Implicit Matrix Factorization." NeurIPS 2014.
  • Peters, M. et al. (2018). "Deep Contextualized Word Representations." arXiv:1802.05365. (ELMo)
  • Firth, J. R. (1957). "A synopsis of linguistic theory 1930–55." Studies in Linguistic Analysis.

Lectures

  • Stanford CS224n (Manning) — Lecture 1: Word Vectors; Lecture 2: Word Vectors 2 & Word Senses
  • Stanford CS224n — Lecture notes on GloVe, negative sampling derivation
  • CMU 11-711 — Advanced NLP: Representation Learning module
  • NYU DS-GA 1008 (LeCun) — Lecture on energy-based learning & word vectors

Textbooks

  • Jurafsky & Martin — Speech and Language Processing, 3rd ed. Chapter 6: Vector Semantics and Embeddings.
  • Goldberg, Y. — Neural Network Methods for Natural Language Processing. Morgan & Claypool, 2017.

Code / Repos

  • piskvorky/gensim — Word2Vec, Doc2Vec, FastText in Python
  • stanfordnlp/GloVe — official GloVe C implementation + pretrained vectors
  • facebookresearch/fastText — official FastText; 157 language pretrained models
  • karpathy/makemore — character-level LM that evolves from n-grams to transformers

Blog Posts

  • Jay Alammar — "The Illustrated Word2Vec" (2019) — best visual walkthrough
  • Sebastian Ruder — "On Word Embeddings" parts 1–3
  • Chris McCormick — "Word2Vec Tutorial — The Skip-gram Model" (2016)

9. Interview Prep

  1. Why does Word2Vec work? What objective is it actually optimising?
    Skip-gram with negative sampling approximates maximising the log-probability of (center, context) pairs vs. random noise pairs. Levy & Goldberg (2014) showed it implicitly factorises a shifted PMI matrix: each dot product w_i · c_j converges to PMI(i,j) − log k. Geometry emerges because PMI captures distributional similarity.
  2. What is the 3/4-power in the noise distribution and why?
    P_noise(w) ∝ count(w)^{0.75}. Raw count would over-sample very frequent words (the, a, is) as negatives — they provide weak training signal because the model already assigns them low dot product with any content word. The 0.75 exponent smoothes toward uniform, giving rare words more negative-sampling exposure. Mikolov found this empirically; no closed-form justification exists.
  3. What is the difference between GloVe and Word2Vec at a mathematical level?
    Word2Vec: SGD on local (center, context) windows — implicitly factorises PMI. GloVe: weighted least-squares on the global log co-occurrence matrix — directly factorises log X. GloVe uses every co-occurrence count once (weighted by f(X_ij)); Word2Vec samples pairs proportionally to co-occurrence frequency. In practice both produce similar quality; GloVe is more reproducible, Word2Vec scales better to 100B+ tokens via streaming.
  4. Why do analogy tasks work and when do they fail?
    3COSADD: argmax_b* cos(b*, a* − a + b). Works when the analogy relation is linear in embedding space — a necessary but not sufficient condition for distributional similarity to imply semantic structure. Fails for: low-frequency words (sparse training signal), polysemous anchor words, relations that are non-linear (e.g., antonyms which appear in similar contexts → small distance in embedding space).
  5. How does FastText handle OOV at inference time?
    It decomposes the OOV word into its character n-grams (same n range used at training time, e.g., 3-6), looks up each n-gram embedding, and sums them. The OOV word has no dedicated embedding — it is composed on the fly. Words sharing morphological roots automatically share most n-gram embeddings, so the resulting vector is semantically reasonable.
  6. Why don't static embeddings handle polysemy, and how did ELMo fix this?
    Word2Vec/GloVe assign one vector per word type, computed as a weighted average over all senses in the training corpus. "bank" ends up between its financial and geographic senses. ELMo (Peters et al. 2018) computes embeddings from the hidden states of a bidirectional LSTM trained on a language modeling objective — so the embedding of "bank" in "river bank" and "bank account" differs because the LSTM hidden state captures left/right context. BERT generalised this to transformers.
  7. If you had to pick one embedding method for a production multilingual NER system today, which would you choose and why?
    Neither — you'd use a pretrained multilingual BERT (mBERT) or XLM-R, which are contextualised and cover 100+ languages. But if forced to static: FastText's pretrained cc.* vectors for 157 languages, because they handle morphology and OOV, both common in non-English languages. FastText vectors are still the recommended baseline in the spaCy pipeline for languages without transformer coverage.