Part IV — Deep Learning

§ 12 RNN / LSTM / GRU & Pre-Transformer Sequence Models

Vanilla RNN and backpropagation through time, LSTM gating to solve vanishing gradients, GRU as a simpler alternative, seq2seq encoder-decoder with Bahdanau and Luong attention — and why transformers eventually replaced all of them at scale.

1. Overview

Before the transformer, recurrent architectures were the dominant paradigm for sequence data. An RNN maintains a hidden state that is updated at every time step, giving it theoretically unlimited context. In practice, vanilla RNNs suffer from vanishing gradients that prevent learning long-range dependencies. LSTM (1997) and GRU (2014) solved this with gating mechanisms that create a controlled gradient highway through time. By 2017, seq2seq models with attention (encoder-decoder architectures) achieved state-of-the-art machine translation — and that attention mechanism directly inspired the transformer's self-attention.

2. Vanilla RNN & Backpropagation Through Time

The vanilla RNN update equation at time step t:

h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)     # shared W_h, W_x across all t
y_t = W_y · h_t + b_y                          # output at step t (optional)

Parameters:  W_h ∈ R[d_h × d_h],  W_x ∈ R[d_h × d_in],  b ∈ R[d_h]
All time steps share the same W_h and W_x — that is what makes it recurrent.

Training uses backpropagation through time (BPTT): unroll the RNN for T steps, then run standard backprop on the resulting computation graph. The gradient of the loss with respect to h_1 requires multiplying T Jacobian matrices of the tanh together:

∂L/∂h_1 = ∂L/∂h_T · (∏_{t=2}^{T} ∂h_t/∂h_{t-1})
         = ∂L/∂h_T · (∏_{t=2}^{T}  diag(1 - tanh²(z_t)) · W_h )

If ||W_h|| < 1  →  product shrinks exponentially  →  VANISHING GRADIENT
If ||W_h|| > 1  →  product grows exponentially    →  EXPLODING GRADIENT

Practical Fixes for Vanilla RNNs

Problem	Fix	Limitation
Exploding gradients	Gradient clipping: if ‖g‖ > threshold, g ← g · threshold/‖g‖	Must tune threshold; doesn't fix vanishing
Vanishing gradients	Truncated BPTT: only backprop k steps	Loses gradients beyond k — still can't learn long deps
Vanishing gradients	Careful weight init (identity or near-identity for W_h)	Fragile; doesn't scale to sequences > ~50 steps
Both	Replace tanh with ReLU in the hidden unit	Exploding gradient risk increases; training unstable
Both (fundamental)	Use LSTM or GRU gating	Proper solution — next section

3. LSTM — Core Mechanism

Background: Hochreiter & Schmidhuber (1997) identified the vanishing gradient problem precisely and proposed LSTM as a solution. The key observation: a vanilla RNN must route all information through a tanh nonlinearity at every step. The gradient is multiplicatively gated, so it either vanishes or explodes. The fix is to add an explicit cell state whose update is additive — controlled by learnable gates — so the gradient can flow backward with multiplication by only one number (the forget gate value) per step, rather than a full Jacobian.

Plan:

Introduce a separate cell state c_t that carries long-range memory — it lives alongside the hidden state h_t.
Add a forget gate f_t: how much of the previous cell state to retain (0 = erase, 1 = keep).
Add an input gate i_t and cell candidate g_t: how much new information to write.
Add an output gate o_t: how much of the cell state to expose as the hidden state h_t.
Cell state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t — this is additive, so ∂c_t/∂c_{t-1} = f_t ≈ 1, providing a gradient highway.

LSTM Math — Full Equations

Combined input: z = [h_{t-1} ; x_t]      # concat prev hidden + current input

Forget gate:    f_t = σ(W_f · z + b_f)   # what fraction of c_{t-1} to erase
Input gate:     i_t = σ(W_i · z + b_i)   # how much new info to admit
Cell candidate: g_t = tanh(W_g · z + b_g) # proposed new memory content
Output gate:    o_t = σ(W_o · z + b_o)   # what fraction of c_t to expose

Cell state:     c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   # additive update!
Hidden state:   h_t = o_t ⊙ tanh(c_t)

Gradient highway: ∂c_t/∂c_{t-1} = f_t
  When f_t ≈ 1 the gradient flows back across many steps without decay.
  The network LEARNS when to remember (f→1) and when to reset (f→0).

Walkthrough: LSTM remembering subject across a long clause

Initial conditions: Sequence is "The cat, which had been sitting on the mat for hours, <verb>". We want the model to remember "cat" (singular) by the time it predicts the verb.

Step	Token	Gate behaviour (learned)	Cell state c_t
1	The	i≈0.7 writes article info; f≈0.3 erases init zeros	c encodes article slot
2	cat	i≈0.9 writes NOUN/SINGULAR strongly; f≈0.9 keeps article	c encodes [article, singular noun]
3	,	i≈0.1 small write; f≈0.95 keeps cat info	c preserves cat memory
4–9	which … hours	f≈0.95 at each step — cell state protected; o≈0.5 lets clause context into h	c still encodes cat; h tracks local clause
10	<verb>	i writes verb features; f still ≈0.9 preserves cat	c retains singularity → model predicts 'sat' not 'sat'

Why LSTM works as a gradient highway: During backprop, the gradient of the loss w.r.t. c_{t-1} is approximately f_t (a number between 0 and 1, learned to be near 1 when memory is needed). Over 8 steps with f ≈ 0.95, the gradient decays by only 0.95^8 ≈ 0.66 — compare to a vanilla RNN where it would decay by (0.7)^8 ≈ 0.06.

4. GRU — Simplified Gating

Cho et al. (2014) introduced the Gated Recurrent Unit (GRU) while working on seq2seq translation. GRU merges the cell state and hidden state into a single vector and uses only two gates, making it faster and simpler than LSTM without a significant accuracy drop on most tasks.

Reset gate:   r_t = σ(W_r · [h_{t-1} ; x_t])     # 0 = ignore previous hidden
Update gate:  z_t = σ(W_z · [h_{t-1} ; x_t])     # 0 = use new, 1 = keep old

Candidate:    h̃_t = tanh(W · [r_t ⊙ h_{t-1} ; x_t])  # new proposal
Output:       h_t = (1 - z_t) ⊙ h̃_t + z_t ⊙ h_{t-1}  # interpolate

GRU vs LSTM:
  - One hidden vector (no separate cell state)
  - Two gates instead of three
  - ~33% fewer parameters
  - Roughly same quality; prefer LSTM when absolute accuracy matters,
    GRU when speed or simplicity is the priority

Property	Vanilla RNN	LSTM	GRU
State vectors	1 (h_t)	2 (h_t, c_t)	1 (h_t)
Gates	0	3 (f, i, o)	2 (r, z)
Params vs RNN	1×	4×	3×
Vanishing gradient	Severe	Solved (additive cell)	Solved (update gate)
Training stability	Poor	Good	Good
Long-range deps	Fails > ~20 steps	Works up to ~200 steps	Similar to LSTM
Parallelism	None (sequential)	None (sequential)	None (sequential)

5. seq2seq Encoder-Decoder & Attention

The encoder-decoder (seq2seq) architecture for machine translation was introduced by Sutskever, Vinyals & Le (2014). The encoder LSTM reads the source sentence and compresses it into a single context vector (the final hidden state h_S). The decoder LSTM generates one target token at a time, conditioned on that context vector.

ENCODER:  for x_1,...,x_S:  h_t = LSTM(x_t, h_{t-1})  → context = h_S

DECODER:  s_0 = h_S
          for each output position t:
            y_t, s_t = LSTM(y_{t-1}, s_{t-1}, context)   # teacher forcing during training
            output   = softmax(W_out · s_t)

Teacher forcing: during training, feed the GROUND TRUTH y_{t-1} as decoder input
                 (not the model's own prediction). Speeds up convergence but creates
                 a train/test mismatch — at inference the model must use its own output.

The Information Bottleneck Problem

Compressing an entire variable-length sentence into a single fixed-size vector h_S creates a bottleneck. For long sentences (30+ tokens) the encoder hidden state cannot retain enough information, causing translation quality to degrade sharply — as observed empirically by Bahdanau et al. (2014, arXiv:1409.0473).

Bahdanau Attention — the Fix

Instead of using only h_S, allow the decoder to attend over all encoder hidden states h_1...h_S using a soft weighted average. The weights (attention scores) are computed from the decoder's current query and each encoder state:

Bahdanau (additive) attention:
  e_{ti} = v^T · tanh(W_h · h_i + W_s · s_{t-1})   # one scalar per source position
  α_{ti} = softmax(e_{t1}, ..., e_{tS})              # attention weights, sum = 1
  c_t    = Σ_i α_{ti} · h_i                          # context vector for step t

Luong (multiplicative) attention:
  dot:     score(h_i, s) = h_i · s
  general: score(h_i, s) = h_i^T · W_a · s
  concat:  score(h_i, s) = v^T · tanh(W_a · [h_i ; s])   # ≈ Bahdanau

The attention map α_{ti} can be visualised as a matrix [T_out × S_in] —
it shows which source words the decoder looks at when generating each target word.

Attention as soft alignment: For English→French translation, the attention weight α_{ti} is highest when source token i is the most relevant for generating target token t. The learned attention matrices often recover linguistically meaningful alignments without any alignment supervision — a precursor to the transformer's self-attention.

6. Why Transformers Replaced RNNs at Scale

By 2017, seq2seq + attention models reached strong BLEU scores on translation, but three systemic limitations prevented them from scaling to the largest datasets and compute budgets:

Limitation	RNN / LSTM root cause	Transformer fix
No parallelism over time	h_t depends on h_{t-1}: step T cannot start until step T-1 finishes. GPU utilization is low; training is slow.	Self-attention computes all positions simultaneously. Full GPU parallelism during training.
Sequential bottleneck at inference	Prefill and decode both require stepping through T tokens one at a time.	Prefill processes all tokens in parallel (O(T) steps). Decode still sequential but KV-cache is efficient.
Quadratic path length	Info from token 1 must travel through T-1 hidden state updates to reach token T. Each hop loses fidelity.	Self-attention connects every pair of positions directly in O(1) hops. Path length = 1 regardless of distance.
Limited long-range memory	Even LSTM struggles beyond ~200 tokens in practice (forget gate too small or too large).	Global attention: every token attends to every other token with equal access regardless of distance.

Note: Recurrent architectures are not dead. Mamba (2023, arXiv:2312.00752), RWKV (2023), and RetNet (2023) revisit SSMs and linear recurrences, achieving transformer-quality results with linear inference complexity. The idea that "sequence = compute through time" is being revisited — but with hardware-aware designs, not the naïve BPTT approach.

7. Minimal Demos

Demo A — LSTM Gates Step-by-Step

Feed 5 tokens through an nn.LSTMCell and manually extract the forget, input, and output gate activations at each step. Watch how the cell norm grows as memory accumulates, and how h norm is controlled by the output gate. Change d_in and d_h to see how gate statistics shift with hidden size.

LSTM Gates Step-by-Step — C Demo

stdin (optional)

Demo B — Bahdanau vs Luong Attention

Given 6 encoder hidden states and a decoder query, compute attention weights with both Bahdanau (additive) and Luong (dot-product and general) scoring functions. The bar chart shows which source positions each variant focuses on. Try changing S and d to see how attention distributes over longer sequences.

Bahdanau vs Luong Attention — C Demo

stdin (optional)

8. Production / Source Pointers

Concept	File / function
LSTMCell forward	torch/nn/modules/rnn.py — LSTMCell.forward() and _VF.lstm_cell()
Fused LSTM kernel	aten/src/ATen/native/cudnn/RNN.cpp — cudnn_rnn() dispatches to cuDNN
LSTM with proj (LSTMCell + projection)	nn.LSTM(proj_size=k) — reduces h_t to k dims before recurrence
Bidirectional LSTM	nn.LSTM(bidirectional=True) — runs forward + backward, doubles d_h
Packed sequences for variable length	torch.nn.utils.rnn.pack_padded_sequence / pad_packed_sequence
seq2seq attention (fairseq)	fairseq/models/lstm.py — LSTMDecoder.attention (LuongAttentionLayer)
Bahdanau attention ref impl	tensorflow/nmt — attention_mechanism.py (BahdanauAttention)
Modern SSM (Mamba)	state-spaces/mamba — mamba_ssm/modules/mamba_simple.py

9. References

Papers

Hochreiter & Schmidhuber 1997 — Long Short-Term Memory (Neural Computation 9(8)). The original LSTM paper.
Cho et al. 2014 — Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (GRU; arXiv:1406.1078)
Sutskever, Vinyals & Le 2014 — Sequence to Sequence Learning with Neural Networks (seq2seq; arXiv:1409.3215)
Bahdanau, Cho & Bengio 2014 — Neural Machine Translation by Jointly Learning to Align and Translate (additive attention; arXiv:1409.0473)
Luong, Pham & Manning 2015 — Effective Approaches to Attention-based Neural Machine Translation (dot/general attention; arXiv:1508.04025)
Graves, Mohamed & Hinton 2013 — Speech Recognition with Deep Recurrent Neural Networks (bidirectional LSTM + CTC; arXiv:1303.5778)
Pascanu, Mikolov & Bengio 2013 — On the Difficulty of Training Recurrent Neural Networks (vanishing/exploding gradient analysis; arXiv:1211.5063)
Vaswani et al. 2017 — Attention Is All You Need (Transformer; arXiv:1706.03762) — replaces RNN with self-attention
Gu & Dao 2023 — Mamba: Linear-Time Sequence Modeling with Selective State Spaces (arXiv:2312.00752)

Lectures

Stanford CS224n — Lecture 5: RNNs and Language Models and Lecture 7: Machine Translation, Attention, Subword Models (Christopher Manning)
MIT 6.S191 — Lecture 2: Recurrent Neural Networks (Alexander Amini)
NYU DS-GA 1008 — Yann LeCun: Recurrent Networks and their Applications
CMU 11-785 — Deep Learning: Recurrent Neural Networks (Bhiksha Raj)
fast.ai Practical Deep Learning — Lesson 8: RNNs from scratch (Jeremy Howard)
Karpathy — The Unreasonable Effectiveness of Recurrent Neural Networks (2015 blog post, accompanying code walk)

Textbooks

Goodfellow, Bengio & Courville — Deep Learning (MIT Press, free at deeplearningbook.org) — Chapter 10: Sequence Modeling with Recurrent and Recursive Nets
Zhang et al. — Dive Into Deep Learning (d2l.ai, free) — Chapters 9–10: RNNs, LSTMs, GRUs, seq2seq
Jurafsky & Martin — Speech and Language Processing (3rd ed., free draft) — Chapter 9: RNNs and LSTMs

Code / Repos

karpathy/char-rnn — classic character-level LSTM language model
pytorch/examples/word_language_model — LSTM language model with tied embeddings
pytorch/fairseq — production seq2seq with Bahdanau/Luong attention (fairseq/models/lstm.py)
state-spaces/mamba — modern SSM alternative to transformers (2023)

Blog Posts

Colah — Understanding LSTM Networks (colah.github.io, 2015): the definitive visual walkthrough of LSTM gates
Karpathy — The Unreasonable Effectiveness of Recurrent Neural Networks (karpathy.github.io, 2015)
Olah & Carter — Attention and Augmented Recurrent Neural Networks (distill.pub, 2016)
Lilian Weng — Attention? Attention! (lilianweng.github.io, 2018): comprehensive survey of attention variants

10. Interview Prep

Q1. Derive the vanishing gradient problem in a vanilla RNN. What quantity shrinks exponentially and why?

∂L/∂h_1 = ∂L/∂h_T · ∏_{t=2}^{T} (diag(1−tanh²(z_t)) · W_h). Each factor is the Jacobian of tanh times W_h. If the spectral radius of W_h is < 1, all eigenvalues of the product shrink geometrically → gradient vanishes. If > 1 → explodes. The tanh saturation makes eigenvalues of the Jacobian small (bounded by 1), amplifying the vanishing direction.

Q2. How does the LSTM cell state update c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t solve vanishing gradients?

∂c_t/∂c_{t-1} = f_t, a scalar-per-dimension (elementwise). When the network learns to remember (f_t ≈ 1), the gradient flows back with multiplication by ≈1 — no Jacobian of tanh in the path. Over T steps the gradient decays by ∏ f_t, which the network can learn to keep close to 1 for important information. This is the 'gradient highway' through the cell state.

Q3. What is the difference between Bahdanau and Luong attention?

Bahdanau (additive): score = v^T tanh(W_h h_i + W_s s). Has separate learned projections W_h and W_s for encoder and decoder states; uses an MLP. Luong (dot/general): score = h_i · s (dot) or h_i^T W_a s (general). Simpler, no separate MLP layer, requires encoder and decoder to have the same dimension. Luong attention is applied after the LSTM step (input-feeding); Bahdanau is computed from the previous decoder state.

Q4. What is teacher forcing? Why does it create a train/test mismatch?

During training, the ground-truth token y_{t-1} is fed as the decoder input at step t, regardless of what the model would have predicted. This makes gradient signals stable and speeds convergence. At test time, the model must feed its own (potentially wrong) previous output — so errors compound. This is called 'exposure bias'. Scheduled sampling (Bengio et al. 2015) gradually switches from teacher-forced to self-fed inputs during training to close the gap.

Q5. Name three concrete reasons transformers replaced LSTMs for large-scale language modeling.

(1) Parallelism: transformer attention computes all positions simultaneously; LSTM must step sequentially — GPUs are wasted. (2) Path length: any two tokens are connected in O(1) hops via attention vs O(T) hops through LSTM hidden states. (3) Scaling: transformer FLOPs scale as O(T²·d) which is predictable and GPU-friendly; LSTM FLOPs scale as O(T·d²) with poor GPU utilization due to sequential dependency.

Q6. What is the GRU update gate z_t doing at z_t ≈ 0 vs z_t ≈ 1?

h_t = (1−z_t) ⊙ h̃_t + z_t ⊙ h_{t-1}. z_t ≈ 1: output is nearly the previous hidden state — the GRU ignores the new input and retains memory (similar to LSTM forget gate ≈ 1). z_t ≈ 0: output is nearly the candidate h̃_t — the GRU fully updates from the new input (similar to LSTM forget gate ≈ 0 + input gate ≈ 1). GRU merges forget and input decisions into a single update gate.

Q7. Why does a seq2seq model with attention still struggle at very long sequences (1000+ tokens), even though attention attends over all encoder states?

Three reasons: (1) The encoder LSTM that produces h_1...h_S still has vanishing gradient across 1000 steps — the early encoder states h_1,...,h_k carry weak signals. (2) Attention over S=1000 is O(S²) but the RNN computation is also O(S) sequential steps — training is slow. (3) The decoder attends over all states but the query (decoder hidden state) is itself from a shallow LSTM that may not capture complex cross-sequence dependencies. Transformers fix all three via parallel self-attention on both encoder and decoder.

Q8. What is truncated BPTT and why is it used?

Instead of unrolling the full T-step sequence for backprop, truncate after k steps (typically k=32..128). This caps the length of the gradient highway — useful when T is very large (e.g., 10K characters in char-rnn). The tradeoff: gradients beyond k steps are zeroed out, so the model cannot learn dependencies longer than k. The hidden state is still carried forward (only gradients are truncated), so the model can in principle maintain representations across the full sequence, but the gradient signal that drives learning is limited to k steps.