§ 3 · Linear Algebra for ML
Vectors, matrices, eigendecomposition, SVD, and matrix calculus — the language every transformer operation is written in.
1. Overview
Every LLM forward pass is a cascade of matrix operations. Token embeddings are row lookups in a V × d_model weight matrix. Attention reduces to batched matrix products QKᵀ / √d_k. The feed-forward layer is two back-to-back linear maps with a nonlinearity between them. Gradient descent updates each weight via the chain rule, which in matrix form becomes outer-product accumulation. LoRA fine-tuning works because weight-update matrices are empirically low-rank — a direct application of SVD. Fluency in these four pillars — vector algebra, matrix algebra, SVD, and matrix calculus — is the entry ticket for reading LLM papers without stopping at every paragraph.
2. Key Concepts
Vectors and Norms
A vector x ∈ ℝⁿ is a column of scalars. Three norms appear throughout ML:
| Norm | Formula | Used for in ML |
|---|---|---|
| L1 | ∑ |xᵢ| | sparsity (Lasso regularization) |
| L2 | √(∑ xᵢ²) | Euclidean distance, weight decay, gradient clipping |
| L∞ | max |xᵢ| | quantization range, clipping threshold |
Cosine similarity: cos(u, v) = (u·v) / (‖u‖₂ · ‖v‖₂) — measures directional alignment regardless of magnitude. Used in embedding nearest-neighbor search, attention pattern analysis, and evaluating whether two token representations encode the same concept.
Matrix Multiplication and Shape Algebra
The inner dimensions must match: [M×K] @ [K×N] → [M×N]. Each output element is a dot product of K values, so total FLOPs = 2MKN (one multiply and one add per pair). PyTorch`s @ operator broadcasts over arbitrary leading batch dimensions, making attention`s[B,H,T,dk] @ [B,H,dk,T] a single call.
| Operation | PyTorch call | Shape | FLOPs |
|---|---|---|---|
| Linear layer | F.linear(x, W, b) | [B,T,D] → [B,T,D_out] | 2·B·T·D·D_out |
| Attention scores | Q @ K.transpose(-2,-1) | [B,H,T,dk] → [B,H,T,T] | 2·B·H·T²·dk |
| Attention output | softmax(scores) @ V | [B,H,T,T]@[B,H,T,dk] | 2·B·H·T²·dk |
| FFN (d_ff = 4D) | x@W1.T ; h@W2.T | [B,T,D]→[B,T,4D]→[B,T,D] | 2·8·B·T·D² = 16BTD² |
Rank, Trace, and Frobenius Norm
The rank of a matrix is the dimension of its column space — the number of linearly independent columns (= rows). A full-rank weight matrix has rank = min(rows, cols). Empirically, weight updates during fine-tuning concentrate in a low-dimensional subspace (rank ≪ full), which is the empirical foundation of LoRA.
The trace tr(A) = ∑ Aᵢᵢ = ∑ λᵢ is the sum of diagonal entries, equal to the sum of eigenvalues. It appears in Fisher information and nuclear-norm regularization.
The Frobenius norm ‖A‖_F = √(∑ᵢⱼ Aᵢⱼ²) = √(tr(AᵀA)) = √(∑ σᵢ²) is the most common matrix norm in DL: L2 weight decay penalizes it, reconstruction error in SVD truncation measures it, and gradient clipping uses it as the global norm.
Eigendecomposition
For a square symmetric matrix A, the eigendecomposition is A = QΛQᵀ, where columns of Q are orthonormal eigenvectors and Λ is diagonal with eigenvalues λ₁ ≥ λ₂ ≥ … . The eigenvalue λᵢ tells you how much A stretches a vector in the direction of eigenvector qᵢ. A matrix is positive semi-definite (PSD) iff all λᵢ ≥ 0.
Matrix Calculus
Backpropagation is matrix calculus repeated through each layer. For the linear layer y = Wx + bthe three key gradients are:
∂L/∂W = (∂L/∂y) ⊗ xᵀ shape: [d_out × d_in] ← outer product of two vectors ∂L/∂x = Wᵀ · (∂L/∂y) shape: [d_in] ← backward pass through W ∂L/∂b = ∂L/∂y shape: [d_out] chain rule across layers: ∂L/∂xˡ = (Wˡ)ᵀ · δˡ where δˡ = ∂L/∂yˡ
Each per-sample gradient ∂L/∂W is a rank-1 matrix (outer product of two vectors). Accumulated over a mini-batch of B samples, it becomes a sum of B rank-1 matrices. In practice the accumulated gradient still lies in a low-rank subspace relative to the full d_out×d_in weight matrix — this empirical fact directly motivates LoRA.
3. Core Mechanism: SVD and Low-Rank Approximation
SVD is the most powerful matrix decomposition. Unlike eigendecomposition it works on any matrix — rectangular or square, symmetric or not — and it is the mathematical foundation of PCA, pseudo-inverse computation, matrix compression, and LoRA.
Background
A pre-trained weight matrix W ∈ ℝ^(d×d) has ~10⁸ parameters for d = 10,000. Fine-tuning all of them is expensive and often leads to catastrophic forgetting. Aghajanyan et al. (2020) showed that the weight changes ΔW during fine-tuning have an intrinsic dimension far smaller than d × d — they concentrate in a subspace of rank r ≈ 4–64. If we parameterize ΔW = B·A with B ∈ ℝ^(d×r) and A ∈ ℝ^(r×d), trainable parameters drop from d² ≈ 10⁸ to 2dr ≈ 10⁵ — a 1000× reduction. The Eckart-Young theorem (the heart of SVD truncation) says this is the best possible rank-r approximation in Frobenius norm.
Plan
- Compute full SVD: A = UΣVᵀ, where U ∈ ℝ^(m×m) is orthonormal, Σ ∈ ℝ^(m×n) is diagonal with σ₁ ≥ σ₂ ≥ … ≥ 0, and Vᵀ ∈ ℝ^(n×n) is orthonormal.
- Truncate: keep only the top-k singular triplets (uᵢ, σᵢ, vᵢ). The Eckart-Young theorem guarantees Aₖ = UₖΣₖVₖᵀ minimizes ‖A − Aₖ‖_F over all rank-k matrices.
- Reconstruct: Aₖ = Uₖ ∈ ℝ^(m×k) · diag(σ₁,…,σₖ) · Vₖᵀ ∈ ℝ^(k×n).
- Store as a LoRA pair: B = Uₖ · √Σₖ ∈ ℝ^(m×k) and A = √Σₖ · Vₖᵀ ∈ ℝ^(k×n), so B·A = Aₖ exactly.
Walkthrough — 4×4 matrix at rank 2
Initial conditions: A is a 4×4 matrix (e.g. d_in = d_out = 4) with singular values σ = [5.2, 3.1, 0.4, 0.1] and we target rank k = 2.
Step 1 — Full SVD
A (4×4) = U (4×4) · diag(5.2, 3.1, 0.4, 0.1) · Vᵀ (4×4)
Step 2 — Truncate to k=2
Uₖ = U[:, :2] shape: [4,2]
Σₖ = diag(5.2, 3.1)
Vₖᵀ = Vᵀ[:2, :] shape: [2,4]
Step 3 — Reconstruction quality
Explained variance:
(5.2² + 3.1²) / (5.2² + 3.1² + 0.4² + 0.1²)
= (27.04 + 9.61) / (27.04 + 9.61 + 0.16 + 0.01)
= 36.65 / 36.82 ≈ 99.5%
Step 4 — LoRA parametrisation
B = Uₖ · √Σₖ shape: [4,2] (8 params)
A = √Σₖ · Vₖᵀ shape: [2,4] (8 params)
B @ A = Aₖ ✓ total: 16 params vs 16 (trivial here at d=4)
At real scale d=4096, r=8:
full rank: 4096² = 16,777,216 params
LoRA: 8 × (4096+4096) = 65,536 params → 256× compressionLoRA Initialization in Practice
LoRA does not initialize A and B from an explicit SVD of the original W. Instead it initializes A with random Gaussian noise and B with zeros — ensuring ΔW = B·A = 0 at training start so the model starts from the exact pretrained checkpoint. The low-rank structure emerges naturally as gradient descent updates A and B: the gradient of L with respect to B is a rank-r outer product, maintaining the low-rank bias throughout training.
4. Minimal Demos
Demo 1 — Matrix Shape Playground
Enter B T D H (batch size, sequence length, model dimension, number of heads). The demo computes output shapes and FLOPs for projection, multi-head attention scores, attention output, and FFN — making abstract shape algebra tangible.
Demo 2 — SVD Truncation Explorer
Enter rows cols rank_k. The demo runs full SVD on a random matrix, builds the rank-k approximation, and reports reconstruction error, explained variance, and storage savings. Modify rank_k from 1 to min(rows,cols) to see the quality-compression tradeoff at the heart of LoRA.
5. Production & Source Pointers
| Operation | PyTorch symbol | Source / backend |
|---|---|---|
| Dense linear layer | torch.nn.functional.linear | torch/nn/functional.py → cuBLAS gemm |
| Batched matmul | torch.bmm / torch.matmul | aten/src/ATen/native/LinearAlgebra.cpp |
| SVD | torch.linalg.svd | → cuSOLVER gesvd (GPU) / LAPACK dgesdd (CPU) |
| Symmetric eigendecomp | torch.linalg.eigh | → cuSOLVER dsyevd (GPU) / LAPACK dsyevd (CPU) |
| LoRA adapter layer | peft.tuners.lora.Linear | peft/tuners/lora/layer.py |
| Gradient (autograd) | loss.backward() | torch/autograd/engine.py (C++ AccumulateGrad) |
torch.linalg.svd dispatches to cusolverDnSgesvdj (Jacobi SVD on GPU, faster for small matrices) or cusolverDnSgesvd (divide-and-conquer for large ones). Computing SVD of a full d×d weight matrix is O(d³) — prohibitive for d = 4096+ at every step. LoRA avoids it entirely by directly parameterizing B and A and letting SGD find the optimal low-rank decomposition.6. References
Papers
- Hu et al. 2021 — LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685)
- Aghajanyan et al. 2020 — Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning (arXiv:2012.13255)
- Vaswani et al. 2017 — Attention is All You Need (arXiv:1706.03762) — √d_k scaling derivation
- Liu et al. 2024 — DoRA: Weight-Decomposed Low-Rank Adaptation (arXiv:2402.09353)
- Zhao et al. 2024 — GaLore: Memory-Efficient LLM Training via Gradient Low-Rank Projection (arXiv:2403.03507)
Lectures
- MIT 18.06 — Gilbert Strang, Linear Algebra (full lecture videos on OpenCourseWare)
- MIT 18.065 — Strang, Matrix Methods in Data Analysis, Signal Processing, and ML
- Stanford CS229 — Zico Kolter & Chuong Do, Linear Algebra Review and Reference (course notes, free)
- CMU 10-301/601 — Linear Algebra and Calculus refresher slides
- 3Blue1Brown — Essence of Linear Algebra (YouTube series, chapters 1–15)
Textbooks
- Strang — Introduction to Linear Algebra (5th ed.) — eigenvectors, SVD, and least squares from first principles
- Goodfellow, Bengio & Courville — Deep Learning, Ch. 2 (Linear Algebra) & Ch. 6 (Backprop) — free at deeplearningbook.org
- Boyd & Vandenberghe — Introduction to Applied Linear Algebra — free at vmls-book.stanford.edu
- Petersen & Pedersen — The Matrix Cookbook — free PDF, definitive gradient identity reference
Code / Repos
microsoft/LoRA— original LoRA implementation and GLUE / GPT-3 experimentshuggingface/peft— LoRA / QLoRA / DoRA in production;peft/tuners/lora/layer.pykarpathy/micrograd— manual autograd in 150 lines; see matrix calculus inbackward()
Blog Posts
- Jeremy Jordan — An Introduction to Matrix Calculus (jeremyjordan.me)
- Lilian Weng — Large Language Model: Long Context (lilianweng.github.io) — SVD in context compression
- Hugging Face Blog — Understanding LoRA and QLoRA
- Sebastian Raschka — Parameter-Efficient LLM Fine-tuning with LoRA (magazine.sebastianraschka.com)
7. Interview Prep
Questions ML / LLM interviewers actually ask in the linear-algebra round — at Anthropic, OpenAI, Google DeepMind, and Meta AI.
Q1. Why do we divide attention scores by √d_k? Derive it from first principles.
If q, k ~ N(0, I), each element of qᵀk is a product of two unit-normal random variables, so Var(qᵢkᵢ) = 1. The dot product qᵀk = ∑ᵢ qᵢkᵢ is a sum of d_k independent terms with Var = d_k, so Std = √d_k. Dividing by √d_k restores Std = 1. Without the scaling, large d_k pushes the dot products into regions where softmax saturates (all probability mass on one token, near-zero gradients everywhere else).
Q2. What is SVD and how does LoRA exploit it?
SVD factorizes any matrix A = UΣVᵀ (U, Vᵀ orthonormal; Σ diagonal with σ₁ ≥ σ₂ ≥ … ≥ 0). Truncating to the top-k singular values gives the optimal rank-k approximation in Frobenius norm (Eckart-Young theorem). LoRA exploits the empirical observation that weight update matrices ΔW during fine-tuning are low-rank: it parameterizes ΔW = BA with rank r ≪ d, reducing trainable parameters from d² to 2dr — often 100–1000×.
Q3. Derive ∂L/∂W for a single linear layer y = Wx (scalar loss L).
Each output yᵢ = ∑_j Wᵢⱼ xⱼ, so ∂yᵢ/∂Wᵢⱼ = xⱼ (and 0 for i′ ≠ i). By chain rule, ∂L/∂Wᵢⱼ = (∂L/∂yᵢ)(∂yᵢ/∂Wᵢⱼ) = (∂L/∂yᵢ) xⱼ. In matrix form this is the outer product (∂L/∂y) xᵀ — a rank-1 matrix of shape [d_out × d_in]. Accumulate over a batch of B samples: ∂L/∂W = (1/B) ∑_b (δ_b ⊗ x_b), a sum of B rank-1 matrices.
Q4. What is the condition number of a matrix and why does it matter for optimization?
κ(A) = σ_max / σ_min. A large κ means the loss landscape has vastly different curvatures in different directions — steep ravines. SGD oscillates in high-curvature directions and crawls in low-curvature ones. Adam`s per-parameter adaptive learning rate effectively normalizes by local curvature, making it far more robust to ill-conditioned Hessians than vanilla SGD.
Q5. What is the Frobenius norm and where does it appear in ML?
‖A‖_F = √(∑ᵢⱼ Aᵢⱼ²) = √(∑ σᵢ²). Appears in: (1) L2 weight decay — penalizes ‖W‖_F²; (2) gradient clipping — clips global ‖grad‖_F; (3) SVD truncation error — Eckart-Young minimizes ‖A − Aₖ‖_F; (4) LoRA fidelity — ‖W_fine - W_0 - BA‖_F measures adapter approximation quality; (5) spectral normalization — normalizes layers by σ_max = ‖W‖_2 ≤ ‖W‖_F.
Q6. How does the Jacobian explain vanishing gradients in deep RNNs?
In a T-step RNN, ∂L/∂h₁ = (∏ₜ Jₜ)(∂L/∂hₜ) where Jₜ = ∂hₜ/∂hₜ₋₁ is the per-step Jacobian. The product of T Jacobians raises their eigenvalues to the T-th power. If the spectral radius ρ(Jₜ) is less than 1, gradients shrink exponentially to zero (vanishing). If greater than 1, they explode. LSTM gating and the transformer`s residual connections both add an identity component to the Jacobian (Jₜ ≈ I + small), keeping ρ near 1 and enabling gradient flow over long sequences.
Q7. Describe PCA from a linear algebra perspective. Why is SVD preferred over eigendecomposition of XᵀX?
PCA of centered data X: eigendecompose covariance C = XᵀX/n = QΛQᵀ; principal directions are columns of Q. Equivalently, SVD X = UΣVᵀ gives the same result: columns of V are principal directions, σᵢ²/(n-1) = λᵢ. SVD is numerically preferred because forming XᵀX squares the condition number — if κ(X) = 10⁶, then κ(XᵀX) = 10¹², introducing catastrophic cancellation in floating-point arithmetic. SVD operates directly on X without squaring.
Q8. What does the residual stream view of a transformer mean in linear algebra terms?
Each token representation accumulates additive updates: x_out = x_in + Attn(x_in) + FFN(x_in + Attn(…)). The final representation is a sum of contributions from every layer. Each attention head writes a low-rank update W_O · Attn_head to the residual stream (W_O is the output projection, typically rank d_head). Because everything is additive, you can analyze each head`s contribution independently via SVD of its W_OW_V product — this is the foundation of mechanistic interpretability (Elhage et al. 2021 "A Mathematical Framework for Transformer Circuits").