Part II — Mathematical Foundations

§ 3 · Linear Algebra for ML

Vectors, matrices, eigendecomposition, SVD, and matrix calculus — the language every transformer operation is written in.

1. Overview

Every LLM forward pass is a cascade of matrix operations. Token embeddings are row lookups in a V × d_model weight matrix. Attention reduces to batched matrix products QKᵀ / √d_k. The feed-forward layer is two back-to-back linear maps with a nonlinearity between them. Gradient descent updates each weight via the chain rule, which in matrix form becomes outer-product accumulation. LoRA fine-tuning works because weight-update matrices are empirically low-rank — a direct application of SVD. Fluency in these four pillars — vector algebra, matrix algebra, SVD, and matrix calculus — is the entry ticket for reading LLM papers without stopping at every paragraph.

2. Key Concepts

Vectors and Norms

A vector x ∈ ℝⁿ is a column of scalars. Three norms appear throughout ML:

Norm	Formula	Used for in ML
L1	∑ \|xᵢ\|	sparsity (Lasso regularization)
L2	√(∑ xᵢ²)	Euclidean distance, weight decay, gradient clipping
L∞	max \|xᵢ\|	quantization range, clipping threshold

Cosine similarity: cos(u, v) = (u·v) / (‖u‖₂ · ‖v‖₂) — measures directional alignment regardless of magnitude. Used in embedding nearest-neighbor search, attention pattern analysis, and evaluating whether two token representations encode the same concept.

Matrix Multiplication and Shape Algebra

The inner dimensions must match: [M×K] @ [K×N] → [M×N]. Each output element is a dot product of K values, so total FLOPs = 2MKN (one multiply and one add per pair). PyTorch`s @ operator broadcasts over arbitrary leading batch dimensions, making attention`s[B,H,T,dk] @ [B,H,dk,T] a single call.

Operation	PyTorch call	Shape	FLOPs
Linear layer	F.linear(x, W, b)	[B,T,D] → [B,T,D_out]	2·B·T·D·D_out
Attention scores	Q @ K.transpose(-2,-1)	[B,H,T,dk] → [B,H,T,T]	2·B·H·T²·dk
Attention output	softmax(scores) @ V	[B,H,T,T]@[B,H,T,dk]	2·B·H·T²·dk
FFN (d_ff = 4D)	x@W1.T ; h@W2.T	[B,T,D]→[B,T,4D]→[B,T,D]	2·8·B·T·D² = 16BTD²

Rank, Trace, and Frobenius Norm

The rank of a matrix is the dimension of its column space — the number of linearly independent columns (= rows). A full-rank weight matrix has rank = min(rows, cols). Empirically, weight updates during fine-tuning concentrate in a low-dimensional subspace (rank ≪ full), which is the empirical foundation of LoRA.

The trace tr(A) = ∑ Aᵢᵢ = ∑ λᵢ is the sum of diagonal entries, equal to the sum of eigenvalues. It appears in Fisher information and nuclear-norm regularization.

The Frobenius norm ‖A‖_F = √(∑ᵢⱼ Aᵢⱼ²) = √(tr(AᵀA)) = √(∑ σᵢ²) is the most common matrix norm in DL: L2 weight decay penalizes it, reconstruction error in SVD truncation measures it, and gradient clipping uses it as the global norm.

Eigendecomposition

For a square symmetric matrix A, the eigendecomposition is A = QΛQᵀ, where columns of Q are orthonormal eigenvectors and Λ is diagonal with eigenvalues λ₁ ≥ λ₂ ≥ … . The eigenvalue λᵢ tells you how much A stretches a vector in the direction of eigenvector qᵢ. A matrix is positive semi-definite (PSD) iff all λᵢ ≥ 0.

PCA via eigendecomposition. Given centered data X ∈ ℝ^(n×d), covariance C = XᵀX/n. Its eigenvectors are the principal components (directions of maximum variance). Project X onto the top-k eigenvectors to reduce dimensionality — used in attention head analysis, activation steering, and probing classifiers for mechanistic interpretability.

Matrix Calculus

Backpropagation is matrix calculus repeated through each layer. For the linear layer y = Wx + bthe three key gradients are:

∂L/∂W  =  (∂L/∂y) ⊗ xᵀ        shape: [d_out × d_in]   ← outer product of two vectors
∂L/∂x  =  Wᵀ · (∂L/∂y)         shape: [d_in]            ← backward pass through W
∂L/∂b  =  ∂L/∂y                 shape: [d_out]

chain rule across layers:  ∂L/∂xˡ = (Wˡ)ᵀ · δˡ   where δˡ = ∂L/∂yˡ

Each per-sample gradient ∂L/∂W is a rank-1 matrix (outer product of two vectors). Accumulated over a mini-batch of B samples, it becomes a sum of B rank-1 matrices. In practice the accumulated gradient still lies in a low-rank subspace relative to the full d_out×d_in weight matrix — this empirical fact directly motivates LoRA.

3. Core Mechanism: SVD and Low-Rank Approximation

SVD is the most powerful matrix decomposition. Unlike eigendecomposition it works on any matrix — rectangular or square, symmetric or not — and it is the mathematical foundation of PCA, pseudo-inverse computation, matrix compression, and LoRA.

Background

A pre-trained weight matrix W ∈ ℝ^(d×d) has ~10⁸ parameters for d = 10,000. Fine-tuning all of them is expensive and often leads to catastrophic forgetting. Aghajanyan et al. (2020) showed that the weight changes ΔW during fine-tuning have an intrinsic dimension far smaller than d × d — they concentrate in a subspace of rank r ≈ 4–64. If we parameterize ΔW = B·A with B ∈ ℝ^(d×r) and A ∈ ℝ^(r×d), trainable parameters drop from d² ≈ 10⁸ to 2dr ≈ 10⁵ — a 1000× reduction. The Eckart-Young theorem (the heart of SVD truncation) says this is the best possible rank-r approximation in Frobenius norm.

Plan

Compute full SVD: A = UΣVᵀ, where U ∈ ℝ^(m×m) is orthonormal, Σ ∈ ℝ^(m×n) is diagonal with σ₁ ≥ σ₂ ≥ … ≥ 0, and Vᵀ ∈ ℝ^(n×n) is orthonormal.
Truncate: keep only the top-k singular triplets (uᵢ, σᵢ, vᵢ). The Eckart-Young theorem guarantees Aₖ = UₖΣₖVₖᵀ minimizes ‖A − Aₖ‖_F over all rank-k matrices.
Reconstruct: Aₖ = Uₖ ∈ ℝ^(m×k) · diag(σ₁,…,σₖ) · Vₖᵀ ∈ ℝ^(k×n).
Store as a LoRA pair: B = Uₖ · √Σₖ ∈ ℝ^(m×k) and A = √Σₖ · Vₖᵀ ∈ ℝ^(k×n), so B·A = Aₖ exactly.

Walkthrough — 4×4 matrix at rank 2

Initial conditions: A is a 4×4 matrix (e.g. d_in = d_out = 4) with singular values σ = [5.2, 3.1, 0.4, 0.1] and we target rank k = 2.

Step 1 — Full SVD
  A (4×4) = U (4×4) · diag(5.2, 3.1, 0.4, 0.1) · Vᵀ (4×4)

Step 2 — Truncate to k=2
  Uₖ  = U[:, :2]    shape: [4,2]
  Σₖ  = diag(5.2, 3.1)
  Vₖᵀ = Vᵀ[:2, :]   shape: [2,4]

Step 3 — Reconstruction quality
  Explained variance:
    (5.2² + 3.1²) / (5.2² + 3.1² + 0.4² + 0.1²)
    = (27.04 + 9.61) / (27.04 + 9.61 + 0.16 + 0.01)
    = 36.65 / 36.82  ≈ 99.5%

Step 4 — LoRA parametrisation
  B = Uₖ · √Σₖ    shape: [4,2]   (8 params)
  A = √Σₖ · Vₖᵀ   shape: [2,4]   (8 params)
  B @ A = Aₖ  ✓   total: 16 params vs 16 (trivial here at d=4)

  At real scale d=4096, r=8:
    full rank: 4096² = 16,777,216 params
    LoRA:      8 × (4096+4096) = 65,536 params  →  256× compression

LoRA Initialization in Practice

LoRA does not initialize A and B from an explicit SVD of the original W. Instead it initializes A with random Gaussian noise and B with zeros — ensuring ΔW = B·A = 0 at training start so the model starts from the exact pretrained checkpoint. The low-rank structure emerges naturally as gradient descent updates A and B: the gradient of L with respect to B is a rank-r outer product, maintaining the low-rank bias throughout training.

4. Minimal Demos

Demo 1 — Matrix Shape Playground

Enter B T D H (batch size, sequence length, model dimension, number of heads). The demo computes output shapes and FLOPs for projection, multi-head attention scores, attention output, and FFN — making abstract shape algebra tangible.

Matrix Shape Playground — C Demo

stdin (optional)

Demo 2 — SVD Truncation Explorer

Enter rows cols rank_k. The demo runs full SVD on a random matrix, builds the rank-k approximation, and reports reconstruction error, explained variance, and storage savings. Modify rank_k from 1 to min(rows,cols) to see the quality-compression tradeoff at the heart of LoRA.

SVD Truncation Explorer — C Demo

stdin (optional)

5. Production & Source Pointers

Operation	PyTorch symbol	Source / backend
Dense linear layer	torch.nn.functional.linear	torch/nn/functional.py → cuBLAS gemm
Batched matmul	torch.bmm / torch.matmul	aten/src/ATen/native/LinearAlgebra.cpp
SVD	torch.linalg.svd	→ cuSOLVER gesvd (GPU) / LAPACK dgesdd (CPU)
Symmetric eigendecomp	torch.linalg.eigh	→ cuSOLVER dsyevd (GPU) / LAPACK dsyevd (CPU)
LoRA adapter layer	peft.tuners.lora.Linear	peft/tuners/lora/layer.py
Gradient (autograd)	loss.backward()	torch/autograd/engine.py (C++ AccumulateGrad)

Under the hood: torch.linalg.svd dispatches to cusolverDnSgesvdj (Jacobi SVD on GPU, faster for small matrices) or cusolverDnSgesvd (divide-and-conquer for large ones). Computing SVD of a full d×d weight matrix is O(d³) — prohibitive for d = 4096+ at every step. LoRA avoids it entirely by directly parameterizing B and A and letting SGD find the optimal low-rank decomposition.

6. References

Papers

Hu et al. 2021 — LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685)
Aghajanyan et al. 2020 — Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning (arXiv:2012.13255)
Vaswani et al. 2017 — Attention is All You Need (arXiv:1706.03762) — √d_k scaling derivation
Liu et al. 2024 — DoRA: Weight-Decomposed Low-Rank Adaptation (arXiv:2402.09353)
Zhao et al. 2024 — GaLore: Memory-Efficient LLM Training via Gradient Low-Rank Projection (arXiv:2403.03507)

Lectures

MIT 18.06 — Gilbert Strang, Linear Algebra (full lecture videos on OpenCourseWare)
MIT 18.065 — Strang, Matrix Methods in Data Analysis, Signal Processing, and ML
Stanford CS229 — Zico Kolter & Chuong Do, Linear Algebra Review and Reference (course notes, free)
CMU 10-301/601 — Linear Algebra and Calculus refresher slides
3Blue1Brown — Essence of Linear Algebra (YouTube series, chapters 1–15)

Textbooks

Strang — Introduction to Linear Algebra (5th ed.) — eigenvectors, SVD, and least squares from first principles
Goodfellow, Bengio & Courville — Deep Learning, Ch. 2 (Linear Algebra) & Ch. 6 (Backprop) — free at deeplearningbook.org
Boyd & Vandenberghe — Introduction to Applied Linear Algebra — free at vmls-book.stanford.edu
Petersen & Pedersen — The Matrix Cookbook — free PDF, definitive gradient identity reference

Code / Repos

microsoft/LoRA — original LoRA implementation and GLUE / GPT-3 experiments
huggingface/peft — LoRA / QLoRA / DoRA in production; peft/tuners/lora/layer.py
karpathy/micrograd — manual autograd in 150 lines; see matrix calculus in backward()

Blog Posts

Jeremy Jordan — An Introduction to Matrix Calculus (jeremyjordan.me)
Lilian Weng — Large Language Model: Long Context (lilianweng.github.io) — SVD in context compression
Hugging Face Blog — Understanding LoRA and QLoRA
Sebastian Raschka — Parameter-Efficient LLM Fine-tuning with LoRA (magazine.sebastianraschka.com)

7. Interview Prep

Questions ML / LLM interviewers actually ask in the linear-algebra round — at Anthropic, OpenAI, Google DeepMind, and Meta AI.

Q1. Why do we divide attention scores by √d_k? Derive it from first principles.

If q, k ~ N(0, I), each element of qᵀk is a product of two unit-normal random variables, so Var(qᵢkᵢ) = 1. The dot product qᵀk = ∑ᵢ qᵢkᵢ is a sum of d_k independent terms with Var = d_k, so Std = √d_k. Dividing by √d_k restores Std = 1. Without the scaling, large d_k pushes the dot products into regions where softmax saturates (all probability mass on one token, near-zero gradients everywhere else).

Q2. What is SVD and how does LoRA exploit it?

SVD factorizes any matrix A = UΣVᵀ (U, Vᵀ orthonormal; Σ diagonal with σ₁ ≥ σ₂ ≥ … ≥ 0). Truncating to the top-k singular values gives the optimal rank-k approximation in Frobenius norm (Eckart-Young theorem). LoRA exploits the empirical observation that weight update matrices ΔW during fine-tuning are low-rank: it parameterizes ΔW = BA with rank r ≪ d, reducing trainable parameters from d² to 2dr — often 100–1000×.

Q3. Derive ∂L/∂W for a single linear layer y = Wx (scalar loss L).

Each output yᵢ = ∑_j Wᵢⱼ xⱼ, so ∂yᵢ/∂Wᵢⱼ = xⱼ (and 0 for i′ ≠ i). By chain rule, ∂L/∂Wᵢⱼ = (∂L/∂yᵢ)(∂yᵢ/∂Wᵢⱼ) = (∂L/∂yᵢ) xⱼ. In matrix form this is the outer product (∂L/∂y) xᵀ — a rank-1 matrix of shape [d_out × d_in]. Accumulate over a batch of B samples: ∂L/∂W = (1/B) ∑_b (δ_b ⊗ x_b), a sum of B rank-1 matrices.

Q4. What is the condition number of a matrix and why does it matter for optimization?

κ(A) = σ_max / σ_min. A large κ means the loss landscape has vastly different curvatures in different directions — steep ravines. SGD oscillates in high-curvature directions and crawls in low-curvature ones. Adam`s per-parameter adaptive learning rate effectively normalizes by local curvature, making it far more robust to ill-conditioned Hessians than vanilla SGD.

Q5. What is the Frobenius norm and where does it appear in ML?

‖A‖_F = √(∑ᵢⱼ Aᵢⱼ²) = √(∑ σᵢ²). Appears in: (1) L2 weight decay — penalizes ‖W‖_F²; (2) gradient clipping — clips global ‖grad‖_F; (3) SVD truncation error — Eckart-Young minimizes ‖A − Aₖ‖_F; (4) LoRA fidelity — ‖W_fine - W_0 - BA‖_F measures adapter approximation quality; (5) spectral normalization — normalizes layers by σ_max = ‖W‖_2 ≤ ‖W‖_F.

Q6. How does the Jacobian explain vanishing gradients in deep RNNs?

In a T-step RNN, ∂L/∂h₁ = (∏ₜ Jₜ)(∂L/∂hₜ) where Jₜ = ∂hₜ/∂hₜ₋₁ is the per-step Jacobian. The product of T Jacobians raises their eigenvalues to the T-th power. If the spectral radius ρ(Jₜ) is less than 1, gradients shrink exponentially to zero (vanishing). If greater than 1, they explode. LSTM gating and the transformer`s residual connections both add an identity component to the Jacobian (Jₜ ≈ I + small), keeping ρ near 1 and enabling gradient flow over long sequences.

Q7. Describe PCA from a linear algebra perspective. Why is SVD preferred over eigendecomposition of XᵀX?

PCA of centered data X: eigendecompose covariance C = XᵀX/n = QΛQᵀ; principal directions are columns of Q. Equivalently, SVD X = UΣVᵀ gives the same result: columns of V are principal directions, σᵢ²/(n-1) = λᵢ. SVD is numerically preferred because forming XᵀX squares the condition number — if κ(X) = 10⁶, then κ(XᵀX) = 10¹², introducing catastrophic cancellation in floating-point arithmetic. SVD operates directly on X without squaring.

Q8. What does the residual stream view of a transformer mean in linear algebra terms?

Each token representation accumulates additive updates: x_out = x_in + Attn(x_in) + FFN(x_in + Attn(…)). The final representation is a sum of contributions from every layer. Each attention head writes a low-rank update W_O · Attn_head to the residual stream (W_O is the output projection, typically rank d_head). Because everything is additive, you can analyze each head`s contribution independently via SVD of its W_OW_V product — this is the foundation of mechanistic interpretability (Elhage et al. 2021 "A Mathematical Framework for Transformer Circuits").