Part II — Mathematical Foundations

§ 6 · Information Theory & Numerical Computing

Entropy, cross-entropy, KL divergence, perplexity — the language of LLM evaluation. Then floating-point formats (FP32 / BF16 / FP16 / FP8), numerical pitfalls (overflow in softmax, underflow in gradients), and the tricks that keep trillion-parameter training stable.

1. Overview

Two bodies of knowledge make LLM math tractable. Information theory provides the right vocabulary: cross-entropy is literally what we minimize, perplexity is what we report, and KL divergence appears in every alignment objective (PPO, DPO) and generative model (VAE, diffusion). Numerical computing explains why we use BF16 instead of FP16 for training, why softmax silently overflows on large attention logits, and how the log-sum-exp trick saves every forward pass from NaN. The diagram below shows where both subjects intersect the LLM training pipeline.

Why both in one section? They are inseparable in practice. Cross-entropy is both the information-theoretic loss (minimize KL between true and model distribution) and a numerical computation that naively overflows whenever logits exceed ~88 in FP32. Every LLM trainer depends on stable log-softmax to keep cross-entropy numerically valid.

2. Information Theory

Entropy, Cross-Entropy, KL, Mutual Information

All four measures derive from one idea: the optimal code length for an event with probability p is −log₂(p) bits. Assign short codes to likely events, long codes to rare ones. Entropy H(p) is the expected code length under the true distribution. Cross-entropy H(p,q) is the expected code length when you code using model q but data comes from p. The diagram below maps the relationships.

Quantity	Formula	LLM role
H(p)	−Σ p(x) log p(x)	Uncertainty of the true distribution. Lower bound on any lossless code length.
H(p, q)	−Σ p(x) log q(x)	The training loss. p = one-hot label, q = softmax output.
KL(p ‖ q)	H(p,q) − H(p)	PPO KL penalty, DPO derivation, VAE ELBO. Always ≥ 0. Asymmetric.
Perplexity	exp(H(p, q))	Standard LM metric. Average branching factor per token.
I(X; Y)	H(X) − H(X\|Y)	Contrastive objectives (CLIP), information bottleneck, representation eval.

Perplexity in Practice

For an LM with parameters θ evaluated on a held-out corpus of N tokens:

NLL = -1/N  Σᵢ log p_θ(xᵢ | x₁…xᵢ₋₁)    # avg negative log-likelihood per token
PPL = exp(NLL)                              # perplexity = geometric mean inverse prob

PPL = 1     → model assigns prob 1 to every token (perfect prediction)
PPL = V     → model is uniform over vocabulary size V (knows nothing)

GPT-2 small (117M) on Penn Treebank:   ~35 PPL
GPT-2 XL    (1.5B) on Penn Treebank:   ~19 PPL
LLaMA-3 8B         on code benchmarks: ~5-8 PPL

Warning: PPL is only comparable on the exact same tokenizer + test split.
A larger vocabulary can produce lower PPL without being "smarter" — each
token covers more characters, so per-token log-prob is naturally higher.
Always report (model, tokenizer, eval dataset) together.

KL Divergence — Asymmetry Matters

KL(p ‖ q) penalizes regions where p > 0 but q ≈ 0 (model assigns low probability to true events). KL(q ‖ p) penalizes regions where q > 0 but p ≈ 0 (model wastes probability on impossible events). This asymmetry is intentional in alignment:

PPO objective:  max_π  E[reward(x)] - β · KL(π ‖ π_ref)
               ↑ maximize reward   ↑ stay close to reference policy

Why KL(π ‖ π_ref) and not KL(π_ref ‖ π)?
  KL(π ‖ π_ref) → ∞ whenever π assigns high prob to text that π_ref finds impossible.
  This strongly prevents the policy from drifting to incoherent token sequences,
  while still allowing it to up-weight rewarded completions.

DPO eliminates the explicit KL term but implicitly optimizes the same quantity
via the closed-form solution of the PPO-KL constrained optimization problem.

JS divergence = 0.5*KL(p||m) + 0.5*KL(q||m)  where m = 0.5*(p+q)
  Symmetric. Bounded in [0, log 2]. Used in GANs and some alignment objectives.

3. Floating-Point Formats

Bit Layouts: FP32 → BF16 → FP16 → FP8

Every IEEE 754 float is stored as (−1)^sign × 2^(exponent − bias) × (1 + mantissa / 2^M). The number of exponent bits sets the dynamic range (how large or small values can be); the number of mantissa bits sets precision (how many significant digits). The critical insight: BF16 keeps FP32's 8-bit exponent and just truncates the mantissa — so it has the same dynamic range with lower precision.

Format	Bits	S / Exp / Man	Max value	Machine ε	Use
FP32	32	1 / 8 / 23	3.4×10³⁸	1.2×10⁻⁷	Master weights, Adam optimizer state (m and v buffers)
TF32	19 (stored as 32)	1 / 8 / 10	3.4×10³⁸	9.8×10⁻⁴	A100 internal tensor-core accumulation (transparent to user)
BF16	16	1 / 8 / 7	3.4×10³⁸	7.8×10⁻³	Standard for LLM training — weights + activations
FP16	16	1 / 5 / 10	65504	9.8×10⁻⁴	Overflow-prone; requires loss scaling. Mostly replaced by BF16.
FP8 E4M3	8	1 / 4 / 3	448	0.125	H100 forward pass GEMM — weights + activations
FP8 E5M2	8	1 / 5 / 2	57344	0.25	H100 backward pass GEMM — gradients (wider range needed)
INT8	8	integer	127 (signed)	—	Post-training quantization: SmoothQuant, GPTQ activation quant
INT4	4	integer	7 (signed)	—	Weight-only quant: AWQ, GGUF; consumer GPU inference

Why BF16 beats FP16 for training. Both use 16 bits, but BF16 keeps FP32’s 8-bit exponent → same dynamic range (±3.4×10³⁸). FP16’s 5-bit exponent limits it to ±65504. Gradient norms during LLM training routinely exceed 65504 in early steps → silent FP16 overflow to inf/NaN → corrupted optimizer state. With BF16 you skip loss-scaling entirely. The cost: BF16 has only 7 mantissa bits vs FP16’s 10 (lower precision per-step) — but LLM training is empirically robust to this.

4. Core Mechanism: Stable Softmax & Log-Sum-Exp

Background: Softmax computes exp(xᵢ) / Σⱼ exp(xⱼ). For a sequence length T=4096 and head dim dₖ=128, pre-softmax attention scores Q@Kᵀ/√dₖ can reach values of 20–50 for well-trained models. At score 88, FP32 exp() overflows to infinity. The naive formula returns NaN, and NaN gradients corrupt the entire training run silently.

Plan:

Compute m = max(xᵢ) — the largest logit in the vector
Shift all logits: yᵢ = xᵢ − m, so the largest element becomes 0
Compute exp(yᵢ): all values are in (0, 1], no overflow possible
Divide by Σ exp(yᵢ) — denominator is safe, all terms ≤ 1
Prove equivalence: the constant exp(−m) cancels in numerator and denominator

Walkthrough — logits = [1000, 1001, 1002]:

─── Naive (broken) ─────────────────────────────────────────────────────────────
Step 1: exp([1000, 1001, 1002])
        = [inf, inf, inf]             (FP32 saturates at ~1.8e308; exp(710) ≈ inf)
Step 2: inf / (inf + inf + inf)
        = NaN                         (IEEE 754: 0/0 or ∞/∞ → NaN)
        → cross_entropy = -log(NaN) = NaN → gradients NaN → optimizer corrupted

─── Stable (correct) ───────────────────────────────────────────────────────────
Step 1: m = max([1000, 1001, 1002]) = 1002
Step 2: y = x - m = [-2, -1, 0]
Step 3: exp(y) = [0.1353, 0.3679, 1.0000]    all in (0, 1] → safe
Step 4: sum = 1.5032
Step 5: softmax = [0.0900, 0.2447, 0.6652]   ← correct

─── Log-sum-exp (for cross-entropy) ────────────────────────────────────────────
log(Σ exp(xᵢ)) = m + log(Σ exp(xᵢ - m))    # m cancels; RHS is safe
               = 1002 + log(1.5032)
               = 1002 + 0.4076 = 1002.408

log_softmax[i] = xᵢ - log(Σ exp(xⱼ))       # stable: subtract scalar, no exp of large value
               = [1000, 1001, 1002] - 1002.408
               = [-2.408, -1.408, -0.408]

cross_entropy (target = index 2) = -log_softmax[2] = 0.408    ← correct

Mixed Precision Training & Loss Scaling

FP16 underflows (rounds to 0) for values below ~6×10⁻⁸. Gradient values in deep networks are often this small. Loss scaling multiplies the loss by a large scalar before backward pass, rescales gradients back before the optimizer step, and reduces the scale factor when inf/NaN appears:

# PyTorch GradScaler — dynamic loss scaling for FP16 training
scale = 32768.0                          # initial scale (2^15); doubled slowly

loss_scaled = loss * scale               # amplify before backward
loss_scaled.backward()                   # all gradients are ~true_grads * scale

if any(inf or nan in grads):
    scale /= 2.0                         # back off if overflow detected
    optimizer.zero_grad(); continue      # skip this step

grads /= scale                           # restore true gradient magnitude
clip_grad_norm_(params, max_norm=1.0)    # clip AFTER unscaling
optimizer.step()
scale = min(scale * 2.0, 65536.0)       # slowly grow scale back up

# ─── With BF16 ───────────────────────────────────────────────────────────────
# Loss scaling is NOT needed: BF16 max = 3.4e38 = same as FP32.
# torch.autocast(device_type='cuda', dtype=torch.bfloat16) is the modern recipe.

# ─── With FP8 (H100, TransformerEngine) ──────────────────────────────────────
# Per-tensor dynamic scaling: scale_factor = amax(tensor) / fp8_max.
# TransformerEngine DelayedScaling computes amax from recent steps.

5. Minimal Demo

Demo A — Float Format Inspector

Enter any float to see its IEEE 754 bit decomposition in FP32, BF16, and FP16. Try 70000 to see FP16 overflow. Try 3.14 to compare mantissa precision between formats. Try 1e38 to see how BF16 handles FP32’s full range.

Float Format Inspector — FP32 / BF16 / FP16 — C Demo

stdin (optional)

Demo B — Stable Softmax & Log-Sum-Exp

Enter space-separated logits to see naive vs stable softmax side by side. Try 1000 1001 1002 to trigger overflow in naive. Try 1.0 2.0 3.0 to confirm both agree on small inputs.

Stable Softmax & Log-Sum-Exp — C Demo

stdin (optional)

6. Production / Source Pointers

Concept	Source file	Key function / note
Stable log-softmax	pytorch/aten/src/ATen/native/SoftMax.cpp	`log_softmax_lastdim_kernel`
Stable cross-entropy	torch/nn/functional.py	`F.cross_entropy` → `nll_loss + log_softmax`
FP16 loss scaling	torch/cuda/amp/grad_scaler.py	`GradScaler.scale / unscale_ / step / update`
BF16 autocast	torch/amp/autocast_mode.py	`torch.autocast(dtype=torch.bfloat16)`
FP8 mixed precision	NVIDIA/TransformerEngine	`te.Linear`; auto per-tensor scaling via `DelayedScaling`
Online softmax (FA)	Dao-AILab/flash-attention (csrc/flash_attn/)	Tile-by-tile max tracking — covers the same stable principle; detailed in §XIV

7. References

Papers

Micikevicius et al. (2017) — Mixed Precision Training (arXiv:1710.03740) — FP16 + loss scaling for deep learning
Micikevicius et al. (2022) — FP8 Formats for Deep Learning (arXiv:2209.05433) — E4M3/E5M2 and H100 FP8 training recipe
Dao et al. (2022) — FlashAttention (arXiv:2205.14135) — online softmax for IO-efficient attention
Shannon (1948) — A Mathematical Theory of Communication — Bell System Technical Journal — foundational information theory
Kullback & Leibler (1951) — On Information and Sufficiency — Annals of Mathematical Statistics
Ouyang et al. (2022) — Training Language Models to Follow Instructions with Human Feedback (InstructGPT, arXiv:2203.02155) — PPO KL penalty
Rafailov et al. (2023) — Direct Preference Optimization (arXiv:2305.18290) — KL in DPO derivation
Goldberg (1991) — What Every Computer Scientist Should Know About Floating-Point Arithmetic — ACM Computing Surveys

Lectures

MIT 6.S191 (2024) — Lecture 1: Intro to Deep Learning — softmax, cross-entropy loss derivation
Stanford CS229 (2023) — Lecture 2: Supervised Learning — MLE, cross-entropy, logistic regression
Stanford CS224n (2024) — Lecture 2: Language Models — perplexity definition and evaluation
CMU 10-708 PGM (2023) — KL divergence, variational inference, ELBO
NVIDIA GTC 2022 — FP8 Training with Transformer Engine — practical H100 FP8 recipe
David MacKay — Cambridge Information Theory lectures — entropy, mutual information, source coding

Textbooks

MacKay — Information Theory, Inference, and Learning Algorithms (free PDF) — Chapters 2–8
Cover & Thomas — Elements of Information Theory (2nd ed.) — the standard graduate reference
Bishop — Pattern Recognition and Machine Learning — Chapter 1.6 (information theory)
Murphy — Probabilistic Machine Learning Vol 1 (2022, free draft) — Chapter 2, 5
Goodfellow, Bengio & Courville — Deep Learning (2016) — Chapter 3.13 (information theory)

Code / Repos

NVIDIA/TransformerEngine — FP8 training on H100 with automatic per-tensor scaling
pytorch/pytorch — torch.cuda.amp.GradScaler for FP16 training
Dao-AILab/flash-attention — online softmax (csrc/flash_attn/src/flash_fwd_kernel.h)
karpathy/nanoGPT — BF16 training with torch.autocast

Blog Posts

Hugging Face — A Visual Guide to Quantization — FP8 / INT8 / INT4 tradeoffs in practice
Lilian Weng — The Math Behind RLHF — KL divergence in PPO and DPO
Christopher Olah — Visual Information Theory — distill.pub-style entropy and KL visualization
Tim Dettmers — A Gentle Introduction to 8-bit Matrix Multiplication for Transformers
PyTorch blog — Automatic Mixed Precision examples — practical BF16 / FP16 usage

8. Interview Prep

Q1. Derive the gradient of softmax cross-entropy. What does it simplify to?

L = −Σ yᵢ log ŷᵢ where ŷ = softmax(z). ∂ŷⱼ/∂zᵢ = ŷⱼ(δᵢⱼ − ŷᵢ). Chain rule: ∂L/∂zᵢ = −Σⱼ yⱼ(∂log ŷⱼ/∂zᵢ) = −Σⱼ yⱼ(δᵢⱼ − ŷᵢ) = −yᵢ + ŷᵢ Σⱼ yⱼ = ŷᵢ − yᵢ (using Σyⱼ=1). Result: ∂L/∂z = ŷ − y — the prediction minus the one-hot label. Elegant and cheap to compute.

Q2. Why is BF16 preferred over FP16 for LLM training?

BF16 keeps FP32's 8-bit exponent giving dynamic range ±3.4×10³⁸. FP16 uses only 5 exponent bits → max ±65504. Gradient norms during early LLM training easily exceed 65504, causing FP16 to silently overflow to inf/NaN. With BF16, loss scaling is unnecessary. The cost is fewer mantissa bits (7 vs 10), but LLM training is empirically robust to this lower precision.

Q3. Explain the log-sum-exp trick and prove it gives the same result as naive softmax.

log Σ exp(xᵢ) = m + log Σ exp(xᵢ − m) where m = max(xᵢ). Proof: Σ exp(xᵢ − m) = Σ exp(xᵢ)/exp(m), so log(Σ exp(xᵢ)/exp(m)) = log(Σ exp(xᵢ)) − m, adding m gives identity. Shifting by max ensures all exp arguments ≤ 0 → no overflow. Used in F.cross_entropy and FlashAttention's tile-wise online softmax.

Q4. Compute the perplexity if the average NLL loss is 2.3 nats.

PPL = exp(NLL) = exp(2.3) ≈ 9.97 ≈ 10. Cross-check in bits: 2.3 nats ÷ ln(2) ≈ 3.32 bits/token, so the model effectively chooses among ~2^3.32 ≈ 10 equally-likely next tokens per position.

Q5. What is the difference between KL(p‖q) and KL(q‖p)? When does it matter?

KL(p‖q) → ∞ when p > 0 and q ≈ 0 (model misses true events). KL(q‖p) → ∞ when q > 0 and p = 0 (model wastes mass on impossible events). In RLHF: KL(π‖π_ref) in PPO prevents the policy from assigning probability to incoherent sequences. In variational inference: using reverse KL(q‖p) produces mode-seeking behavior, collapsing q to one posterior mode.

Q6. How does FP8 training work on H100, and what are the two FP8 formats?

H100 tensor cores support FP8 matrix multiply with FP32 accumulation. E4M3 (max 448) is used for forward-pass weights and activations where precision matters. E5M2 (max 57344) is used for backward-pass gradients where wider range is needed. NVIDIA TransformerEngine maintains per-tensor scale factors (DelayedScaling), mapping each tensor's max value into the FP8 representable range each step.

Q7. What is perplexity and what makes two perplexity numbers incomparable?

PPL = exp(−1/N Σ log p(xᵢ|x<ᵢ)) — the geometric mean of per-token inverse probability, or the average branching factor. Two PPL numbers are incomparable if they differ in: (1) tokenizer — larger vocab gives higher per-token probability by covering more characters per token; (2) test set — domain matters; (3) context window — truncating context hurts long-document models. Always report all three.

Q8. Why does PyTorch's F.cross_entropy use log_softmax internally rather than softmax + log?

log(softmax(x)ᵢ) = xᵢ − log Σ exp(xⱼ). The second term is log_sum_exp, computed stably as m + log Σ exp(xⱼ − m). Computing softmax first (exp then divide) and then log loses precision: small softmax outputs rounded to 0 in low-precision formats give log(0) = −∞. Fused log_softmax avoids materializing the softmax distribution and uses log_sum_exp directly — one fewer round-trip through low precision.