§ 6 · Information Theory & Numerical Computing
Entropy, cross-entropy, KL divergence, perplexity — the language of LLM evaluation. Then floating-point formats (FP32 / BF16 / FP16 / FP8), numerical pitfalls (overflow in softmax, underflow in gradients), and the tricks that keep trillion-parameter training stable.
1. Overview
Two bodies of knowledge make LLM math tractable. Information theory provides the right vocabulary: cross-entropy is literally what we minimize, perplexity is what we report, and KL divergence appears in every alignment objective (PPO, DPO) and generative model (VAE, diffusion). Numerical computing explains why we use BF16 instead of FP16 for training, why softmax silently overflows on large attention logits, and how the log-sum-exp trick saves every forward pass from NaN. The diagram below shows where both subjects intersect the LLM training pipeline.
2. Information Theory
Entropy, Cross-Entropy, KL, Mutual Information
All four measures derive from one idea: the optimal code length for an event with probability p is −log₂(p) bits. Assign short codes to likely events, long codes to rare ones. Entropy H(p) is the expected code length under the true distribution. Cross-entropy H(p,q) is the expected code length when you code using model q but data comes from p. The diagram below maps the relationships.
| Quantity | Formula | LLM role |
|---|---|---|
| H(p) | −Σ p(x) log p(x) | Uncertainty of the true distribution. Lower bound on any lossless code length. |
| H(p, q) | −Σ p(x) log q(x) | The training loss. p = one-hot label, q = softmax output. |
| KL(p ‖ q) | H(p,q) − H(p) | PPO KL penalty, DPO derivation, VAE ELBO. Always ≥ 0. Asymmetric. |
| Perplexity | exp(H(p, q)) | Standard LM metric. Average branching factor per token. |
| I(X; Y) | H(X) − H(X|Y) | Contrastive objectives (CLIP), information bottleneck, representation eval. |
Perplexity in Practice
For an LM with parameters θ evaluated on a held-out corpus of N tokens:
NLL = -1/N Σᵢ log p_θ(xᵢ | x₁…xᵢ₋₁) # avg negative log-likelihood per token PPL = exp(NLL) # perplexity = geometric mean inverse prob PPL = 1 → model assigns prob 1 to every token (perfect prediction) PPL = V → model is uniform over vocabulary size V (knows nothing) GPT-2 small (117M) on Penn Treebank: ~35 PPL GPT-2 XL (1.5B) on Penn Treebank: ~19 PPL LLaMA-3 8B on code benchmarks: ~5-8 PPL Warning: PPL is only comparable on the exact same tokenizer + test split. A larger vocabulary can produce lower PPL without being "smarter" — each token covers more characters, so per-token log-prob is naturally higher. Always report (model, tokenizer, eval dataset) together.
KL Divergence — Asymmetry Matters
KL(p ‖ q) penalizes regions where p > 0 but q ≈ 0 (model assigns low probability to true events). KL(q ‖ p) penalizes regions where q > 0 but p ≈ 0 (model wastes probability on impossible events). This asymmetry is intentional in alignment:
PPO objective: max_π E[reward(x)] - β · KL(π ‖ π_ref)
↑ maximize reward ↑ stay close to reference policy
Why KL(π ‖ π_ref) and not KL(π_ref ‖ π)?
KL(π ‖ π_ref) → ∞ whenever π assigns high prob to text that π_ref finds impossible.
This strongly prevents the policy from drifting to incoherent token sequences,
while still allowing it to up-weight rewarded completions.
DPO eliminates the explicit KL term but implicitly optimizes the same quantity
via the closed-form solution of the PPO-KL constrained optimization problem.
JS divergence = 0.5*KL(p||m) + 0.5*KL(q||m) where m = 0.5*(p+q)
Symmetric. Bounded in [0, log 2]. Used in GANs and some alignment objectives.3. Floating-Point Formats
Bit Layouts: FP32 → BF16 → FP16 → FP8
Every IEEE 754 float is stored as (−1)^sign × 2^(exponent − bias) × (1 + mantissa / 2^M). The number of exponent bits sets the dynamic range (how large or small values can be); the number of mantissa bits sets precision (how many significant digits). The critical insight: BF16 keeps FP32's 8-bit exponent and just truncates the mantissa — so it has the same dynamic range with lower precision.
| Format | Bits | S / Exp / Man | Max value | Machine ε | Use |
|---|---|---|---|---|---|
| FP32 | 32 | 1 / 8 / 23 | 3.4×10³⁸ | 1.2×10⁻⁷ | Master weights, Adam optimizer state (m and v buffers) |
| TF32 | 19 (stored as 32) | 1 / 8 / 10 | 3.4×10³⁸ | 9.8×10⁻⁴ | A100 internal tensor-core accumulation (transparent to user) |
| BF16 | 16 | 1 / 8 / 7 | 3.4×10³⁸ | 7.8×10⁻³ | Standard for LLM training — weights + activations |
| FP16 | 16 | 1 / 5 / 10 | 65504 | 9.8×10⁻⁴ | Overflow-prone; requires loss scaling. Mostly replaced by BF16. |
| FP8 E4M3 | 8 | 1 / 4 / 3 | 448 | 0.125 | H100 forward pass GEMM — weights + activations |
| FP8 E5M2 | 8 | 1 / 5 / 2 | 57344 | 0.25 | H100 backward pass GEMM — gradients (wider range needed) |
| INT8 | 8 | integer | 127 (signed) | — | Post-training quantization: SmoothQuant, GPTQ activation quant |
| INT4 | 4 | integer | 7 (signed) | — | Weight-only quant: AWQ, GGUF; consumer GPU inference |
4. Core Mechanism: Stable Softmax & Log-Sum-Exp
Background: Softmax computes exp(xᵢ) / Σⱼ exp(xⱼ). For a sequence length T=4096 and head dim dₖ=128, pre-softmax attention scores Q@Kᵀ/√dₖ can reach values of 20–50 for well-trained models. At score 88, FP32 exp() overflows to infinity. The naive formula returns NaN, and NaN gradients corrupt the entire training run silently.
Plan:
- Compute m = max(xᵢ) — the largest logit in the vector
- Shift all logits: yᵢ = xᵢ − m, so the largest element becomes 0
- Compute exp(yᵢ): all values are in (0, 1], no overflow possible
- Divide by Σ exp(yᵢ) — denominator is safe, all terms ≤ 1
- Prove equivalence: the constant exp(−m) cancels in numerator and denominator
Walkthrough — logits = [1000, 1001, 1002]:
─── Naive (broken) ─────────────────────────────────────────────────────────────
Step 1: exp([1000, 1001, 1002])
= [inf, inf, inf] (FP32 saturates at ~1.8e308; exp(710) ≈ inf)
Step 2: inf / (inf + inf + inf)
= NaN (IEEE 754: 0/0 or ∞/∞ → NaN)
→ cross_entropy = -log(NaN) = NaN → gradients NaN → optimizer corrupted
─── Stable (correct) ───────────────────────────────────────────────────────────
Step 1: m = max([1000, 1001, 1002]) = 1002
Step 2: y = x - m = [-2, -1, 0]
Step 3: exp(y) = [0.1353, 0.3679, 1.0000] all in (0, 1] → safe
Step 4: sum = 1.5032
Step 5: softmax = [0.0900, 0.2447, 0.6652] ← correct
─── Log-sum-exp (for cross-entropy) ────────────────────────────────────────────
log(Σ exp(xᵢ)) = m + log(Σ exp(xᵢ - m)) # m cancels; RHS is safe
= 1002 + log(1.5032)
= 1002 + 0.4076 = 1002.408
log_softmax[i] = xᵢ - log(Σ exp(xⱼ)) # stable: subtract scalar, no exp of large value
= [1000, 1001, 1002] - 1002.408
= [-2.408, -1.408, -0.408]
cross_entropy (target = index 2) = -log_softmax[2] = 0.408 ← correctMixed Precision Training & Loss Scaling
FP16 underflows (rounds to 0) for values below ~6×10⁻⁸. Gradient values in deep networks are often this small. Loss scaling multiplies the loss by a large scalar before backward pass, rescales gradients back before the optimizer step, and reduces the scale factor when inf/NaN appears:
# PyTorch GradScaler — dynamic loss scaling for FP16 training
scale = 32768.0 # initial scale (2^15); doubled slowly
loss_scaled = loss * scale # amplify before backward
loss_scaled.backward() # all gradients are ~true_grads * scale
if any(inf or nan in grads):
scale /= 2.0 # back off if overflow detected
optimizer.zero_grad(); continue # skip this step
grads /= scale # restore true gradient magnitude
clip_grad_norm_(params, max_norm=1.0) # clip AFTER unscaling
optimizer.step()
scale = min(scale * 2.0, 65536.0) # slowly grow scale back up
# ─── With BF16 ───────────────────────────────────────────────────────────────
# Loss scaling is NOT needed: BF16 max = 3.4e38 = same as FP32.
# torch.autocast(device_type='cuda', dtype=torch.bfloat16) is the modern recipe.
# ─── With FP8 (H100, TransformerEngine) ──────────────────────────────────────
# Per-tensor dynamic scaling: scale_factor = amax(tensor) / fp8_max.
# TransformerEngine DelayedScaling computes amax from recent steps.5. Minimal Demo
Demo A — Float Format Inspector
Enter any float to see its IEEE 754 bit decomposition in FP32, BF16, and FP16. Try 70000 to see FP16 overflow. Try 3.14 to compare mantissa precision between formats. Try 1e38 to see how BF16 handles FP32’s full range.
Demo B — Stable Softmax & Log-Sum-Exp
Enter space-separated logits to see naive vs stable softmax side by side. Try 1000 1001 1002 to trigger overflow in naive. Try 1.0 2.0 3.0 to confirm both agree on small inputs.
6. Production / Source Pointers
| Concept | Source file | Key function / note |
|---|---|---|
| Stable log-softmax | pytorch/aten/src/ATen/native/SoftMax.cpp | log_softmax_lastdim_kernel |
| Stable cross-entropy | torch/nn/functional.py | F.cross_entropy → nll_loss + log_softmax |
| FP16 loss scaling | torch/cuda/amp/grad_scaler.py | GradScaler.scale / unscale_ / step / update |
| BF16 autocast | torch/amp/autocast_mode.py | torch.autocast(dtype=torch.bfloat16) |
| FP8 mixed precision | NVIDIA/TransformerEngine | te.Linear; auto per-tensor scaling via DelayedScaling |
| Online softmax (FA) | Dao-AILab/flash-attention (csrc/flash_attn/) | Tile-by-tile max tracking — covers the same stable principle; detailed in §XIV |
7. References
Papers
- Micikevicius et al. (2017) — Mixed Precision Training (arXiv:1710.03740) — FP16 + loss scaling for deep learning
- Micikevicius et al. (2022) — FP8 Formats for Deep Learning (arXiv:2209.05433) — E4M3/E5M2 and H100 FP8 training recipe
- Dao et al. (2022) — FlashAttention (arXiv:2205.14135) — online softmax for IO-efficient attention
- Shannon (1948) — A Mathematical Theory of Communication — Bell System Technical Journal — foundational information theory
- Kullback & Leibler (1951) — On Information and Sufficiency — Annals of Mathematical Statistics
- Ouyang et al. (2022) — Training Language Models to Follow Instructions with Human Feedback (InstructGPT, arXiv:2203.02155) — PPO KL penalty
- Rafailov et al. (2023) — Direct Preference Optimization (arXiv:2305.18290) — KL in DPO derivation
- Goldberg (1991) — What Every Computer Scientist Should Know About Floating-Point Arithmetic — ACM Computing Surveys
Lectures
- MIT 6.S191 (2024) — Lecture 1: Intro to Deep Learning — softmax, cross-entropy loss derivation
- Stanford CS229 (2023) — Lecture 2: Supervised Learning — MLE, cross-entropy, logistic regression
- Stanford CS224n (2024) — Lecture 2: Language Models — perplexity definition and evaluation
- CMU 10-708 PGM (2023) — KL divergence, variational inference, ELBO
- NVIDIA GTC 2022 — FP8 Training with Transformer Engine — practical H100 FP8 recipe
- David MacKay — Cambridge Information Theory lectures — entropy, mutual information, source coding
Textbooks
- MacKay — Information Theory, Inference, and Learning Algorithms (free PDF) — Chapters 2–8
- Cover & Thomas — Elements of Information Theory (2nd ed.) — the standard graduate reference
- Bishop — Pattern Recognition and Machine Learning — Chapter 1.6 (information theory)
- Murphy — Probabilistic Machine Learning Vol 1 (2022, free draft) — Chapter 2, 5
- Goodfellow, Bengio & Courville — Deep Learning (2016) — Chapter 3.13 (information theory)
Code / Repos
NVIDIA/TransformerEngine— FP8 training on H100 with automatic per-tensor scalingpytorch/pytorch—torch.cuda.amp.GradScalerfor FP16 trainingDao-AILab/flash-attention— online softmax (csrc/flash_attn/src/flash_fwd_kernel.h)karpathy/nanoGPT— BF16 training withtorch.autocast
Blog Posts
- Hugging Face — A Visual Guide to Quantization — FP8 / INT8 / INT4 tradeoffs in practice
- Lilian Weng — The Math Behind RLHF — KL divergence in PPO and DPO
- Christopher Olah — Visual Information Theory — distill.pub-style entropy and KL visualization
- Tim Dettmers — A Gentle Introduction to 8-bit Matrix Multiplication for Transformers
- PyTorch blog — Automatic Mixed Precision examples — practical BF16 / FP16 usage
8. Interview Prep
Q1. Derive the gradient of softmax cross-entropy. What does it simplify to?
L = −Σ yᵢ log ŷᵢ where ŷ = softmax(z). ∂ŷⱼ/∂zᵢ = ŷⱼ(δᵢⱼ − ŷᵢ). Chain rule: ∂L/∂zᵢ = −Σⱼ yⱼ(∂log ŷⱼ/∂zᵢ) = −Σⱼ yⱼ(δᵢⱼ − ŷᵢ) = −yᵢ + ŷᵢ Σⱼ yⱼ = ŷᵢ − yᵢ (using Σyⱼ=1). Result: ∂L/∂z = ŷ − y — the prediction minus the one-hot label. Elegant and cheap to compute.
Q2. Why is BF16 preferred over FP16 for LLM training?
BF16 keeps FP32's 8-bit exponent giving dynamic range ±3.4×10³⁸. FP16 uses only 5 exponent bits → max ±65504. Gradient norms during early LLM training easily exceed 65504, causing FP16 to silently overflow to inf/NaN. With BF16, loss scaling is unnecessary. The cost is fewer mantissa bits (7 vs 10), but LLM training is empirically robust to this lower precision.
Q3. Explain the log-sum-exp trick and prove it gives the same result as naive softmax.
log Σ exp(xᵢ) = m + log Σ exp(xᵢ − m) where m = max(xᵢ). Proof: Σ exp(xᵢ − m) = Σ exp(xᵢ)/exp(m), so log(Σ exp(xᵢ)/exp(m)) = log(Σ exp(xᵢ)) − m, adding m gives identity. Shifting by max ensures all exp arguments ≤ 0 → no overflow. Used in F.cross_entropy and FlashAttention's tile-wise online softmax.
Q4. Compute the perplexity if the average NLL loss is 2.3 nats.
PPL = exp(NLL) = exp(2.3) ≈ 9.97 ≈ 10. Cross-check in bits: 2.3 nats ÷ ln(2) ≈ 3.32 bits/token, so the model effectively chooses among ~2^3.32 ≈ 10 equally-likely next tokens per position.
Q5. What is the difference between KL(p‖q) and KL(q‖p)? When does it matter?
KL(p‖q) → ∞ when p > 0 and q ≈ 0 (model misses true events). KL(q‖p) → ∞ when q > 0 and p = 0 (model wastes mass on impossible events). In RLHF: KL(π‖π_ref) in PPO prevents the policy from assigning probability to incoherent sequences. In variational inference: using reverse KL(q‖p) produces mode-seeking behavior, collapsing q to one posterior mode.
Q6. How does FP8 training work on H100, and what are the two FP8 formats?
H100 tensor cores support FP8 matrix multiply with FP32 accumulation. E4M3 (max 448) is used for forward-pass weights and activations where precision matters. E5M2 (max 57344) is used for backward-pass gradients where wider range is needed. NVIDIA TransformerEngine maintains per-tensor scale factors (DelayedScaling), mapping each tensor's max value into the FP8 representable range each step.
Q7. What is perplexity and what makes two perplexity numbers incomparable?
PPL = exp(−1/N Σ log p(xᵢ|x<ᵢ)) — the geometric mean of per-token inverse probability, or the average branching factor. Two PPL numbers are incomparable if they differ in: (1) tokenizer — larger vocab gives higher per-token probability by covering more characters per token; (2) test set — domain matters; (3) context window — truncating context hurts long-document models. Always report all three.
Q8. Why does PyTorch's F.cross_entropy use log_softmax internally rather than softmax + log?
log(softmax(x)ᵢ) = xᵢ − log Σ exp(xⱼ). The second term is log_sum_exp, computed stably as m + log Σ exp(xⱼ − m). Computing softmax first (exp then divide) and then log loses precision: small softmax outputs rounded to 0 in low-precision formats give log(0) = −∞. Fused log_softmax avoids materializing the softmax distribution and uses log_sum_exp directly — one fewer round-trip through low precision.