Part IV — Deep Learning

§ 9 — Neural Network from Scratch

Scalar autograd, 2-layer MLP, activations (sigmoid → SwiGLU), cross-entropy from logits, and manual backprop — Karpathy micrograd style, from first principles.

1. Overview

Every deep learning framework — PyTorch, JAX, TensorFlow — is built on the same three-part loop: a forward pass that computes predictions, a loss function that measures error, and a backward pass (autograd) that propagates gradients from the loss back to every parameter. Understanding this loop at the scalar level — before tensors, before GPUs — is the most important foundation in the field. Everything else is engineering on top of it.

This section builds autograd from scratch using scalar Values linked in a directed acyclic graph (DAG). We then stack scalars into a 2-layer MLP, train it on XOR, and arrive at exactly the code that Karpathy's micrograd implements — except you will have derived every line yourself.

2. The Autograd Engine

Autograd works in two phases. During the forward pass, each operation (add, mul, relu, …) creates a new Value node and records its inputs as children — silently building a DAG. During the backward pass, we visit nodes in reverse topological order and apply the chain rule at each one.

The key invariant: every node stores a _backward closure that, when called, accumulates its contribution to its inputs' gradients (using +=, not =). Accumulation is essential because one input can feed into multiple downstream nodes (multi-use variables).

Value node — fields & responsibilities

Field	Type	Role
data	float	Forward-pass value at this node
grad	float	dL/d(this node) — accumulated during backward
_prev	set[Value]	DAG edges: which nodes are inputs to this one
_backward	closure	Applies chain rule: accumulates grads into inputs
_op	str	Label for debugging (e.g. `'+'`, `'relu'`)

Autograd DAG for L = (a + b) × c

With a=2, b=3, c=5 → L=25. Backward seeds dL/dL = 1, then propagates: the mul node sees local grad for a+b = c = 5 and local grad for c = a+b = 5; the add node distributes 5 equally to a and b.

Local gradients for common operations

Operation	Output	d(out)/da	d(out)/db
a + b	a + b	1	1
a * b	a × b	b	a
a ** n	aⁿ	n · aⁿ⁻¹	—
relu(a)	max(0, a)	1 if a > 0 else 0	—
sigmoid(a)	σ(a)	σ(a) · (1 − σ(a))	—
log(a)	ln(a)	1 / a	—

3. Activation Functions

Without a non-linear activation between layers, any stack of linear transforms collapses to a single matrix multiplication — no representational gain from depth. The choice of activation has evolved significantly: sigmoid/tanh dominated the 1990s, ReLU unlocked deep networks in the 2010s, and smooth gated variants (GELU, SwiGLU) now define LLM FFN layers.

Detailed comparison

Name	Formula	Gradient	Used in	Key issue
Sigmoid	1/(1+e⁻ˣ)	σ(1−σ) ≤ 0.25	Output layer (binary)	Vanishing gradient; not zero-centered
Tanh	(eˣ−e⁻ˣ)/(eˣ+e⁻ˣ)	1−tanh² ≤ 1	RNN hidden states	Vanishing gradient in very deep nets
ReLU	max(0, x)	1 if x>0 else 0	ResNet, VGG, CNN hidden	Dying neurons (grad=0 for x≤0 forever)
GELU	x · Φ(x)	Φ(x) + x·φ(x)	BERT, GPT-2, GPT-3	Slightly more compute; no dying neurons
SiLU / Swish	x · σ(x)	σ(x)+x·σ(x)·(1−σ(x))	LLaMA FFN gate, EfficientNet	Non-monotone; smooth self-gated
SwiGLU	SiLU(W₁x) ⊙ (W₂x)	Gated; product of two branches	PaLM, LLaMA-2, Mistral, GPT-4	Needs 3 weight matrices; ~8/3 expansion

Why did LLMs switch from GELU to SwiGLU? Noam Shazeer's 2020 paper showed gated linear units consistently outperform un-gated versions on language modelling benchmarks. The intuition: the gating mechanism (W₂x) acts as a learned soft mask on the activated branch, giving the model additional expressiveness without extra depth.

4. The Two-Layer MLP

A 2-layer MLP (one hidden layer) with a ReLU and a sigmoid output is the simplest universal approximator. Its forward pass involves two affine transforms separated by a non-linearity. The diagram below labels every intermediate tensor's shape, which you should be able to reproduce from memory for any network depth.

Parameter count

With input dim 2, hidden dim H, output dim 1:

W1 [H×2] + b1 [H] → 2H + H = 3H params
W2 [1×H] + b2 [1] → H + 1 params
Total: 4H + 1. At H=16: 65 parameters

Why XOR needs a hidden layer

XOR is not linearly separable: no single hyperplane can separate the two positive points (0,1) and (1,0) from the two negative points (0,0) and (1,1). A 2-layer MLP learns an intermediate representation (hidden layer) that maps the XOR problem into a linearly separable space before the final classification.

5. Core Mechanism — Backpropagation

Background: We have a single logistic neuron and want to update its weight w by gradient descent. We need dL/dw. The chain rule says: follow every path from L to w, multiply local gradients along each path, and add contributions from all paths.

Plan:

Build the computation graph during the forward pass (each op records its inputs).
Compute a topological sort so we visit every node before its outputs.
Seed the output node: L.grad = 1.0.
Visit each node in reverse topo order; call its _backward() closure.
Each closure accumulates (+=) the chain-rule contribution into input gradients.

Example — sigmoid neuron + BCE, step by step

Initial state: w=0.5, x=2.0, b=−0.3, y=1 (binary label). Computation: z = w·x + b, p = σ(z), L = −log(p).

Forward pass

Node	Expression	Value
z	0.5 × 2.0 + (−0.3)	0.700
p = σ(z)	1 / (1 + e⁻⁰·⁷)	0.668
L = −log(p)	−ln(0.668)	0.404

Backward pass

Node	Local grad (d out / d input)	dL / d(node)	How
L	—	1.0	seed
p	−1/p = −1/0.668 = −1.497	−1.497	1.0 × (−1/0.668)
z	p(1−p) = 0.668×0.332 = 0.222	−0.332	−1.497 × 0.222 = p−y
w	x = 2.0	−0.664	−0.332 × 2.0
b	1	−0.332	−0.332 × 1

Classic result: For sigmoid + BCE, dL/dz = p − y. This extremely clean form arises because the gradient of log-sigmoid exactly cancels the σ(1−σ) denominator. Update rule: w -= lr × (p − y) × x.

6. Cross-Entropy & Numerical Stability

Binary cross-entropy (BCE)

For a single example with sigmoid output p = σ(z) and label y ∈ {0, 1}:

L = -[y·log(p) + (1-y)·log(1-p)]

Naive implementation computes p = sigmoid(z) first, then takes log(p). This causes numerical problems: if z is very negative, p ≈ 0 and log(0) = −∞. PyTorch's BCEWithLogitsLoss avoids this by computing directly from the logit:

L = max(z, 0) - z·y + log(1 + exp(-|z|))   # numerically stable BCE

Softmax cross-entropy (multi-class)

For C classes with logit vector z ∈ ℝᶜ and one-hot label y:

p = softmax(z)      = exp(z) / sum(exp(z))
L = cross-entropy   = -z[y] + log(sum(exp(z)))

The log-sum-exp trick stabilises the denominator. Subtracting max(z) before exp prevents overflow while leaving the result unchanged:

log(sum(exp(z))) = max(z) + log(sum(exp(z - max(z))))
                    ↑                        ↑
               stable shift          all terms ≤ 1 after shift

Always use logit-space losses in practice. PyTorch: BCEWithLogitsLoss and CrossEntropyLoss (which internally calls log_softmax) handle numerical stability for you. Never apply sigmoid/softmax before passing to these loss functions.

7. Minimal Demo

Demo 4301 — Scalar Autograd (micrograd)

A complete, self-contained implementation of scalar autograd. Runs two experiments: (1) chain rule on L=(a+b)×c, verifying gradients match manual calculation; (2) a single logistic neuron with BCE loss, confirming dL/dw = (p−y)×x.

Scalar Autograd (micrograd) — C Demo

stdin (optional)

Demo 4302 — 2-Layer MLP Trainer on XOR

A 2-layer MLP [2 → 16 → 1] trained with vanilla gradient descent on the XOR dataset. Uses Kaiming He initialisation for the weights. Prints loss and predictions every 100 epochs. You should see the loss converge and accuracy reach 100% by epoch ~200.

2-Layer MLP on XOR — C Demo

stdin (optional)

8. From Scalars to Tensors

The scalar micrograd engine is pedagogically complete, but industrial autograd engines operate on tensors (multi-dimensional arrays). The key difference:

Batching: process B examples simultaneously → one forward pass, one backward pass, gradients averaged over B.
BLAS/cuBLAS: matmul x @ W.T replaces B×H×in scalar multiplications with a single GPU kernel call.
Gradient tensors: W.grad is a tensor of the same shape as W; each entry is dL/dW[i,j].
In-place ops: PyTorch tracks tensor versions to detect in-place modifications that would corrupt the backward graph.

PyTorch's torch.autograd uses the same DAG + topo-sort principle as micrograd, but each node is a tensor operation implemented in C++/CUDA. The Python-level API you call (loss.backward()) triggers the same recursive backward walk.

9. Production / Source Pointers

Concept	Repo	File / Function
Scalar autograd (reference impl)	karpathy/micrograd	`micrograd/engine.py · Value`
Tensor autograd engine	pytorch/pytorch	`torch/csrc/autograd/engine.cpp · execute_graph_task`
Python-level backward hook	pytorch/pytorch	`torch/autograd/__init__.py · backward()`
GELU implementation	pytorch/pytorch	`aten/src/ATen/native/Activation.cpp · gelu_kernel`
SwiGLU FFN (in transformers)	meta-llama/llama	`llama/model.py · FeedForward.forward`
Numerically stable BCE	pytorch/pytorch	`torch/nn/functional.py · binary_cross_entropy_with_logits`
nanoGPT MLP block (GELU)	karpathy/nanoGPT	`model.py · MLP.forward`

10. References

Papers

Rumelhart, Hinton & Williams (1986) — Learning representations by back-propagating errors. Nature 323: 533–536. The foundational backprop paper.
He et al. (2015) — Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852. Introduces Kaiming initialisation and PReLU.
Glorot & Bengio (2010) — Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010. Xavier initialisation, saturating activation analysis.
Nair & Hinton (2010) — Rectified Linear Units Improve Restricted Boltzmann Machines. ICML 2010.
Hendrycks & Gimpel (2016) — Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
Ramachandran et al. (2017) — Searching for Activation Functions (Swish/SiLU). arXiv:1710.05941.
Shazeer (2020) — GLU Variants Improve Transformer. arXiv:2002.05202. Introduces SwiGLU.
Cybenko (1989) — Approximation by Superpositions of a Sigmoidal Function. Universal approximation theorem.

Lectures

Karpathy — Zero-to-Hero: "The spelled-out intro to neural networks and backpropagation: building micrograd" (YouTube, 2022). The canonical from-scratch walkthrough.
Stanford CS231n — Lecture 4: Backpropagation and Neural Networks. Available at cs231n.stanford.edu.
MIT 6.S191 — Lecture 1: Intro to Deep Learning (Alexander Amini). Covers forward/backward pass, activation functions.
NYU DS-GA 1008 — Yann LeCun, Lecture 3: Backpropagation. Strong emphasis on Jacobians and computational graphs.
fast.ai Practical Deep Learning for Coders — Lesson 13: Backprop from scratch (Jeremy Howard).
CMU 10-414/714 — Needle: a deep learning system from scratch. Covers autograd, tensor operations, GPU kernels.

Textbooks

Goodfellow, Bengio & Courville — Deep Learning (MIT Press, 2016). Chapters 6 (MLP), 8 (optimisation), free at deeplearningbook.org.
Zhang et al. — Dive into Deep Learning (d2l.ai). Chapters 2–5 cover autograd, linear models, MLP from scratch with interactive code.
Nielsen — Neural Networks and Deep Learning (free online). Chapter 2: How the backpropagation algorithm works — excellent four-equation derivation.

Code / Repos

karpathy/micrograd — ~100-line scalar autograd engine; the reference implementation.
karpathy/nanoGPT — GPT-2 in clean PyTorch; shows how micrograd concepts scale to transformers.
pytorch/pytorch — torch/csrc/autograd/ for the C++ engine; torch/nn/ for standard layers.
google/jax — Functional autograd via jax.grad; compare its closure-based approach to PyTorch's tape.

Blog posts

Colah — Calculus on Computational Graphs: Backpropagation (colah.github.io). Visual explanation of the topo-sort approach.
Lilian Weng — From Autoencoder to Beta-VAE; see also her backprop deep-dive series.
Simon Lacoste-Julien — A Conceptual Introduction to Hamiltonian Monte Carlo — not directly related, but her treatment of computational graphs is excellent.
PyTorch docs — Autograd mechanics (pytorch.org/docs). Explains version counters, in-place ops, and saved tensors.

11. Interview Prep

Derive backprop through sigmoid + cross-entropy from scratch. What is dL/dz?
L = −log(σ(z)) for y=1. dL/dp = −1/p, dp/dz = p(1−p), so dL/dz = (−1/p)·p(1−p) = −(1−p) = p−1. For general y: dL/dz = p − y. The clean form arises because log-sigmoid gradient exactly cancels the sigmoid derivative denominator — a happy coincidence that motivates using BCE with logits.
Why does a variable used in two places in the graph accumulate gradients with +=?
If variable a feeds into nodes f and g, the total effect of a on L is dL/da = dL/df · df/da + dL/dg · dg/da (sum of chain-rule paths). += accumulates contributions from each path. Using = (assignment) would overwrite earlier contributions — a common bug in hand-coded backprop.
Why is ReLU preferred over sigmoid in hidden layers of deep networks?
Sigmoid gradient max is 0.25; multiplied across L layers gives (0.25)^L → 0 (vanishing gradient). ReLU gradient is 1 for positive activations, so gradients propagate unchanged through active neurons. This lets networks with tens or hundreds of layers train effectively. The tradeoff: dying neurons (permanently zero gradient) mitigated by He initialisation, Leaky ReLU, or GELU.
What is the difference between forward-mode and reverse-mode autodiff?
Forward mode computes one column of the Jacobian per pass (cost ∝ number of inputs). Reverse mode computes one row per pass (cost ∝ number of outputs). For a scalar loss (1 output, many parameters), reverse mode (backprop) needs only one backward pass to compute all gradients — O(1) backward instead of O(parameters) forward passes. This is why all DL frameworks use reverse mode.
Explain the log-sum-exp trick and why it matters.
log(Σ exp(zᵢ)) is numerically unstable: exp(1000) overflows float32. Subtracting max(z): log(Σ exp(zᵢ)) = max(z) + log(Σ exp(zᵢ − max(z))). After the shift, all exponents ≤ 0, so exp values ∈ (0, 1]. The result is mathematically identical but numerically stable. PyTorch's CrossEntropyLoss uses log_softmax which implements this internally.
Why is cross-entropy preferred over MSE for classification?
MSE + sigmoid gives dL/dz = (p−y)·p(1−p). Near saturation (p≈0 or p≈1), the p(1−p) term kills gradients even when the prediction is badly wrong. Cross-entropy + sigmoid gives dL/dz = p−y — gradient is large exactly when the prediction is far from the label. BCE is also the maximum-likelihood objective under a Bernoulli model; MSE implies Gaussian noise.
What does Kaiming (He) initialisation do and why does it matter for ReLU networks?
He 2015: initialise weights from N(0, √(2/fan_in)). The factor 2 corrects for ReLU zeroing half the neurons on average, keeping the variance of activations and gradients stable across layers. Without this, deep ReLU networks either vanish (variance collapses to 0) or explode. Xavier initialisation uses √(1/fan_in), tuned for symmetric activations like sigmoid/tanh.
Why does SwiGLU need 8/3 × d_model expansion instead of 4 × d_model?
SwiGLU replaces a single FFN (two matrices: W1, W2) with a gated FFN (three matrices: W_gate, W_up, W_down). To keep total parameter count equal to a 4× expansion FFN, the hidden dim must be 2/3 × 4 × d = 8/3 × d. PaLM uses this ratio; LLaMA rounds to the nearest multiple of 256 for memory alignment.