Part IV — Deep Learning

§ 11 Convolutional Networks

Convolution math, receptive fields, architecture lineage from LeNet to ConvNeXt, residual connections, depthwise-separable ops, and transfer learning — the foundation that dominated computer vision for a decade and still shapes modern architectures.

1. Overview

A convolutional neural network (CNN) exploits three structural priors of image data: local connectivity (nearby pixels are correlated), weight sharing (the same edge detector is useful everywhere in the image), and translation equivariance (sliding the filter across the image detects the same feature wherever it appears). These priors reduce parameters by orders of magnitude compared to a fully-connected network while encoding strong inductive biases that generalize well from limited data.

A modern CNN backbone follows a stage-by-stage pipeline: a stem layer reduces the spatial resolution immediately (by 4× in ResNet and ConvNeXt), then repeated blocks in each stage gradually increase channel depth while decreasing spatial size. A global average pool collapses the spatial dimensions before a linear classifier.

2. Convolution Math

For a 2-D feature map of shape [C_in, H_in, W_in], a kernel of shape [C_out, C_in, K, K], stride S, padding P, and dilation D, the output spatial size is:

H_out = floor( (H_in + 2·P − D·(K−1) − 1) / S  +  1 )
W_out = floor( (W_in + 2·P − D·(K−1) − 1) / S  +  1 )

Parameter count (no bias):  C_out × C_in × K × K
FLOPs (2 per multiply-add): 2 × C_out × C_in × K × K × H_out × W_out
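The formulas above can be checked with a few lines of Python. This is a minimal sketch, assuming a square kernel and the same stride, padding, and dilation on both axes; the function names and the example stem configuration (3 → 64 channels, 7×7, stride 2, padding 3) are illustrative choices, not taken from any library.

```python
def conv2d_output_size(h_in, w_in, k, s=1, p=0, d=1):
    """Return (H_out, W_out) for a square K x K kernel."""
    def one_axis(n):
        # Integer floor division implements the floor() in the formula.
        return (n + 2 * p - d * (k - 1) - 1) // s + 1
    return one_axis(h_in), one_axis(w_in)


def conv2d_params(c_in, c_out, k):
    """Weight count without bias: C_out x C_in x K x K."""
    return c_out * c_in * k * k


def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """FLOPs, counting each multiply-add as two operations."""
    return 2 * conv2d_params(c_in, c_out, k) * h_out * w_out


# Hypothetical stem layer: 224x224 RGB input, 64 filters of 7x7, stride 2, padding 3.
h_out, w_out = conv2d_output_size(224, 224, k=7, s=2, p=3)
print(h_out, w_out)                          # 112 112
print(conv2d_params(3, 64, 7))               # 9408
print(conv2d_flops(3, 64, 7, h_out, w_out))  # 236027904
```

Changing `s`, `p`, or `d` in the call shows directly how each parameter moves the output size.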

Four Key Parameters

Param	Default	Effect on output size	Common usage
K	3	Shrinks output by K−1 in total ((K−1)/2 per side) when P=0, S=1, D=1	3×3 dominates (two stacked 3×3 convs match the RF of one 5×5 with fewer parameters)
S	1	Divides spatial size by S (approximately)	S=2 to downsample (replaces max-pool in many modern nets)
P	0	P=(K−1)/2 gives "same" padding for odd K with S=1, D=1 (output = input size)	Same-padding inside a stage keeps spatial size fixed, so resolution changes only at strided layers
D	1	Inserts D−1 zeros between kernel elements; expands RF without more params	Semantic segmentation (DeepLab), audio (WaveNet)

3. Receptive Field

The receptive field (RF) of a neuron is the set of input pixels that can influence its value. For a stack of stride-1 convolutions with kernel size K:

RF after L layers  =  1 + L · (K − 1)     # stride 1 only

Examples (K=3, S=1):
  5  layers → RF = 11
  10 layers → RF = 21
  50 layers → RF = 101   ← a plain 50-layer stride-1 stack would cover only 101 of 224 input pixels per axis

This linear growth is surprisingly slow. A stride-2 operation (pool or strided conv) doubles the RF increment contributed by every subsequent layer: after a cumulative stride J, each further K×K layer adds J·(K−1) input pixels instead of K−1. That is why modern architectures downsample aggressively in the stem, then use same-padding blocks: the stem stride earns a large RF cheaply.
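The effect of stride on the theoretical RF can be sketched with the standard recurrence: each layer adds (K − 1) times the cumulative stride of the layers before it. The function name and the two example stacks below are illustrative assumptions; dilation is left out to keep the sketch short.

```python
def receptive_field(layers):
    """Theoretical RF of a conv stack.

    layers: list of (kernel_size, stride) tuples, ordered input to output.
    """
    rf = 1      # a single input pixel sees only itself
    jump = 1    # spacing, in input pixels, between neighbouring units of the current layer
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the RF by (K-1) * cumulative stride
        jump *= s
    return rf


plain = [(3, 1)] * 5                  # five stride-1 3x3 convs
with_stem = [(3, 2)] + [(3, 1)] * 5   # one stride-2 conv placed in front

print(receptive_field(plain))      # 11
print(receptive_field(with_stem))  # 23
```

One extra stride-2 layer at the front roughly doubles the RF of the same five-layer stack, which is the cheap gain the stem provides.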

Effective receptive field (ERF) is smaller than the theoretical RF. Central pixels contribute much more strongly than peripheral ones because many more forward and backward paths pass through them. Luo et al. 2016 find that the ERF is roughly Gaussian and that its width grows as O(√L) rather than O(L). A large theoretical RF does not guarantee the network actually uses distant context.

4. Architecture Evolution

Each generation of ImageNet-era CNN architectures solved a concrete problem the previous generation left open. The lineage below is a straight line of ablations, not competing schools:

Architecture	Year	Params	IN-1k top-1	Key innovation
LeNet-5	1998	60K	99% (MNIST)	Conv→Pool pattern; end-to-end learning
AlexNet	2012	62M	57.1%	ReLU, Dropout, GPU training, data augmentation
VGG-16	2014	138M	71.5%	Deep stack of 3×3 convs replaces large kernels
Inception v1	2014	6.8M	69.8%	Parallel 1×1/3×3/5×5 branches; bottleneck
ResNet-50	2015	25M	76.0%	Skip connections enable 100+ layer training
DenseNet-121	2016	7.9M	74.9%	Each layer receives all prior feature maps
MobileNetV2	2018	3.4M	71.8%	Depthwise-sep + inverted residuals
EfficientNet-B0	2019	5.3M	77.1%	NAS-designed baseline; compound scaling of depth/width/resolution
ConvNeXt-T	2022	28M	82.1%	Transformer design choices (LN, GELU, large kernel) in pure ConvNet

What ConvNeXt Borrowed from Transformers

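Liu et al. 2022 (arXiv:2201.03545) performed a systematic ablation starting from ResNet-50 and applying Swin Transformer design choices one by one. Five of the steps, with the ImageNet-1k top-1 change the paper reports for each:

Patchify stem (4×4 non-overlapping conv, stride 4) instead of 7×7 stride-2 conv plus max-pool → +0.1% (79.4% → 79.5%)
Depthwise 7×7 conv (larger kernel) → 79.9% → 80.6%, recovering the drop caused by first moving the depthwise conv to the top of the block
Inverted bottleneck (4× channel expansion in FFN-like MLP) → +0.1% (80.5% → 80.6%)
Replace ReLU with GELU → no change (80.6%)
Replace BatchNorm with LayerNorm → +0.1% (81.4% → 81.5%)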

5. ResNet — Core Mechanism

Background: Before ResNet, adding more layers to VGG-style nets did not improve accuracy — it made it worse, even on the training set. This was not overfitting; it was an optimisation failure. The identity mapping (copy input to output) is a trivially optimal solution for surplus layers, but gradient descent could not find it in plain networks because gradients vanished before reaching early layers.

Plan:

Reformulate the learning goal: instead of learning the desired mapping H(x), let each block learn the residual F(x) = H(x) − x.
Add an identity shortcut: output = F(x) + x.
If the block is useless, set F(x) = 0 (much easier than learning the identity).
The shortcut creates a direct gradient path: ∂L/∂x gets an additive identity term I, preventing vanishing regardless of depth.

Walkthrough: gradient flow in a 152-layer ResNet

Initial conditions: ResNet-152 with 50 residual blocks, each consisting of a 3-layer bottleneck (1×1→3×3→1×1). Assume a unit loss gradient at the output.

Step	What happens	Plain net (no skip)	ResNet (with skip)
1	Backward through block 50 (last)	g ← g · ∂F/∂x (multiply by Jacobian)	g ← g · (∂F/∂x + I) (identity adds 1)
2	After 50 blocks	g shrinks exponentially if ‖∂F/∂x‖ < 1	g stays near 1 via identity term — gradient highway
3	Practical effect	Layer 1 receives gradient ≈ 0; does not learn	All layers receive meaningful gradient; 152 layers converge
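The table can be illustrated with a deliberately crude scalar model. This is a toy sketch, not a simulation of ResNet-152: it assumes each block's Jacobian ∂F/∂x is a single number, here alternating between +0.1 and −0.1, and compares the product of Jacobians with and without the identity term.

```python
def backprop_scale(jacobians, skip):
    """Multiply a unit gradient back through the blocks, last block first.

    jacobians: scalar stand-ins for dF/dx of each block (a toy assumption).
    skip:      if True, each block contributes (dF/dx + 1) instead of dF/dx.
    """
    g = 1.0
    for j in reversed(jacobians):
        g *= (j + 1.0) if skip else j
    return g


# 50 blocks with small residual Jacobians of alternating sign (hypothetical values).
jacobians = [0.1 if i % 2 == 0 else -0.1 for i in range(50)]

plain = backprop_scale(jacobians, skip=False)
resid = backprop_scale(jacobians, skip=True)

print(f"plain net: {abs(plain):.1e}")   # on the order of 1e-50
print(f"resnet:    {resid:.3f}")        # about 0.778
```

The residual product stays near 1 only because the toy Jacobians are small; if ∂F/∂x were consistently large and positive, the same product would grow instead, which is one reason normalisation and careful initialisation still matter in practice.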

Projection Shortcut (when dimensions change)

When a block changes spatial size (stride 2) or channel depth, the identity shortcut cannot be added directly. A 1×1 projection conv is used: x_proj = W_s · x (1×1 conv, stride 2 if needed). He et al. show that this "option B" projection shortcut adds only a marginal improvement over zero-padding; the key is the skip connection itself, not the projection.

6. Special Operations

1×1 Convolution

A 1×1 conv is a per-pixel linear combination of channels — it does not look at spatial neighbors. Uses: (1) channel projection to increase or reduce C before an expensive 3×3; (2) bottleneck in ResNet and Inception to cut FLOPs; (3) mixing information across channels without touching spatial structure. Equivalent to a fully-connected layer applied identically at every spatial position.
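The "fully-connected layer at every pixel" view can be made concrete in plain Python. The sketch below assumes a feature map stored as nested lists indexed [channel][row][col] and a weight matrix indexed [C_out][C_in]; the function name and the tiny example values are hypothetical.

```python
def conv1x1(x, w):
    """Apply a 1x1 convolution.

    x: feature map as nested lists, shape [C_in][H][W]
    w: weight matrix as nested lists, shape [C_out][C_in]
    Returns a feature map of shape [C_out][H][W].
    """
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    return [
        [
            # Each output pixel is a dot product over channels at that same pixel.
            [sum(w_row[c] * x[c][i][j] for c in range(c_in)) for j in range(wd)]
            for i in range(h)
        ]
        for w_row in w
    ]


# Two input channels on a 2x2 grid, projected to three output channels.
x = [[[1, 2], [3, 4]],
     [[10, 20], [30, 40]]]
w = [[1, 1],    # sum of both channels
     [1, -1],   # difference of the channels
     [0, 2]]    # twice the second channel

out = conv1x1(x, w)
print(out[0])  # [[11, 22], [33, 44]]
```

No output pixel depends on any neighbouring pixel, which is why a 1×1 conv changes channel count without touching spatial structure.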

Pooling

Max pooling keeps the strongest activation in each K×K window, which is useful for detecting whether a feature exists. Average pooling keeps the mean, which is useful when feature density matters. Global average pooling (GAP) collapses the entire H×W spatial map to a single value per channel, eliminating spatial location entirely. GAP at the network tail replaced the large FC layers of AlexNet/VGG and removed most of their parameters: the first FC layer of VGG-16 alone holds 7·7·512·4096 ≈ 103M weights, whereas the ResNet-50 classifier after GAP holds 2048·1000 ≈ 2M. It also adds implicit regularisation.
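The three pooling variants differ only in the reduction applied to each window. A minimal sketch on a single-channel map, assuming non-overlapping windows (stride equal to K) and hypothetical helper names:

```python
def pool2d(x, k, reduce):
    """Non-overlapping K x K pooling (stride = K) over a single-channel map."""
    h, w = len(x), len(x[0])
    return [
        [reduce([x[i + di][j + dj] for di in range(k) for dj in range(k)])
         for j in range(0, w - k + 1, k)]
        for i in range(0, h - k + 1, k)
    ]


def mean(values):
    return sum(values) / len(values)


def global_avg_pool(x):
    """Collapse the whole H x W map to one number."""
    return mean([v for row in x for v in row])


x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

print(pool2d(x, 2, max))    # [[6, 8], [14, 16]]
print(pool2d(x, 2, mean))   # [[3.5, 5.5], [11.5, 13.5]]
print(global_avg_pool(x))   # 8.5
```

In a real network the same reduction runs independently on every channel, so GAP turns a [C, H, W] tensor into C numbers.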

Depthwise-Separable Convolution

Factorize a standard conv into two steps: a depthwise conv (one K×K filter per channel, no cross-channel mixing) followed by a pointwise 1×1 conv (mix channels, no spatial info). The parameter count drops from K²·C_in·C_out to K²·C_in + C_in·C_out. For K=3, C_in=C_out=256, the reduction is roughly 8.7× (589,824 vs 67,840 weights).

Group Convolution

Divide C_in channels into G groups; each group has its own filter bank. Each output channel sees only C_in/G input channels. Depthwise conv is the extreme case G = C_in. ResNeXt (arXiv:1611.05431) showed that increasing groups (with fixed total compute) improves accuracy — the "cardinality" dimension.
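Group and depthwise-separable parameter counts follow from one formula, since each output channel only sees C_in/G input channels. The sketch below uses hypothetical function names and ignores biases; the choice of G=32 is only an example value.

```python
def grouped_conv_params(c_in, c_out, k, groups=1):
    """Weights of a K x K conv with `groups` groups (no bias)."""
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    return c_out * (c_in // groups) * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise K x K conv (groups = C_in) followed by a pointwise 1x1 conv."""
    depthwise = grouped_conv_params(c_in, c_in, k, groups=c_in)
    pointwise = grouped_conv_params(c_in, c_out, 1)
    return depthwise + pointwise


standard = grouped_conv_params(256, 256, 3)             # 589824
grouped = grouped_conv_params(256, 256, 3, groups=32)   # 18432
separable = depthwise_separable_params(256, 256, 3)     # 67840

print(standard, grouped, separable)
print(round(standard / separable, 1))                   # 8.7
```

Setting `groups=1` recovers the standard conv and `groups=c_in` the depthwise case, so the two ends of the spectrum described above are the same function with different arguments.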

7. Minimal Demo — Convolution Playground

Change K, S, P, and D at the top, re-run, and observe: (1) how output spatial size changes, (2) what the Sobel-x edge detector returns on a gradient input, (3) how slowly receptive field grows with stacked stride-1 convs, (4) how many parameters depthwise-separable saves.
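A small playground along these lines can be sketched in plain Python. This is an illustrative stand-in under stated assumptions: a single-channel input, a fixed 3×3 Sobel-x kernel (so only S, P, and D are adjustable), and cross-correlation rather than flipped-kernel convolution, as deep learning frameworks do. It covers the first two observations, output size and the Sobel-x response.

```python
S, P, D = 1, 0, 1   # stride, padding, dilation: edit these and re-run

SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def conv2d(img, kernel, s=1, p=0, d=1):
    """Single-channel 2-D cross-correlation with zero padding."""
    k = len(kernel)
    h, w = len(img), len(img[0])
    # Zero-pad the input by p pixels on every side.
    padded = [[0] * (w + 2 * p) for _ in range(h + 2 * p)]
    for i in range(h):
        for j in range(w):
            padded[i + p][j + p] = img[i][j]
    span = d * (k - 1) + 1                  # extent of the dilated kernel
    h_out = (h + 2 * p - span) // s + 1
    w_out = (w + 2 * p - span) // s + 1
    out = []
    for oi in range(h_out):
        row = []
        for oj in range(w_out):
            acc = 0
            for a in range(k):
                for b in range(k):
                    acc += kernel[a][b] * padded[oi * s + a * d][oj * s + b * d]
            row.append(acc)
        out.append(row)
    return out


# Horizontal ramp: every pixel's value equals its column index.
ramp = [[col for col in range(6)] for _ in range(6)]

edges = conv2d(ramp, SOBEL_X, s=S, p=P, d=D)
print(len(edges), "x", len(edges[0]))   # 4 x 4 with the defaults above
for row in edges:
    print(row)                          # every entry is 8 with the defaults
```

On a linear ramp the Sobel-x response is constant, because the horizontal slope is the same everywhere; setting P=1 makes the border values differ, since the zero padding introduces an artificial edge.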

8. Transfer Learning

A CNN pre-trained on ImageNet-21k or ImageNet-1k has learned a rich hierarchy of features: edges → textures → parts → objects. These features transfer broadly because the low and mid-level filters match natural image statistics regardless of the downstream task.

Standard Recipe

Load pretrained backbone — e.g. timm.create_model('convnext_tiny', pretrained=True).
Replace the head — swap the 1000-class linear layer for a new one matching your class count.
Freeze backbone, train head first (5–10 epochs at LR=1e-3) — prevents early gradient noise from destroying pretrained features.
Unfreeze backbone, fine-tune at low LR (1e-5 to 1e-4, often with layer-wise LR decay) — early layers need smaller updates.
Use label smoothing + mixup — calibrated fine-tuning outperforms naive cross-entropy.
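The layer-wise LR decay mentioned in the recipe is just a geometric schedule over depth. A minimal, framework-free sketch follows; the function name, the four-layer example, and the decay factor of 0.9 are illustrative assumptions, and in practice the resulting values would be passed to an optimizer as per-parameter-group learning rates.

```python
def layerwise_lrs(num_layers, top_lr, decay):
    """Learning rate per backbone layer, index 0 = layer closest to the input.

    The layer closest to the output gets `top_lr`; each step toward the
    input multiplies the rate by `decay`.
    """
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Hypothetical 4-layer backbone fine-tuned at 1e-4 with 0.9x decay per layer.
lrs = layerwise_lrs(num_layers=4, top_lr=1e-4, decay=0.9)
for i, lr in enumerate(lrs):
    print(f"layer {i}: lr = {lr:.3e}")
```

Early layers end up with the smallest rates, which matches the intent of the recipe: generic low-level filters change little while task-specific late layers adapt faster.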

When to fine-tune vs train from scratch: if your dataset is <100K samples, always start from pretrained. If your domain is radically different from ImageNet (e.g. satellite imagery, microscopy), pretrained features still provide better init than random; the benefit is more pronounced for small datasets.

9. Production / Source Pointers

Concept	File / function
ResNet implementation	torchvision/models/resnet.py — BasicBlock, Bottleneck, ResNet._make_layer
Depthwise-sep conv	torchvision/models/mobilenetv2.py — InvertedResidual
EfficientNet compound scaling	torchvision/models/efficientnet.py — _efficientnet
ConvNeXt block	torchvision/models/convnext.py — CNBlock (uses LayerNorm + GELU)
timm (1000+ pretrained CNNs)	github.com/huggingface/pytorch-image-models — timm/models/
F.conv2d signature	torch/nn/functional.py — conv2d(input, weight, bias, stride, padding, dilation, groups)
CUDA conv kernel dispatch	aten/src/ATen/native/Convolution.cpp — at::convolution_overrideable
cuDNN autotuner	torch/backends/cudnn — benchmark=True selects fastest algorithm per input shape

10. References

Papers

LeCun et al. 1998 — Gradient-Based Learning Applied to Document Recognition (LeNet-5)
Krizhevsky, Sutskever, Hinton 2012 — ImageNet Classification with Deep CNNs (AlexNet; NeurIPS 2012)
Simonyan & Zisserman 2014 — Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG; arXiv:1409.1556)
Szegedy et al. 2014 — Going Deeper with Convolutions (Inception; arXiv:1409.4842)
He et al. 2015 — Deep Residual Learning for Image Recognition (ResNet; arXiv:1512.03385)
Huang et al. 2016 — Densely Connected Convolutional Networks (DenseNet; arXiv:1608.06993)
Sandler et al. 2018 — MobileNetV2: Inverted Residuals and Linear Bottlenecks (arXiv:1801.04381)
Tan & Le 2019 — EfficientNet: Rethinking Model Scaling for CNNs (arXiv:1905.11946)
Xie et al. 2017 — Aggregated Residual Transformations for Deep Neural Networks (ResNeXt; arXiv:1611.05431)
Liu et al. 2022 — A ConvNet for the 2020s (ConvNeXt; arXiv:2201.03545)
Luo et al. 2016 — Understanding the Effective Receptive Field in Deep CNNs (arXiv:1701.04128)
Howard et al. 2017 — MobileNets: Efficient CNNs for Mobile Vision Applications (arXiv:1704.04861)

Lectures

Stanford CS231n — Convolutional Neural Networks for Visual Recognition (Karpathy, Johnson, Fei-Fei Li)
MIT 6.S191 — Lecture 3: Convolutional Neural Networks (Alexander Amini)
NYU DS-GA 1008 — Yann LeCun: Convolutional Networks, Architectures, and Applications
fast.ai Practical Deep Learning — Lesson 1–5: ResNets, transfer learning, data augmentation
Stanford CS230 — Deep Learning (Andrew Ng): CNNs module

Textbooks

Goodfellow, Bengio, Courville — Deep Learning (MIT Press, free at deeplearningbook.org) — Chapter 9
Zhang et al. — Dive Into Deep Learning (d2l.ai, free) — Chapters 6–8
Prince — Understanding Deep Learning (free at udlbook.github.io) — Chapter 10

Code / Repos

pytorch/vision — torchvision model zoo (ResNet, EfficientNet, ConvNeXt)
huggingface/pytorch-image-models (timm) — 700+ pretrained CNN/ViT models
facebookresearch/ConvNeXt — original ConvNeXt implementation

Blog Posts

distill.pub — Feature Visualization (Olah et al. 2017): what CNN neurons actually detect
distill.pub — Building Blocks of Interpretability (Olah et al. 2018)
Karpathy blog — CS231n Convolutional Neural Networks for Visual Recognition (notes, freely accessible)
Lilian Weng — Object Detection for Dummies parts 1–4: CNN backbones in detection context

11. Interview Prep

Q1. Given an input of size 224×224, a kernel of 3×3, stride 2, and no padding, what is the output spatial size?

H_out = floor((224 + 0 − 1·(3−1) − 1) / 2 + 1) = floor(221/2 + 1) = floor(111.5) = 111. Output: 111×111.

Q2. Why did plain VGG-style nets saturate at ~19 layers while ResNet trained 152+ layers successfully?

The identity skip connection in ResNet provides a direct gradient highway: the block Jacobian becomes ∂F/∂x + I, so ∂L/∂x always receives an additive identity term that does not shrink with depth. In plain nets, the gradient must pass through every nonlinear transformation; if the layer Jacobians consistently have singular values < 1, the gradient shrinks exponentially with depth.

Q3. What is the parameter count of a 3×3 depthwise-separable conv replacing a standard conv from C_in=256 to C_out=256?

Standard: 3×3×256×256 = 589,824. Depthwise: 3×3×256 = 2,304. Pointwise: 256×256 = 65,536. Total sep: 67,840 — about 8.7× fewer.

Q4. A 7-layer stack of 3×3 stride-1 convs: what is the theoretical receptive field? What is the practical effective RF?

Theoretical RF = 1 + 7×(3−1) = 15, i.e. a 15×15 window. The effective RF (ERF) is smaller: its width grows roughly as O(√L) and central pixels dominate. As a rough estimate, ERF ≈ 7–9 pixels for L=7.

Q5. Explain why global average pooling replaced the large FC layers of AlexNet/VGG.

GAP collapses [B, C, H, W] → [B, C] with zero parameters (vs. FC which needs C·H·W×num_classes). It also acts as strong regularization (implicit averaging over spatial positions), improves spatial robustness, and allows the network to accept variable input sizes.

Q6. What did EfficientNet's compound scaling discover?

Scaling depth, width, and resolution simultaneously (compound coefficient φ with fixed ratios α, β, γ) outperforms scaling any single dimension alone. The baseline network (EfficientNet-B0) was found via NAS; the ratios α, β, γ were then set by a small grid search on that baseline. EfficientNet-B7 achieved SOTA with 8.4× fewer parameters than GPipe.
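The compound rule can be sketched numerically. The constants below (α=1.2, β=1.1, γ=1.15) are the values reported by Tan & Le 2019 and are not stated elsewhere in this section; the function name is hypothetical, and the released B1 to B7 models round these multipliers rather than using them exactly.

```python
# Ratios reported in the EfficientNet paper (an assumption of this sketch).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution bases


def compound_scale(phi):
    """Multipliers for depth, width, resolution, and the implied FLOPs factor."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs scale linearly with depth and quadratically with width and resolution.
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, res x{r:.2f}, FLOPs x{f:.2f}")
```

Because α·β²·γ² is close to 2, each unit increase in φ roughly doubles the compute budget while keeping the three dimensions in a fixed proportion.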

Q7. How does a dilated (atrous) conv grow the receptive field differently from stacked standard convs?

A standard 3×3 conv adds K−1=2 to the RF per layer (linear). A dilated conv with dilation D inserts D−1 zeros between kernel elements, giving an effective kernel size K_eff = K + (K−1)(D−1), which is 2D+1 for K=3. Using dilations 1,2,4,8 in successive stride-1 layers grows the RF as 3,7,15,31: exponential growth in depth with the same parameter count.
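The RF sequence in this answer is easy to reproduce. A minimal sketch, assuming stride-1 layers throughout and hypothetical function names:

```python
def effective_kernel(k, d):
    """Extent of a K-tap kernel with dilation D: K + (K-1)(D-1)."""
    return k + (k - 1) * (d - 1)


def stacked_rf(k, dilations):
    """RF after each stride-1 layer in a stack with the given dilations."""
    rf, history = 1, []
    for d in dilations:
        rf += effective_kernel(k, d) - 1   # each layer widens the RF by K_eff - 1
        history.append(rf)
    return history


print(stacked_rf(3, [1, 1, 1, 1]))   # [3, 5, 7, 9]
print(stacked_rf(3, [1, 2, 4, 8]))   # [3, 7, 15, 31]
```

Both stacks hold the same number of weights (four 3×3 kernels); only the spacing of the taps differs.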

Q8. When fine-tuning a pretrained ResNet on a small custom dataset, why do you freeze early layers and use lower LR for the backbone?

Early layers encode low-level features (edges, colors) shared across most visual tasks, so overwriting them wastes pretraining. The new head starts from random weights and produces large, noisy gradients; with a high backbone LR those gradients would overwrite the learned features. Layer-wise LR decay (e.g. 0.9× per layer from output to input) is a common recipe.