§ 11 Convolutional Networks
Convolution math, receptive fields, architecture lineage from LeNet to ConvNeXt, residual connections, depthwise-separable ops, and transfer learning — the foundation that dominated computer vision for a decade and still shapes modern architectures.
1. Overview
A convolutional neural network (CNN) exploits three structural priors of image data: local connectivity (nearby pixels are correlated), weight sharing (the same edge detector is useful everywhere in the image), and translation equivariance (sliding the filter across the image detects the same feature wherever it appears). These priors reduce parameters by orders of magnitude compared to a fully-connected network while encoding strong inductive biases that generalize well from limited data.
A modern CNN backbone follows a stage-by-stage pipeline: a stem layer reduces the spatial resolution immediately (by 4× in ResNet and ConvNeXt), then repeated blocks in each stage gradually increase channel depth while decreasing spatial size. A global average pool collapses the spatial dimensions before a linear classifier.
2. Convolution Math
For a 2-D feature map of shape [C_in, H_in, W_in], a kernel of shape [C_out, C_in, K, K], stride S, padding P, and dilation D, the output spatial size is:
H_out = floor( (H_in + 2·P − D·(K−1) − 1) / S + 1 )
W_out = floor( (W_in + 2·P − D·(K−1) − 1) / S + 1 )
Parameter count (no bias): C_out × C_in × K × K
Multiply-adds (MACs): C_out × C_in × K × K × H_out × W_out
FLOPs (counting each multiply-add as two operations): 2 × C_out × C_in × K × K × H_out × W_out
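The formulas above can be checked with a few lines of plain Python. This is a minimal sketch, not a library API: the function names `conv_output_size` and `conv_cost` are made up for illustration, and the example layer (7×7, stride 2, padding 3, 3 → 64 channels on a 224×224 input) is a ResNet-style stem chosen as an assumption.

```python
def conv_output_size(size_in, k, s=1, p=0, d=1):
    """Output length along one spatial axis (integer floor division)."""
    return (size_in + 2 * p - d * (k - 1) - 1) // s + 1


def conv_cost(c_in, c_out, k, h_out, w_out):
    """Return (params, macs, flops) for a bias-free K x K convolution."""
    params = c_out * c_in * k * k
    macs = params * h_out * w_out   # one multiply-add per weight per output position
    flops = 2 * macs                # a multiply-add counted as two operations
    return params, macs, flops


# Assumed example: ResNet-style stem on a 224x224 RGB image.
h_out = conv_output_size(224, k=7, s=2, p=3)
params, macs, flops = conv_cost(c_in=3, c_out=64, k=7, h_out=h_out, w_out=h_out)
print(h_out, params, macs, flops)
```

Integer floor division is used instead of `math.floor` on a float so the result stays exact for large sizes.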
Four Key Parameters
| Param | Default | Effect on output size | Common usage |
|---|---|---|---|
| K | 3 | Shrinks output by K−1 in total ((K−1)/2 per side) when P=0, S=1 | 3×3 dominates (two stacked 3×3 convs cover the same RF as one 5×5 with fewer parameters) |
| S | 1 | Divides spatial size by S (approximately) | S=2 to downsample (replaces max-pool in modern nets) |
| P | 0 | P=(K−1)/2 gives "same" padding for odd K with S=1, D=1 (output = input size) | Same-padding inside a stage so blocks preserve H×W; P=0 ("valid") where shrinking is acceptable |
| D | 1 | Inserts D−1 zeros between kernel elements; expands RF without more params | Semantic segmentation (DeepLab), audio (WaveNet) |
3. Receptive Field
The receptive field (RF) of a neuron is the set of input pixels that can influence its value. For a stack of stride-1 convolutions with kernel size K:
RF after L layers = 1 + L · (K − 1)    # stride 1 only
Examples (K=3, S=1):
- 5 layers → RF = 11
- 10 layers → RF = 21
- 50 layers → RF = 101 (still less than half the width of a 224-pixel input)
This linear growth is surprisingly slow. A stride-2 operation (pool or strided conv) doubles the amount of RF that every subsequent layer adds, because each step on the downsampled grid spans two input pixels. That is why modern architectures downsample aggressively in the stem, then use same-padding blocks: the stem stride earns a large RF cheaply.
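The effect of stride on receptive-field growth can be sketched with the usual recurrence: each layer adds (K − 1) times the current "jump" (the input-pixel distance between neighbouring outputs), and each stride multiplies the jump. The function name `receptive_field` and the example stacks are illustrative assumptions.

```python
def receptive_field(layers):
    """Theoretical RF for a stack of (kernel_size, stride) layers, input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # widen by (k-1) steps, each `jump` input pixels apart
        jump *= s              # later layers step over more input pixels
    return rf


plain = receptive_field([(3, 1)] * 10)                 # ten stride-1 3x3 convs
strided = receptive_field([(3, 2)] + [(3, 1)] * 9)     # same depth, first layer stride 2
print(plain, strided)
```

With the first layer strided, the nine following layers each add 4 pixels instead of 2, so the same depth covers almost twice the input.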
The effective receptive field grows as O(√L) rather than O(L) (Luo et al. 2016). A large theoretical RF does not guarantee the network actually uses distant context.
4. Architecture Evolution
Each generation of ImageNet-era CNN architectures solved a concrete problem the previous generation left open, so the lineage below reads as a sequence of incremental fixes rather than competing schools:
| Architecture | Year | Params | IN-1k top-1 | Key innovation |
|---|---|---|---|---|
| LeNet-5 | 1998 | 60K | 99% (MNIST) | Conv→Pool pattern; end-to-end learning |
| AlexNet | 2012 | 62M | 57.1% | ReLU, Dropout, GPU training, data augmentation |
| VGG-16 | 2014 | 138M | 71.5% | Deep stack of 3×3 convs replaces large kernels |
| Inception v1 | 2014 | 6.8M | 69.8% | Parallel 1×1/3×3/5×5 branches; bottleneck |
| ResNet-50 | 2015 | 25M | 76.0% | Skip connections enable 100+ layer training |
| DenseNet-121 | 2016 | 7.9M | 74.9% | Each layer receives all prior feature maps |
| MobileNetV2 | 2018 | 3.4M | 71.8% | Depthwise-sep + inverted residuals |
| EfficientNet-B0 | 2019 | 5.3M | 77.1% | NAS-designed baseline; compound scaling of depth/width/resolution |
| ConvNeXt-T | 2022 | 28M | 82.1% | Transformer design choices (LN, GELU, large kernel) in pure ConvNet |
What ConvNeXt Borrowed from Transformers
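Liu et al. 2022 (arXiv:2201.03545) performed a systematic ablation starting from ResNet-50 and applying Swin Transformer design choices one-by-one. Five of the steps, with the ImageNet-1k top-1 change the paper reports for each:
- Patchify stem (4×4 non-overlapping conv, stride 4) instead of 7×7 stride-2 conv plus max-pool → +0.1%
- Depthwise 7×7 conv (kernel enlarged from 3×3 after moving the depthwise conv to the top of the block) → +0.7%
- Inverted bottleneck (4× channel expansion in FFN-like MLP) → +0.1%
- Replace ReLU with GELU → no change
- Replace BatchNorm with LayerNorm → +0.1%
The largest single gains came from other steps in the same ablation, such as the ResNeXt-style switch to wider depthwise convs (+1.0%) and using fewer activation functions (+0.7%).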
5. ResNet — Core Mechanism
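Background: Before ResNet, adding more layers to VGG-style nets did not improve accuracy; past a certain depth it made accuracy worse, even on the training set. This was not overfitting but an optimisation failure. The identity mapping (copy input to output) is a trivially valid solution for surplus layers, so a deeper net should never do worse than a shallower one, yet gradient descent failed to find it in plain networks. The intuition used in this section is that the gradient signal degrades as it is multiplied through many stacked layers before reaching the early ones. He et al. note that batch-normalised plain nets still show healthy gradient norms, so treat this as an intuition rather than a complete account.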
Plan:
- Reformulate the learning goal: instead of learning the desired mapping H(x), let each block learn the residual F(x) = H(x) − x.
- Add an identity shortcut: output = F(x) + x.
- If the block is useless, set F(x) = 0 (much easier than learning the identity).
- The shortcut creates a direct gradient path: the block Jacobian becomes ∂F/∂x + I, so ∂L/∂x always keeps an identity term that is not attenuated by the block's weights, however deep the network.
Walkthrough: gradient flow in a 152-layer ResNet
Initial conditions: ResNet-152 with 50 residual blocks, each consisting of a 3-layer bottleneck (1×1→3×3→1×1). Assume a unit loss gradient at the output.
| Step | What happens | Plain net (no skip) | ResNet (with skip) |
|---|---|---|---|
| 1 | Backward through block 50 (last) | g ← g · ∂F/∂x (multiply by Jacobian) | g ← g · (∂F/∂x + I) (identity adds 1) |
| 2 | After 50 blocks | g shrinks exponentially if ‖∂F/∂x‖ < 1 | g stays near 1 via identity term — gradient highway |
| 3 | Practical effect | Layer 1 receives gradient ≈ 0; does not learn | All layers receive meaningful gradient; 152 layers converge |
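The table can be mimicked with a scalar toy model in which each block's Jacobian is a single number. Everything numeric here is a hypothetical choice for illustration: a plain block that scales the gradient by 0.8, and residual branches whose derivatives are small and alternate in sign (±0.05). Real Jacobians are matrices and the residual product can also grow, so this is only a sketch of the mechanism.

```python
def backprop_gain(branch_derivs, skip):
    """Multiply a unit gradient backward through scalar 'blocks'.

    With skip=True each block contributes (1 + dF/dx); without it, just dF/dx.
    """
    g = 1.0
    for d in reversed(branch_derivs):
        g *= (1.0 + d) if skip else d
    return g


n_blocks = 50  # ResNet-152 has 50 bottleneck blocks

# Hypothetical values, chosen only to illustrate the two regimes.
plain = backprop_gain([0.8] * n_blocks, skip=False)
residual = backprop_gain([0.05, -0.05] * (n_blocks // 2), skip=True)

print(f"plain net gradient at layer 1:    {plain:.2e}")
print(f"residual net gradient at layer 1: {residual:.3f}")
```

The plain product collapses by several orders of magnitude, while the residual product stays close to 1 because every factor is a small perturbation of the identity.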
Projection Shortcut (when dimensions change)
When a block changes spatial size (stride 2) or channel depth, the identity shortcut cannot be added directly. A 1×1 projection conv is used: x_proj = W_s · x (1×1 conv, stride 2 if needed). He et al. show that this "option B" projection shortcut adds only a marginal improvement over zero-padding; the key is the skip connection itself, not the projection.
6. Special Operations
1×1 Convolution
A 1×1 conv is a per-pixel linear combination of channels — it does not look at spatial neighbors. Uses: (1) channel projection to increase or reduce C before an expensive 3×3; (2) bottleneck in ResNet and Inception to cut FLOPs; (3) mixing information across channels without touching spatial structure. Equivalent to a fully-connected layer applied identically at every spatial position.
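To make the "fully-connected layer at every pixel" view concrete, here is a dependency-free sketch of a 1×1 convolution over nested lists. The helper name `conv1x1` and the tiny 2-channel input are assumptions for illustration; a real implementation would use a tensor library.

```python
def conv1x1(x, weight):
    """x: [C_in][H][W] nested lists; weight: [C_out][C_in]. Returns [C_out][H][W]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [sum(row[c] * x[c][i][j] for c in range(c_in)) for j in range(w)]
            for i in range(h)
        ]
        for row in weight
    ]


x = [
    [[1, 2], [3, 4]],        # channel 0
    [[10, 20], [30, 40]],    # channel 1
]
weight = [
    [1, 1],    # output channel 0: sum of the two input channels
    [1, -1],   # output channel 1: difference of the two input channels
]
y = conv1x1(x, weight)
print(y)
```

Each output pixel depends only on the input pixel at the same (i, j) position, which is why the spatial structure is untouched.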
Pooling
Max pooling keeps the strongest activation in each K×K window, which is useful for detecting whether a feature exists. Average pooling keeps the mean, which is useful when feature density matters. Global average pooling (GAP) collapses the entire H×W spatial map to a single value per channel, eliminating spatial location entirely. GAP at the network tail replaced the large FC layers of AlexNet/VGG: VGG-16's three FC layers hold roughly 124M of its 138M parameters, whereas ResNet-50's single post-GAP classifier has about 2M. Averaging over positions also adds implicit regularisation.
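The three pooling variants differ only in the reduction and the window. The sketch below assumes non-overlapping windows (stride equal to K) and uses made-up helper names (`pool2d`, `mean`, `global_avg_pool`) on a small hand-written feature map.

```python
def mean(values):
    return sum(values) / len(values)


def pool2d(x, k, reduce):
    """Non-overlapping k x k pooling (stride = k) over a 2-D list."""
    h, w = len(x), len(x[0])
    return [
        [
            reduce([x[i + di][j + dj] for di in range(k) for dj in range(k)])
            for j in range(0, w - k + 1, k)
        ]
        for i in range(0, h - k + 1, k)
    ]


def global_avg_pool(feature_maps):
    """[C][H][W] -> [C]: one mean per channel, spatial location discarded."""
    return [mean([v for row in fmap for v in row]) for fmap in feature_maps]


fmap = [
    [1, 3, 2, 0],
    [5, 7, 4, 2],
    [0, 0, 8, 8],
    [0, 4, 8, 8],
]
print(pool2d(fmap, 2, max))      # [[7, 4], [4, 8]]
print(pool2d(fmap, 2, mean))     # [[4.0, 2.0], [1.0, 8.0]]
print(global_avg_pool([fmap]))   # [3.75]
```

Note how the bottom-left window (a single 4 among zeros) survives max pooling at full strength but is diluted to 1.0 by average pooling.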
Depthwise-Separable Convolution
Factorize a standard conv into two steps: a depthwise conv (one K×K filter per channel, no cross-channel mixing) followed by a pointwise 1×1 conv (mix channels, no spatial info). The parameter count drops from K²·C_in·C_out to K²·C_in + C_in·C_out. For K=3, C_in=C_out=256, the reduction is roughly 8.7×.
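The parameter arithmetic is easy to verify directly. This is a small sketch with illustrative function names; biases are ignored, as in the formula above.

```python
def standard_conv_params(c_in, c_out, k):
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    depthwise = k * k * c_in    # one k x k filter per input channel
    pointwise = c_in * c_out    # 1x1 conv that mixes channels
    return depthwise + pointwise


std = standard_conv_params(256, 256, 3)
sep = depthwise_separable_params(256, 256, 3)
print(std, sep, round(std / sep, 2))   # 589824 67840 8.69
```

The ratio of separable to standard parameters is 1/C_out + 1/K², so for wide layers the saving approaches K² (9× for a 3×3 kernel).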
Group Convolution
Divide C_in channels into G groups; each group has its own filter bank. Each output channel sees only C_in/G input channels. Depthwise conv is the extreme case G = C_in. ResNeXt (arXiv:1611.05431) showed that increasing groups (with fixed total compute) improves accuracy — the "cardinality" dimension.
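A short sketch of how the group count changes the weight budget. The function name `group_conv_params` is illustrative, and the divisibility check mirrors the usual requirement that G divide both channel counts.

```python
def group_conv_params(c_in, c_out, k, groups):
    """Bias-free parameter count of a grouped K x K convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both c_in and c_out")
    # Every output channel only sees c_in / groups input channels.
    return c_out * (c_in // groups) * k * k


for g in (1, 32, 256):
    print(g, group_conv_params(256, 256, 3, groups=g))
```

G = 1 reproduces the standard conv count, and G = C_in = 256 reproduces the depthwise count of K²·C_in.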
7. Minimal Demo — Convolution Playground
Change K, S, P, and D at the top, re-run, and observe: (1) how output spatial size changes, (2) what the Sobel-x edge detector returns on a gradient input, (3) how slowly receptive field grows with stacked stride-1 convs, (4) how many parameters depthwise-separable saves.
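A minimal, dependency-free sketch of such a playground is given below. It covers items (1) and (2): output size and the Sobel-x response on a horizontal ramp. K is fixed at 3 by the Sobel kernel here, so only S, P, and D are tunable. The names `conv2d_single` and `ramp` are illustrative assumptions, not part of any library.

```python
S, P, D = 1, 0, 1   # stride, padding, dilation: change these and re-run

SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def conv2d_single(image, kernel, stride=1, padding=0, dilation=1):
    """Naive single-channel cross-correlation (the 'convolution' of deep learning)."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    # Zero-pad the image on all four sides.
    padded = [[0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for i in range(h):
        for j in range(w):
            padded[i + padding][j + padding] = image[i][j]
    span = dilation * (k - 1) + 1                  # effective kernel extent
    h_out = (h + 2 * padding - span) // stride + 1
    w_out = (w + 2 * padding - span) // stride + 1
    out = []
    for oi in range(h_out):
        row = []
        for oj in range(w_out):
            acc = 0
            for a in range(k):
                for b in range(k):
                    acc += kernel[a][b] * padded[oi * stride + a * dilation][oj * stride + b * dilation]
            row.append(acc)
        out.append(row)
    return out


# Horizontal ramp: intensity equals the column index, so the x-gradient is constant.
ramp = [[col for col in range(6)] for _ in range(6)]
edges = conv2d_single(ramp, SOBEL_X, stride=S, padding=P, dilation=D)
print("output size:", len(edges), "x", len(edges[0]) if edges else 0)
for row in edges:
    print(row)
```

With the defaults the response is the same at every position, since a linear ramp has a constant horizontal gradient. Adding padding introduces different values at the borders, where the kernel overlaps the zero padding.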
8. Transfer Learning
A CNN pre-trained on ImageNet-21k or ImageNet-1k has learned a rich hierarchy of features: edges → textures → parts → objects. These features transfer broadly because the low and mid-level filters match natural image statistics regardless of the downstream task.
Standard Recipe
- Load pretrained backbone — e.g. timm.create_model('convnext_tiny', pretrained=True).
- Replace the head — swap the 1000-class linear layer for a new one matching your class count.
- Freeze backbone, train head first (5–10 epochs at LR=1e-3) — prevents early gradient noise from destroying pretrained features.
- Unfreeze backbone, fine-tune at low LR (1e-5 to 1e-4, often with layer-wise LR decay) — early layers need smaller updates.
- Use label smoothing + mixup — regularised fine-tuning often outperforms plain cross-entropy, especially on small datasets.
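The two-phase schedule with layer-wise LR decay can be sketched without any framework. The function below is a hypothetical helper, not a timm or PyTorch API: it returns one learning rate per backbone layer (input-most first) plus the head LR. The defaults (5 head-only epochs, head LR 1e-3, top backbone LR 1e-4, decay 0.9 per layer) are assumptions picked from the ranges in the recipe, and keeping the head LR unchanged after unfreezing is a simplification.

```python
def finetune_lrs(epoch, num_layers, head_epochs=5, head_lr=1e-3,
                 backbone_lr=1e-4, decay=0.9):
    """Return (backbone_lrs, head_lr) for a given epoch.

    Phase 1 (epoch < head_epochs): backbone frozen, only the head trains.
    Phase 2: backbone unfrozen; the last layer gets backbone_lr and each
    earlier layer is scaled down by `decay` once more.
    """
    if epoch < head_epochs:
        backbone = [0.0] * num_layers
    else:
        backbone = [backbone_lr * decay ** (num_layers - 1 - i)
                    for i in range(num_layers)]
    return backbone, head_lr


for epoch in (0, 5):
    backbone, head = finetune_lrs(epoch, num_layers=4)
    print(epoch, [f"{lr:.2e}" for lr in backbone], head)
```

In a real training loop these values would typically be assigned to optimizer parameter groups, one group per layer or per stage.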
9. Production / Source Pointers
| Concept | File / function |
|---|---|
| ResNet implementation | torchvision/models/resnet.py — BasicBlock, Bottleneck, ResNet._make_layer |
| Depthwise-sep conv | torchvision/models/mobilenetv2.py — InvertedResidual |
| EfficientNet compound scaling | torchvision/models/efficientnet.py — _efficientnet |
| ConvNeXt block | torchvision/models/convnext.py — CNBlock (uses LayerNorm + GELU) |
| timm (1000+ pretrained CNNs) | github.com/huggingface/pytorch-image-models — timm/models/ |
| F.conv2d signature | torch/nn/functional.py — conv2d(input, weight, bias, stride, padding, dilation, groups) |
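| CUDA conv kernel dispatch | aten/src/ATen/native/Convolution.cpp — at::native::_convolution (backend chosen by select_conv_backend; convolution_overrideable is the hook for out-of-tree backends) |
| cuDNN autotuner | torch/backends/cudnn — torch.backends.cudnn.benchmark = True selects the fastest algorithm per input shape |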
10. References
Papers
- LeCun et al. 1998 — Gradient-Based Learning Applied to Document Recognition (LeNet-5)
- Krizhevsky, Sutskever, Hinton 2012 — ImageNet Classification with Deep CNNs (AlexNet; NeurIPS 2012)
- Simonyan & Zisserman 2014 — Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG; arXiv:1409.1556)
- Szegedy et al. 2014 — Going Deeper with Convolutions (Inception; arXiv:1409.4842)
- He et al. 2015 — Deep Residual Learning for Image Recognition (ResNet; arXiv:1512.03385)
- Huang et al. 2016 — Densely Connected Convolutional Networks (DenseNet; arXiv:1608.06993)
- Sandler et al. 2018 — MobileNetV2: Inverted Residuals and Linear Bottlenecks (arXiv:1801.04381)
- Tan & Le 2019 — EfficientNet: Rethinking Model Scaling for CNNs (arXiv:1905.11946)
- Xie et al. 2017 — Aggregated Residual Transformations for Deep Neural Networks (ResNeXt; arXiv:1611.05431)
- Liu et al. 2022 — A ConvNet for the 2020s (ConvNeXt; arXiv:2201.03545)
- Luo et al. 2016 — Understanding the Effective Receptive Field in Deep CNNs (arXiv:1701.04128)
- Howard et al. 2017 — MobileNets: Efficient CNNs for Mobile Vision Applications (arXiv:1704.04861)
Lectures
- Stanford CS231n — Convolutional Neural Networks for Visual Recognition (Karpathy, Johnson, Fei-Fei Li)
- MIT 6.S191 — Lecture 3: Convolutional Neural Networks (Alexander Amini)
- NYU DS-GA 1008 — Yann LeCun: Convolutional Networks, Architectures, and Applications
- fast.ai Practical Deep Learning — Lesson 1–5: ResNets, transfer learning, data augmentation
- Stanford CS230 — Deep Learning (Andrew Ng): CNNs module
Textbooks
- Goodfellow, Bengio, Courville — Deep Learning (MIT Press, free at deeplearningbook.org) — Chapter 9
- Zhang et al. — Dive Into Deep Learning (d2l.ai, free) — Chapters 6–8
- Prince — Understanding Deep Learning (free at udlbook.github.io) — Chapter 10
Code / Repos
- pytorch/vision — torchvision model zoo (ResNet, EfficientNet, ConvNeXt)
- huggingface/pytorch-image-models (timm) — 700+ pretrained CNN/ViT models
- facebookresearch/ConvNeXt — original ConvNeXt implementation
Blog Posts
- distill.pub — Feature Visualization (Olah et al. 2017): what CNN neurons actually detect
- distill.pub — Building Blocks of Interpretability (Olah et al. 2018)
- Karpathy blog — CS231n Convolutional Neural Networks for Visual Recognition (notes, freely accessible)
- Lilian Weng — Object Detection for Dummies parts 1–4: CNN backbones in detection context
11. Interview Prep
Q1. Given an input of size 224×224, a kernel of 3×3, stride 2, and no padding, what is the output spatial size?
H_out = floor((224 + 0 − 1·(3−1) − 1) / 2 + 1) = floor(221/2 + 1) = floor(111.5) = 111. Output: 111×111.
Q2. Why did plain VGG-style nets saturate at ~19 layers while ResNet trained 152+ layers successfully?
The identity skip connection in ResNet provides a direct gradient highway: the per-block Jacobian becomes ∂F/∂x + I, so ∂L/∂x always contains a term that bypasses the block's weights, which counteracts vanishing gradients as depth grows. In plain nets, the gradient must pass through every nonlinear transformation; if the layer Jacobians consistently have singular values below 1, the gradient shrinks exponentially with depth.
Q3. What is the parameter count of a 3×3 depthwise-separable conv replacing a standard conv from C_in=256 to C_out=256?
Standard: 3×3×256×256 = 589,824. Depthwise: 3×3×256 = 2,304. Pointwise: 256×256 = 65,536. Total sep: 67,840 — about 8.7× fewer.
Q4. A 7-layer stack of 3×3 stride-1 convs: what is the theoretical receptive field? What is the practical effective RF?
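Theoretical RF = 1 + 7×(3−1) = 15, i.e. 15×15 pixels. The effective RF (ERF) is smaller: Luo et al. 2016 show that pixel influence follows an approximately Gaussian profile dominated by the centre, whose width grows roughly as O(√L) rather than O(L). For L=7, a rough estimate puts most of the influence within the central 7–9 pixels.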
Q5. Explain why global average pooling replaced the large FC layers of AlexNet/VGG.
GAP collapses [B, C, H, W] → [B, C] with zero parameters (vs. FC which needs C·H·W×num_classes). It also acts as strong regularization (implicit averaging over spatial positions), improves spatial robustness, and allows the network to accept variable input sizes.
Q6. What did EfficientNet's compound scaling discover?
Scaling depth, width, and resolution simultaneously (compound coefficient φ with fixed ratios α, β, γ) outperforms scaling any single dimension alone. The baseline architecture (EfficientNet-B0) was found via NAS, and the ratios α, β, γ were then fixed by a small grid search on that baseline. EfficientNet-B7 achieved SOTA with 8.4× fewer parameters than GPipe.
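The compound-scaling rule can be written out in a few lines. The constants α = 1.2, β = 1.1, γ = 1.15 are the values reported by Tan & Le 2019 and are not stated elsewhere in this section; the function name `compound_scale` is illustrative. FLOPs scale roughly with depth × width² × resolution², which is why the paper constrains α·β²·γ² to be about 2.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution bases (Tan & Le 2019)


def compound_scale(phi):
    """Multipliers relative to the baseline for compound coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi   # roughly 2 ** phi
    return depth, width, resolution, flops


for phi in (0, 1, 2, 3):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Each increment of φ therefore costs about twice the compute, split across all three dimensions rather than spent on one.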
Q7. How does a dilated (atrous) conv grow the receptive field differently from stacked standard convs?
A standard 3×3 conv adds K−1=2 to the RF per layer (linear). A dilated conv with dilation D inserts D−1 zeros between kernel elements, effectively making the kernel size K_eff = K + (K−1)(D−1), which is 2D+1 for K=3. Using dilations 1,2,4,8 in successive layers grows RF as 3,7,15,31: exponential growth with the same parameter count, versus 3,5,7,9 for four standard layers.
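The RF sequence in this answer can be reproduced with a short stride-1 calculation. The helper names `effective_kernel` and `stacked_rf` are illustrative.

```python
def effective_kernel(k, d):
    """Extent covered by a k-tap kernel with dilation d."""
    return k + (k - 1) * (d - 1)


def stacked_rf(k, dilations):
    """RF after each stride-1 layer in a stack with the given dilations."""
    rf, history = 1, []
    for d in dilations:
        rf += effective_kernel(k, d) - 1
        history.append(rf)
    return history


print(stacked_rf(3, [1, 2, 4, 8]))   # [3, 7, 15, 31]
print(stacked_rf(3, [1, 1, 1, 1]))   # [3, 5, 7, 9]
```

Both stacks use four 3×3 kernels, so the parameter count is identical; only the spacing of the taps differs.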
Q8. When fine-tuning a pretrained ResNet on a small custom dataset, why do you freeze early layers and use lower LR for the backbone?
Early layers encode low-level features (edges, colors) shared across most visual tasks, so overwriting them wastes pretraining. The new head has random weights with large gradient norms; those gradients would overwrite learned features if the backbone LR is high. Layer-wise LR decay (e.g. 0.9× per layer from output to input) is a common recipe.