§ 7 — Supervised Learning
Linear regression, logistic regression, SVMs, decision trees, random forests, gradient boosting, bias-variance decomposition, and cross-validation — the classical toolkit every ML engineer must know cold.
1. Overview
Supervised learning is the canonical ML paradigm: given a dataset of (x, y) pairs, find a function f that maps inputs to outputs with low generalisation error. The pipeline runs from raw data through feature engineering, model training, evaluation, and hyper-parameter tuning to deployment.
The model families covered here — linear, kernel, tree, ensemble — form the baseline any deep-learning practitioner must beat. Gradient boosting (XGBoost / LightGBM) still outperforms neural nets on structured/tabular data at most production companies. Understanding why — bias-variance tradeoffs, regularisation, ensemble mechanics — transfers directly to understanding transformers, LoRA, and dropout.
2. Linear Models
2.1 Linear Regression
Linear regression models the target as a weighted sum of features: ŷ = wTx + b. Minimising mean squared error (MSE) yields a closed-form solution via the normal equations:
w* = (X^T X)^{-1} X^T y # closed form — O(d^3) matrix inversion
# impractical for d > 10,000; use SGD insteadRegularisation penalises weight magnitude to prevent overfitting. Ridge adds an L2 penalty, which shrinks all weights; Lasso adds L1, which drives small weights exactly to zero (sparse solutions):
| Method | Penalty | Effect | Closed form? |
|---|---|---|---|
| OLS | none | MLE estimate | Yes |
| Ridge (L2) | λ‖w‖² | Shrink all uniformly; stable | Yes — (XᵀX+λI)⁻¹Xᵀy |
| Lasso (L1) | λ‖w‖₁ | Sparse weights; feature selection | No — coordinate descent |
| ElasticNet | αL1+(1-α)L2 | Sparse + stable; correlated features | No |
2.2 Logistic Regression
Logistic regression squashes the linear score through a sigmoid to produce a probability: P(y=1|x) = σ(wTx + b). Maximising the log-likelihood (MLE) is equivalent to minimising cross-entropy loss. The decision boundary wTx + b = 0 is always a hyperplane — logistic regression is a linear classifier.
∂L/∂w = (σ(wᵀx) - y) · x. The gradient has the same form as linear regression but with the sigmoid activation — the same pattern appears in transformer output layers.2.3 k-NN and Naive Bayes
k-Nearest Neighbors stores all training data and classifies by majority vote of the k closest examples. It is non-parametric with no training cost but O(N·d) inference cost — impractical beyond a few million points without approximate nearest-neighbour (ANN) structures like HNSW (which is also used in vector databases for RAG).
Naive Bayes applies Bayes rule assuming features are conditionally independent given the class: P(y|x) ∝ P(y) · ∏ P(xj|y). Despite its strong independence assumption it works well for text classification and is O(N·d) to train and O(d) to predict.
3. Support Vector Machines
SVMs find the hyperplane that maximises the geometric margin between classes. Only the training examples closest to the boundary — support vectors — determine the solution; all other points are irrelevant to the optimal hyperplane.
Kernel Trick
The dual SVM formulation depends only on inner products xiTxj. Replace them with a kernel function K(xi, xj) = φ(xi)Tφ(xj) without ever computing the (possibly infinite-dimensional) feature map φ(x). The RBF kernel is a universal approximator — it can represent any continuous target function given enough support vectors.
4. Decision Trees & Ensembles
4.1 Decision Trees
A decision tree recursively partitions the feature space by choosing the split that maximally reduces impurity. Two common impurity criteria:
| Criterion | Formula | Used by |
|---|---|---|
| Gini impurity | 1 - Σ pₖ² | sklearn default, CART |
| Information gain (entropy) | -Σ pₖ log pₖ | ID3, C4.5 |
| MSE (regression) | E[(y-ȳ)²] | All regression trees |
An unpruned tree of sufficient depth can memorise any training set (zero training error) but generalises poorly — it has near-zero bias but high variance. Pruning (post-pruning via cost-complexity, or pre-pruning via max-depth / min-samples-leaf) trades bias for variance.
4.2 Bagging → Random Forests
Bagging reduces variance by averaging predictions of many independently trained models on bootstrap samples. Random Forest adds random feature subsampling at each split (√p features for classification, p/3 for regression), which decorrelates the trees and further reduces variance without increasing bias.
4.3 Boosting — GBM / XGBoost / LightGBM
Boosting builds an additive model sequentially. At step m, a new weak learner h_m is fit to the negative gradient of the loss with respect to the current predictions — i.e., the residuals for MSE, but generalised to any differentiable loss. The ensemble is F_m(x) = F_(m-1)(x) + ν · h_m(x) where ν is the learning rate (shrinkage).
Core Mechanism — GBM Walkthrough
Background: we have 4 training examples with a numeric target and want to build a 3-tree GBM using MSE loss.
Plan:
- Initialise
F_0(x) = mean(y). - Compute pseudo-residuals
r_i = y_i - F_(m-1)(x_i). - Fit a depth-1 tree to the residuals.
- Update:
F_m = F_(m-1) + ν · h_m. - Repeat M times.
Walkthrough (initial state: y = [10, 20, 30, 40], ν = 0.5):
| Step | F(x) | Residuals | Tree pred | MSE |
|---|---|---|---|---|
| Init | [25, 25, 25, 25] | [-15, -5, 5, 15] | — | 125.0 |
| m=1 | [17.5, 22.5, 27.5, 32.5] | [-7.5, -2.5, 2.5, 7.5] | [-15, -5, 5, 15] | 31.25 |
| m=2 | [21.25, 23.75, 26.25, 28.75] | [-3.75, -1.25, 1.25, 3.75] | [-7.5, -2.5, 2.5, 7.5] | 7.81 |
Each iteration the MSE shrinks by factor ν² = 0.25 (for perfect depth-1 tree). XGBoost accelerates this by using the second-order Taylor expansion of the loss, giving a better leaf-value estimate than just fitting raw residuals.
5. Bias-Variance Decomposition
The expected test error of any estimator decomposes into three irreducible components. For squared loss:
E[(y - ŷ)²] = Bias²(ŷ) + Var(ŷ) + σ²_noise Bias² = how wrong the expected prediction is (systematic error) Var = how much predictions vary across datasets (estimation noise) σ² = irreducible label noise — lower bound of any model
Double Descent
The classical bias-variance U-curve predicted that test error would rise once model capacity exceeded the training-set size. In 2019, Belkin et al. showed that test error falls again past the interpolation threshold — a second descent. This "double descent" is now the canonical explanation for why massively over-parameterised neural nets generalise despite memorising training data.
Model Complexity vs Error (napkin table)
| Model | Bias | Variance | Regularisation lever |
|---|---|---|---|
| Depth-1 tree | High | Low | max_depth |
| Full-depth tree | Low | High | prune, min_leaf |
| Random Forest (100 trees) | Low | Low | n_estimators, max_features |
| GBM (low LR, many trees) | Low | Low–Med | LR, n_estimators, subsample |
| Linear reg (λ=0) | High (nonlinear data) | Low | Ridge λ, Lasso λ |
6. Cross-Validation & Regularisation
Cross-validation gives a nearly unbiased estimate of generalisation error by rotating through k non-overlapping validation folds. The gold standard is nested CV: an outer loop estimates generalisation; an inner loop tunes hyper-parameters. Collapsing the two loops causes over-optimistic performance estimates — a very common bug.
Data Leakage
Leakage means information about the test set contaminates the training pipeline. Common forms:
- Fitting a scaler or imputer on the full dataset before splitting — any statistic computed over all rows leaks test distribution into training.
- Future-leaking features in time-series (tomorrow's price in yesterday's row).
- Target encoding computed before fold split — each fold's validation target is baked into the encoding.
- Duplicated rows spanning train and test splits after deduplication was skipped.
7. Minimal Demos
Demo 1 — Decision Boundary: Logistic Regression vs Neural Net
Logistic regression cannot separate a spiral dataset (non-linear boundary) but a 2-layer neural net can. The same mechanism explains why linear attention variants underperform full softmax attention on tasks requiring non-linear mixing.
Demo 2 — Bias-Variance Decomposition via Polynomial Regression
We run B=100 bootstrap experiments for each polynomial degree, computing the empirical bias² and variance at each degree. The sweet spot (degrees 3–5) minimises total expected error on a sine-curve target.
8. Production / Source Pointers
| Concept | Library | Key file / function |
|---|---|---|
| Random Forest | scikit-learn | sklearn/ensemble/_forest.py · RandomForestClassifier |
| GBM (baseline) | scikit-learn | sklearn/ensemble/_gb.py · GradientBoostingClassifier._fit_stage |
| XGBoost tree growth | dmlc/xgboost | src/tree/updater_histmaker.cc · FindSplit |
| LightGBM leaf-wise | microsoft/LightGBM | src/treelearner/serial_tree_learner.cpp · FindBestSplit |
| Ridge regression | scikit-learn | sklearn/linear_model/_ridge.py · Ridge.fit |
| SVM (liblinear / libsvm) | scikit-learn | sklearn/svm/_base.py · SVC._fit; wraps libsvm dual solver |
9. References
Papers
- Breiman (2001) — Random Forests. Machine Learning 45(1): 5–32.
- Friedman (2001) — Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5): 1189–1232.
- Chen & Guestrin (2016) — XGBoost: A Scalable Tree Boosting System. arXiv:1603.02754.
- Ke et al. (2017) — LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS 2017.
- Prokhorenkova et al. (2018) — CatBoost: unbiased boosting with categorical features. NeurIPS 2018.
- Belkin et al. (2019) — Reconciling modern machine learning practice and the bias-variance trade-off. arXiv:1812.11118.
- Cortes & Vapnik (1995) — Support-vector networks. Machine Learning 20(3): 273–297.
- Geman et al. (1992) — Neural Networks and the Bias/Variance Dilemma. Neural Computation 4(1).
Lectures
- Stanford CS229 — Andrew Ng, supervised learning notes (linear reg, logistic reg, SVM, trees). Available at cs229.stanford.edu.
- CMU 10-701 — Tom Mitchell / Barnabás Póczos, Machine Learning lectures (bias-variance, ensembles).
- MIT 6.036 — Introduction to Machine Learning (cross-validation, regularisation).
- Caltech "Learning From Data" — Yaser Abu-Mostafa, lecture 8 (bias-variance) and lecture 14 (SVM). Available on YouTube / edX.
- fast.ai Practical DL — Jeremy Howard, lessons on gradient boosting for tabular data.
Textbooks
- Hastie, Tibshirani & Friedman — The Elements of Statistical Learning (free PDF, Chapters 2, 7, 8, 10, 15).
- Bishop — Pattern Recognition and Machine Learning (Chapters 3–4, 6, 14).
- Murphy — Probabilistic Machine Learning: An Introduction (free draft, Chapters 16–18).
- Vapnik — The Nature of Statistical Learning Theory.
Code / Repos
scikit-learn/scikit-learn— canonical Python ML library.dmlc/xgboost— XGBoost source and paper.microsoft/LightGBM— LightGBM source.catboost/catboost— CatBoost source.
Blog posts
- Sebastian Raschka — Machine Learning FAQ (bias-variance, model comparison).
- distill.pub — How to Use t-SNE Effectively.
- Lilian Weng — An Overview of Gradient Descent Optimization Algorithms.
- scikit-learn user guide — Ensemble methods (bagging, random forests, gradient boosting).
10. Interview Prep
- Derive the closed-form solution for ridge regression and explain why it always has a unique solution even when XTX is singular.
Adding λI makes the matrix (XTX + λI) strictly positive definite for any λ>0, so it is always invertible. The solution
w* = (XᵀX + λI)⁻¹Xᵀyis the Tikhonov regularised least-squares estimate. - Explain the kernel trick. Why is RBF a "universal approximator"?
The SVM dual only needs inner products between training examples. Replacing xiTxj with K(xi,xj) implicitly maps to a feature space where linear separation is possible, without computing φ(x). RBF corresponds to an infinite-dimensional feature space (via Mercer's theorem) that can express any continuous function, hence universal.
- What is the difference between bagging and boosting at the bias-variance level?
Bagging reduces variance (independent trees, averaged) without changing bias — each tree is fully grown (low bias). Boosting primarily reduces bias (each step corrects residuals) but can also reduce variance via shrinkage; it is more prone to overfitting on noisy labels.
- When would you use Random Forest vs XGBoost on tabular data?
RF is easier to tune (few hyper-params, robust defaults), faster to train in parallel, and does not overfit badly with extra trees. XGBoost usually achieves lower test error with tuning (especially on competition datasets) but requires careful LR + regularisation tuning. For a quick baseline: RF. For max performance with budget to tune: XGBoost/LightGBM.
- Describe data leakage and give a concrete example.
Leakage: test-set information flows into the training pipeline. Example: standardising features (subtract mean, divide std) computed over the full dataset before train/test split — the test mean is baked into training, making validation metrics overly optimistic. Fix: put all preprocessing inside a Pipeline / fit on train only.
- What is double descent? How does it relate to transformer training?
Belkin et al. 2019: as model size increases past the interpolation threshold (zero training loss), test error rises then falls again. Transformers trained with billions of parameters are deep in the "second descent" regime — they memorise training data yet generalise because SGD noise + weight decay act as implicit regularisers. This is why scaling laws show monotone improvement with parameter count.
- Walk through gradient boosting for one step using MSE loss. What does XGBoost add?
Negative gradient of MSE = residuals r_i = y_i - F(x_i). Fit a tree to residuals, update F += ν·tree. XGBoost adds: (1) second-order Taylor expansion for better leaf-value estimate; (2) L1/L2 regularisation on leaf scores; (3) column subsampling; (4) GPU histogram-based split finding — making it orders of magnitude faster.
- You have an imbalanced binary classification problem (1% positives). Which metrics do you use and why?
Accuracy is misleading (99% accuracy by predicting all-negative). Prefer Precision-Recall AUC (focuses on positive class performance), F1 (harmonic mean of P and R), or Brier score (proper scoring rule for calibration). ROC-AUC is less sensitive to class imbalance than PR-AUC for rare positives. If cost of false negatives ≫ false positives, optimise recall at a fixed precision threshold.