Part III — Classical ML

§ 7 — Supervised Learning

Linear regression, logistic regression, SVMs, decision trees, random forests, gradient boosting, bias-variance decomposition, and cross-validation — the classical toolkit every ML engineer must know cold.

1. Overview

Supervised learning is the canonical ML paradigm: given a dataset of (x, y) pairs, find a function f that maps inputs to outputs with low generalisation error. The pipeline runs from raw data through feature engineering, model training, evaluation, and hyper-parameter tuning to deployment.

The model families covered here — linear, kernel, tree, ensemble — form the baseline any deep-learning practitioner must beat. Gradient boosting (XGBoost / LightGBM) still outperforms neural nets on structured/tabular data at most production companies. Understanding why — bias-variance tradeoffs, regularisation, ensemble mechanics — transfers directly to understanding transformers, LoRA, and dropout.

2. Linear Models

2.1 Linear Regression

Linear regression models the target as a weighted sum of features: ŷ = w^Tx + b. Minimising mean squared error (MSE) yields a closed-form solution via the normal equations:

w* = (X^T X)^{-1} X^T y       # closed form — O(d^3) matrix inversion
                                # impractical for d > 10,000; use SGD instead

Regularisation penalises weight magnitude to prevent overfitting. Ridge adds an L2 penalty, which shrinks all weights; Lasso adds L1, which drives small weights exactly to zero (sparse solutions):

Method	Penalty	Effect	Closed form?
OLS	none	MLE estimate	Yes
Ridge (L2)	`λ‖w‖²`	Shrink all uniformly; stable	Yes — `(XᵀX+λI)⁻¹Xᵀy`
Lasso (L1)	`λ‖w‖₁`	Sparse weights; feature selection	No — coordinate descent
ElasticNet	`αL1+(1-α)L2`	Sparse + stable; correlated features	No

2.2 Logistic Regression

Logistic regression squashes the linear score through a sigmoid to produce a probability: P(y=1|x) = σ(w^Tx + b). Maximising the log-likelihood (MLE) is equivalent to minimising cross-entropy loss. The decision boundary w^Tx + b = 0 is always a hyperplane — logistic regression is a linear classifier.

Gradient derivation (useful interview question): for a single example, ∂L/∂w = (σ(wᵀx) - y) · x. The gradient has the same form as linear regression but with the sigmoid activation — the same pattern appears in transformer output layers.

2.3 k-NN and Naive Bayes

k-Nearest Neighbors stores all training data and classifies by majority vote of the k closest examples. It is non-parametric with no training cost but O(N·d) inference cost — impractical beyond a few million points without approximate nearest-neighbour (ANN) structures like HNSW (which is also used in vector databases for RAG).

Naive Bayes applies Bayes rule assuming features are conditionally independent given the class: P(y|x) ∝ P(y) · ∏ P(x_j|y). Despite its strong independence assumption it works well for text classification and is O(N·d) to train and O(d) to predict.

3. Support Vector Machines

SVMs find the hyperplane that maximises the geometric margin between classes. Only the training examples closest to the boundary — support vectors — determine the solution; all other points are irrelevant to the optimal hyperplane.

Kernel Trick

The dual SVM formulation depends only on inner products x_i^Tx_j. Replace them with a kernel function K(x_i, x_j) = φ(x_i)^Tφ(x_j) without ever computing the (possibly infinite-dimensional) feature map φ(x). The RBF kernel is a universal approximator — it can represent any continuous target function given enough support vectors.

Scaling caveat: the SVM dual has O(N²) to O(N³) training complexity. SVMs are rarely used beyond ~10 M examples today; gradient-boosted trees and neural nets scale better. Kernels remain important conceptually — the NTK (Neural Tangent Kernel) analysis of infinite-width neural nets builds directly on kernel theory.

4. Decision Trees & Ensembles

4.1 Decision Trees

A decision tree recursively partitions the feature space by choosing the split that maximally reduces impurity. Two common impurity criteria:

Criterion	Formula	Used by
Gini impurity	`1 - Σ pₖ²`	sklearn default, CART
Information gain (entropy)	`-Σ pₖ log pₖ`	ID3, C4.5
MSE (regression)	`E[(y-ȳ)²]`	All regression trees

An unpruned tree of sufficient depth can memorise any training set (zero training error) but generalises poorly — it has near-zero bias but high variance. Pruning (post-pruning via cost-complexity, or pre-pruning via max-depth / min-samples-leaf) trades bias for variance.

4.2 Bagging → Random Forests

Bagging reduces variance by averaging predictions of many independently trained models on bootstrap samples. Random Forest adds random feature subsampling at each split (√p features for classification, p/3 for regression), which decorrelates the trees and further reduces variance without increasing bias.

4.3 Boosting — GBM / XGBoost / LightGBM

Boosting builds an additive model sequentially. At step m, a new weak learner h_m is fit to the negative gradient of the loss with respect to the current predictions — i.e., the residuals for MSE, but generalised to any differentiable loss. The ensemble is F_m(x) = F_(m-1)(x) + ν · h_m(x) where ν is the learning rate (shrinkage).

Core Mechanism — GBM Walkthrough

Background: we have 4 training examples with a numeric target and want to build a 3-tree GBM using MSE loss.

Plan:

Initialise F_0(x) = mean(y).
Compute pseudo-residuals r_i = y_i - F_(m-1)(x_i).
Fit a depth-1 tree to the residuals.
Update: F_m = F_(m-1) + ν · h_m.
Repeat M times.

Walkthrough (initial state: y = [10, 20, 30, 40], ν = 0.5):

Step	F(x)	Residuals	Tree pred	MSE
Init	[25, 25, 25, 25]	[-15, -5, 5, 15]	—	125.0
m=1	[17.5, 22.5, 27.5, 32.5]	[-7.5, -2.5, 2.5, 7.5]	[-15, -5, 5, 15]	31.25
m=2	[21.25, 23.75, 26.25, 28.75]	[-3.75, -1.25, 1.25, 3.75]	[-7.5, -2.5, 2.5, 7.5]	7.81

Each iteration the MSE shrinks by factor ν² = 0.25 (for perfect depth-1 tree). XGBoost accelerates this by using the second-order Taylor expansion of the loss, giving a better leaf-value estimate than just fitting raw residuals.

5. Bias-Variance Decomposition

The expected test error of any estimator decomposes into three irreducible components. For squared loss:

E[(y - ŷ)²] = Bias²(ŷ) + Var(ŷ) + σ²_noise

Bias²  = how wrong the expected prediction is  (systematic error)
Var    = how much predictions vary across datasets  (estimation noise)
σ²     = irreducible label noise — lower bound of any model

Double Descent

The classical bias-variance U-curve predicted that test error would rise once model capacity exceeded the training-set size. In 2019, Belkin et al. showed that test error falls again past the interpolation threshold — a second descent. This "double descent" is now the canonical explanation for why massively over-parameterised neural nets generalise despite memorising training data.

Practical rule: for classical models (trees, linear), use the U-curve — tune complexity with cross-validation. For neural networks, massive over-parameterisation often outperforms the classical optimum, especially with implicit regularisation from SGD noise, dropout, and weight decay.

Model Complexity vs Error (napkin table)

Model	Bias	Variance	Regularisation lever
Depth-1 tree	High	Low	max_depth
Full-depth tree	Low	High	prune, min_leaf
Random Forest (100 trees)	Low	Low	n_estimators, max_features
GBM (low LR, many trees)	Low	Low–Med	LR, n_estimators, subsample
Linear reg (λ=0)	High (nonlinear data)	Low	Ridge λ, Lasso λ

6. Cross-Validation & Regularisation

Cross-validation gives a nearly unbiased estimate of generalisation error by rotating through k non-overlapping validation folds. The gold standard is nested CV: an outer loop estimates generalisation; an inner loop tunes hyper-parameters. Collapsing the two loops causes over-optimistic performance estimates — a very common bug.

Data Leakage

Leakage means information about the test set contaminates the training pipeline. Common forms:

Fitting a scaler or imputer on the full dataset before splitting — any statistic computed over all rows leaks test distribution into training.
Future-leaking features in time-series (tomorrow's price in yesterday's row).
Target encoding computed before fold split — each fold's validation target is baked into the encoding.
Duplicated rows spanning train and test splits after deduplication was skipped.

Detection: if validation accuracy is suspiciously higher than expected from domain knowledge, suspect leakage. Shuffle the labels of the training set — a leaked model still performs well; a legitimate model degrades to chance.

7. Minimal Demos

Demo 1 — Decision Boundary: Logistic Regression vs Neural Net

Logistic regression cannot separate a spiral dataset (non-linear boundary) but a 2-layer neural net can. The same mechanism explains why linear attention variants underperform full softmax attention on tasks requiring non-linear mixing.

Decision Boundary: LR vs Neural Net (spiral dataset) — C Demo

stdin (optional)

Demo 2 — Bias-Variance Decomposition via Polynomial Regression

We run B=100 bootstrap experiments for each polynomial degree, computing the empirical bias² and variance at each degree. The sweet spot (degrees 3–5) minimises total expected error on a sine-curve target.

Bias-Variance Tradeoff via Polynomial Regression — C Demo

stdin (optional)

8. Production / Source Pointers

Concept	Library	Key file / function
Random Forest	scikit-learn	`sklearn/ensemble/_forest.py · RandomForestClassifier`
GBM (baseline)	scikit-learn	`sklearn/ensemble/_gb.py · GradientBoostingClassifier._fit_stage`
XGBoost tree growth	dmlc/xgboost	`src/tree/updater_histmaker.cc · FindSplit`
LightGBM leaf-wise	microsoft/LightGBM	`src/treelearner/serial_tree_learner.cpp · FindBestSplit`
Ridge regression	scikit-learn	`sklearn/linear_model/_ridge.py · Ridge.fit`
SVM (liblinear / libsvm)	scikit-learn	`sklearn/svm/_base.py · SVC._fit; wraps libsvm dual solver`

9. References

Papers

Breiman (2001) — Random Forests. Machine Learning 45(1): 5–32.
Friedman (2001) — Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5): 1189–1232.
Chen & Guestrin (2016) — XGBoost: A Scalable Tree Boosting System. arXiv:1603.02754.
Ke et al. (2017) — LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS 2017.
Prokhorenkova et al. (2018) — CatBoost: unbiased boosting with categorical features. NeurIPS 2018.
Belkin et al. (2019) — Reconciling modern machine learning practice and the bias-variance trade-off. arXiv:1812.11118.
Cortes & Vapnik (1995) — Support-vector networks. Machine Learning 20(3): 273–297.
Geman et al. (1992) — Neural Networks and the Bias/Variance Dilemma. Neural Computation 4(1).

Lectures

Stanford CS229 — Andrew Ng, supervised learning notes (linear reg, logistic reg, SVM, trees). Available at cs229.stanford.edu.
CMU 10-701 — Tom Mitchell / Barnabás Póczos, Machine Learning lectures (bias-variance, ensembles).
MIT 6.036 — Introduction to Machine Learning (cross-validation, regularisation).
Caltech "Learning From Data" — Yaser Abu-Mostafa, lecture 8 (bias-variance) and lecture 14 (SVM). Available on YouTube / edX.
fast.ai Practical DL — Jeremy Howard, lessons on gradient boosting for tabular data.

Textbooks

Hastie, Tibshirani & Friedman — The Elements of Statistical Learning (free PDF, Chapters 2, 7, 8, 10, 15).
Bishop — Pattern Recognition and Machine Learning (Chapters 3–4, 6, 14).
Murphy — Probabilistic Machine Learning: An Introduction (free draft, Chapters 16–18).
Vapnik — The Nature of Statistical Learning Theory.

Code / Repos

scikit-learn/scikit-learn — canonical Python ML library.
dmlc/xgboost — XGBoost source and paper.
microsoft/LightGBM — LightGBM source.
catboost/catboost — CatBoost source.

Blog posts

Sebastian Raschka — Machine Learning FAQ (bias-variance, model comparison).
distill.pub — How to Use t-SNE Effectively.
Lilian Weng — An Overview of Gradient Descent Optimization Algorithms.
scikit-learn user guide — Ensemble methods (bagging, random forests, gradient boosting).

10. Interview Prep

Derive the closed-form solution for ridge regression and explain why it always has a unique solution even when X^TX is singular.
Adding λI makes the matrix (X^TX + λI) strictly positive definite for any λ>0, so it is always invertible. The solution w* = (XᵀX + λI)⁻¹Xᵀy is the Tikhonov regularised least-squares estimate.
Explain the kernel trick. Why is RBF a "universal approximator"?
The SVM dual only needs inner products between training examples. Replacing x_i^Tx_j with K(x_i,x_j) implicitly maps to a feature space where linear separation is possible, without computing φ(x). RBF corresponds to an infinite-dimensional feature space (via Mercer's theorem) that can express any continuous function, hence universal.
What is the difference between bagging and boosting at the bias-variance level?
Bagging reduces variance (independent trees, averaged) without changing bias — each tree is fully grown (low bias). Boosting primarily reduces bias (each step corrects residuals) but can also reduce variance via shrinkage; it is more prone to overfitting on noisy labels.
When would you use Random Forest vs XGBoost on tabular data?
RF is easier to tune (few hyper-params, robust defaults), faster to train in parallel, and does not overfit badly with extra trees. XGBoost usually achieves lower test error with tuning (especially on competition datasets) but requires careful LR + regularisation tuning. For a quick baseline: RF. For max performance with budget to tune: XGBoost/LightGBM.
Describe data leakage and give a concrete example.
Leakage: test-set information flows into the training pipeline. Example: standardising features (subtract mean, divide std) computed over the full dataset before train/test split — the test mean is baked into training, making validation metrics overly optimistic. Fix: put all preprocessing inside a Pipeline / fit on train only.
What is double descent? How does it relate to transformer training?
Belkin et al. 2019: as model size increases past the interpolation threshold (zero training loss), test error rises then falls again. Transformers trained with billions of parameters are deep in the "second descent" regime — they memorise training data yet generalise because SGD noise + weight decay act as implicit regularisers. This is why scaling laws show monotone improvement with parameter count.
Walk through gradient boosting for one step using MSE loss. What does XGBoost add?
Negative gradient of MSE = residuals r_i = y_i - F(x_i). Fit a tree to residuals, update F += ν·tree. XGBoost adds: (1) second-order Taylor expansion for better leaf-value estimate; (2) L1/L2 regularisation on leaf scores; (3) column subsampling; (4) GPU histogram-based split finding — making it orders of magnitude faster.
You have an imbalanced binary classification problem (1% positives). Which metrics do you use and why?
Accuracy is misleading (99% accuracy by predicting all-negative). Prefer Precision-Recall AUC (focuses on positive class performance), F1 (harmonic mean of P and R), or Brier score (proper scoring rule for calibration). ROC-AUC is less sensitive to class imbalance than PR-AUC for rare positives. If cost of false negatives ≫ false positives, optimise recall at a fixed precision threshold.