Part III — Classical ML

§ 8. Unsupervised Learning & Evaluation

K-means · GMM/EM · DBSCAN · PCA · t-SNE · UMAP · ROC/AUC · F1 · Calibration · Cross-validation · Leakage

1. Overview

Unsupervised learning discovers structure in data without labels — clustering groups similar examples, while dimensionality reduction finds compact representations. Evaluation metrics and cross-validation then answer the orthogonal question: once we have a model (supervised or not), how do we measure and select it without fooling ourselves? These two halves of this page are complementary: good clustering features feed good classifiers; good evaluation practice separates papers that generalise from papers that overfit their test set.

2. Clustering Algorithms

K-Means

K-means minimises the Within-Cluster Sum of Squares (WCSS):

WCSS = Σ_{k=1}^K Σ_{x ∈ C_k}  ‖x − μ_k‖²

The algorithm alternates between two steps: assign each point to its nearest centroid (Voronoi partition), then recompute each centroid as the mean of its assigned points. Both steps guarantee WCSS cannot increase, so the algorithm always converges — but to a local minimum. k-means++ initialization (spread initial centroids proportional to squared distance from existing ones) dramatically reduces the chance of a bad local minimum.

Walkthrough (k=2, 4 points): Points p1=[1,1], p2=[1,2], p3=[5,4], p4=[5,5]. Init centroids μ₁=[1,1], μ₂=[5,4]. Round 1: p1,p2 → C₁; p3,p4 → C₂. Update: μ₁=[1, 1.5], μ₂=[5, 4.5]. Round 2: assignments unchanged → converged. WCSS = (0+0.25) + (0.25+0) = 0.5. Optimal.

Choosing k: Plot WCSS vs k (the "elbow" curve). The elbow point is where adding another cluster yields diminishing returns. Silhouette score offers a principled alternative: it measures how much tighter each point fits its own cluster vs the nearest other cluster (range −1 to +1; higher = better).

Limitations: Assumes spherical, equal-size clusters; sensitive to outliers; must specify k in advance; not deterministic (random init).

Gaussian Mixture Models (GMM) via EM

GMM models the data as a weighted sum of Gaussians:p(x) = Σ_k π_k · N(x; μ_k, Σ_k). Unlike K-means (hard assignment), GMM assigns soft responsibilities — a point near two cluster boundaries belongs fractionally to each. The Expectation-Maximization (EM) algorithm maximises the log-likelihood by alternating between computing responsibilities (E-step) and updating parameters (M-step).

GMM generalises K-means: setting all covariances to σ²I and taking the hard-assignment limit recovers K-means exactly. Full covariance matrices allow the model to capture elliptical clusters. The BIC / AIC criteria help select K.

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) does not require specifying k. A point is a core point if at least min_samples points lie within radius eps. Clusters grow by connecting core points that are density-reachable. Points reachable from no core point are labelled noise (−1). This makes DBSCAN robust to outliers and able to find arbitrarily shaped clusters — something K-means and GMM cannot do.

AlgorithmCluster shapeNeeds k?Outlier handlingSoft assign?
K-meansSphericalYesPulled into clusterNo
GMM/EMEllipticalYes (BIC)DownweightedYes
DBSCANArbitraryNo (eps, min_samples)Labelled noiseNo
Hierarchical (Ward)AnyPost-hoc cutoffBecomes own clusterNo

Hierarchical Clustering

Agglomerative hierarchical clustering starts with each point as its own cluster and greedily merges the two closest clusters (by linkage criterion) until one cluster remains. The result is a dendrogram — cut it at any height to get the desired k. Linkage options: single (min distance, chain-like clusters), complete (max distance, compact), average (UPGMA), and Ward (minimises WCSS increase on merge — produces compact, spherical clusters similar to K-means but hierarchically).

3. Dimensionality Reduction

PCA — Principal Component Analysis

PCA finds the orthogonal directions of maximum variance in the data. The first principal component (PC1) is the unit vector v that maximisesVar(Xv) = v^T Cov(X) v, subject to ‖v‖=1. The solution is the eigenvector of the covariance matrix with the largest eigenvalue. In practice, computing the covariance matrix explicitly is numerically unstable for high-D data; instead we use the SVD of the centered data matrix, which is equivalent and more numerically stable.

Why SVD equals PCA: The covariance matrix isC = (1/N) X_c^T X_c. If X_c = U S V^T, thenC = (1/N) V S^2 V^T. The columns of V are the eigenvectors of C (principal axes), and the eigenvalues areλ_j = s_j² / N. So running SVD on X_c gives PCA for free — no explicit covariance computation needed.

t-SNE

t-SNE (van der Maaten & Hinton, 2008) is a non-linear embedding method designed for visualisation. It converts pairwise high-dimensional distances into conditional probabilities (using a Gaussian kernel with bandwidth tuned by perplexity), then minimises the KL divergence between those high-D probabilities and low-D Student-t probabilities. The heavy tails of the Student-t prevent the "crowding problem" — nearby clusters stay separated even when mapping from hundreds of dimensions to 2D.

t-SNE warnings: (1) distances between clusters are NOT meaningful — only local neighbourhood structure is preserved. (2) Cluster sizes and shapes in the 2D plot are an artefact of perplexity. (3) Results are stochastic and can differ across runs. Never use t-SNE to measure inter-cluster distance.

UMAP

UMAP (McInnes et al., 2018) is grounded in Riemannian geometry and fuzzy simplicial sets. In practice it is 5–100× faster than t-SNE, better preserves global structure (macro-level relationships between clusters), and supports both unsupervised and semi-supervised modes. It is the preferred choice for production pipelines that need reproducible, interpretable 2D/3D embeddings (e.g., token-space visualisation, anomaly detection, nearest-neighbour retrieval).

MethodNon-linear?Preserves global?SpeedUse case
PCANoYes (linear)O(D² N)Preprocessing, noise reduction, whitening
t-SNEYesPoorO(N² log N)Local cluster visualisation only
UMAPYesGoodO(N^1.14)Embeddings, retrieval, visualisation

4. Minimal Demo — PCA via SVD

The demo below implements PCA entirely from scratch using torch.linalg.svd. Three clusters hidden in a 4D space are correctly separated in the 2D PCA projection. Watch the explained variance ratios — PC1+PC2 should capture ≈ 90 % of variance because the data was deliberately generated with only 2D signal.

PCA via SVD (pure torch) — C Demo
stdin (optional)

5. Evaluation Metrics

Confusion Matrix Basics

All binary classification metrics derive from four counts that depend on a decision threshold θ:

Predicted PositivePredicted Negative
Actual PositiveTP (true positive)FN (false negative — miss)
Actual NegativeFP (false positive — alarm)TN (true negative)
Accuracy  = (TP + TN) / N
Precision = TP / (TP + FP)        ← "of all alarms, how many were real?"
Recall    = TP / (TP + FN)        ← "of all positives, how many did we catch?"
F1        = 2 · P · R / (P + R)   ← harmonic mean; punishes extreme imbalance
FPR       = FP / (FP + TN)        ← false alarm rate (x-axis of ROC)

These metrics all depend on the threshold θ. To get a threshold-independent summary we sweep θ across all possible values and plot either the ROC curve (FPR vs TPR) or the PR curve (Recall vs Precision).

Calibration & Brier Score

A model is calibrated if its predicted probability p̂ = 0.7 means "this event happens 70% of the time". A reliability diagram bins predictions by p̂ and plots mean p̂ vs empirical fraction — a perfectly calibrated model falls on the diagonal. Models trained with cross-entropy tend to be better calibrated than margin-based models (SVM, RF), which often need Platt scaling or isotonic regression post-processing.

The Brier score is a proper scoring rule that measures both accuracy and calibration simultaneously:

Brier = (1/N) Σ_i (p̂_i − y_i)²     [0 = perfect, 0.25 = random, 1 = perfectly wrong]

Brier score is lower-bounded at 0 and decomposable into reliability + resolution − uncertainty. Use it when downstream decisions depend on calibrated probabilities (e.g., risk scoring, A/B test power calculations, cost-sensitive decisions).

6. Minimal Demo — Metrics Calculator

The demo simulates 50 binary predictions (20 positives, 30 negatives) and computes the full confusion matrix, P/R/F1/Acc at six thresholds, AUC-ROC via the trapezoidal rule, and the Brier score — all in plain Python with zero imports beyond random. Notice how precision rises and recall falls as the threshold increases.

Confusion Matrix → Metrics Calculator — C Demo
stdin (optional)

7. Cross-Validation & Model Selection

k-Fold CV

Standard k-fold CV partitions the dataset into k equal folds, trains k times each leaving one fold out as the validation set, and averages the k scores. This gives a nearly unbiased estimate of generalisation performance while using all data for evaluation. Common choices: k=5 (good variance/bias tradeoff), k=10 (slightly lower bias, higher compute), leave-one-out (k=N, high variance on small datasets).

Stratified k-Fold

For imbalanced classification (e.g., 5% positive class), standard k-fold may place all positives in one fold by chance. Stratified k-fold ensures each fold has the same class proportion as the full dataset. Always use stratified CV for binary classification unless the dataset is large (> 10k per class).

Time-Series CV

For time-ordered data (stock prices, logs, sensor streams), standard k-fold is invalid: it uses future data to train models that predict the past. Use the walk-forward (expanding window) or sliding windowscheme instead: always train only on data before the validation window.

Walk-forward (expanding window):
  Train=[t0..t1]       Val=[t1..t2]
  Train=[t0..t2]       Val=[t2..t3]
  Train=[t0..t3]       Val=[t3..t4]

Sliding window:
  Train=[t0..t1]       Val=[t1..t2]
  Train=[t1..t2]       Val=[t2..t3]

Nested CV (Unbiased Hyperparam + Eval)

If you use k-fold both to select hyperparameters and to report final accuracy, the reported score is optimistically biased (the validation folds were used to make model choices). Nested CV wraps an inner loop (hyperparameter selection) inside an outer loop (final score estimation):

Outer loop (k_out=5 folds — estimates generalisation error):
  for each outer fold j:
    Inner loop (k_in=3 folds on outer TRAIN split — selects best hyp):
      grid search over C, gamma, ...
      pick best_params_j
    Train final model on outer TRAIN with best_params_j
    Evaluate on outer VAL → score_j
Final estimate = mean(score_1..5)    ← unbiased

Nested CV is expensive (5 × 3 = 15 training runs per grid point) but is the gold standard for unbiased model comparison on small datasets. On large datasets (> 100k samples) a simple train/val/test split is usually sufficient.

8. Common Pitfalls

Data Leakage

Data leakage occurs when information from the validation/test set flows into model training. The result is an inflated training-time metric that collapses in production.

Leakage TypeExampleFix
Feature leakageNormalise with mean/std computed on full dataset (including test)Fit scaler only on train fold; transform val/test
Target leakageFeature contains future information (e.g., 'refund_requested' predicting churn)Audit feature timestamps; enforce train < test cutoff
Preprocessing leakagePCA/SMOTE applied before CV splitAll preprocessing inside the CV pipeline
Duplicate leakageSame record in train and test (after dedup failure)Dedup before splitting; use hash-based splits

Label Noise

Label noise (mislabelled examples) inflates test error and makes learning harder. Detecting it: look for high-confidence wrong predictions from an ensemble, or use "cleanlab" / confident learning to flag suspicious labels. Mitigation: label smoothing (replaces hard 0/1 targets with ε/K), noise-robust loss functions (symmetric cross-entropy), or semi-supervised re-labelling.

Train/Test Contamination

A related but distinct problem: if the same entity appears in both train and test (e.g., the same user ID, same image with minor augmentation, same news article with different timestamps), the model learns to recognise the entity rather than the pattern. Fix: split at the entity level, not at the sample level.

Interview red flag: "We got 99% accuracy on our held-out test set, but production is at 62%." Almost always caused by one of: feature leakage, preprocessing leakage, or entity-level contamination. First question: "Was the scaler fit on the full dataset?"

9. Feature Engineering for Tabular Data

Still critical in production tabular ML (fraud detection, recommendation ranking, search). Deep learning has not displaced feature engineering for structured data with < 1M rows.

TechniqueWhen to useLeakage risk
StandardScaler (z-score)Gradient-based models (LR, SVM, NN)Fit on train only
MinMaxScaler [0,1]Neural nets, KNN, when distribution is boundedFit on train only
RobustScaler (IQR)Data with outliersFit on train only
One-Hot EncodingLow-cardinality nominals (< 50 categories)None (structural)
Target EncodingHigh-cardinality nominals (city, user_id)MUST use out-of-fold — otherwise leaks label
Mean imputationMCAR (missing completely at random)Low — use train mean
Model-based imputationMAR (missing at random, correlated)Must be inside CV pipeline
Interaction featuresTree models that miss multiplicative effectsNone

10. Production / Source Pointers

Algorithm / Toolscikit-learn pathNotes
K-Meanssklearn.cluster.KMeansinit='k-means++', n_init=10
GMMsklearn.mixture.GaussianMixturecovariance_type='full'|'tied'|'diag'
DBSCANsklearn.cluster.DBSCANTune eps with k-distance graph
Hierarchicalsklearn.cluster.AgglomerativeClusteringlinkage='ward' for compact clusters
PCAsklearn.decomposition.PCAsvd_solver='full'|'randomized' (Halko et al.)
t-SNEsklearn.manifold.TSNEUse PCA init, perplexity=30–50
UMAPumap.UMAP (separate package)n_neighbors=15, min_dist=0.1
k-Fold CVsklearn.model_selection.StratifiedKFoldAlways stratify for classification
Pipelinesklearn.pipeline.PipelineEnsures preprocessing fits inside CV
Calibrationsklearn.calibration.CalibratedClassifierCVmethod='sigmoid'|'isotonic'

11. References

Papers

  • van der Maaten & Hinton (2008) — Visualizing Data using t-SNE, JMLR 9.
  • McInnes, Healy & Melville (2018) — UMAP: Uniform Manifold Approximation and Projection (arXiv:1802.03426)
  • Ester et al. (1996) — A Density-Based Algorithm for Discovering Clusters (DBSCAN), KDD.
  • Arthur & Vassilvitskii (2007) — k-means++: The Advantages of Careful Seeding, SODA.
  • Dempster, Laird & Rubin (1977) — Maximum Likelihood from Incomplete Data via the EM Algorithm, JRSS-B.
  • Platt (1999) — Probabilistic Outputs for SVMs (Platt scaling calibration).

Lectures

  • Stanford CS229 — Andrew Ng: Lecture Notes on EM, PCA, k-Means (freely available)
  • CMU 10-701 / 15-781 — Tom Mitchell / Eric Xing: Unsupervised Learning slides
  • Caltech CS156 — Yaser Abu-Mostafa: Learning From Data (lecture 12 on validation)
  • Oxford ML Hilary — Lecture 8: Clustering & Dimensionality Reduction
  • MIT 6.036 — Lecture 10: Unsupervised Learning

Textbooks

  • Hastie, Tibshirani & Friedman — The Elements of Statistical Learning, Ch. 13–14 (free PDF at web.stanford.edu/~hastie/ElemStatLearn)
  • Bishop — Pattern Recognition and Machine Learning, Ch. 9 (EM), Ch. 12 (PCA)
  • Murphy — Probabilistic Machine Learning: An Introduction, Ch. 20–22
  • James et al. — Introduction to Statistical Learning with R/Python, Ch. 12

Code / Repos

  • scikit-learn — sklearn.cluster, sklearn.decomposition, sklearn.model_selection
  • umap-learn — github.com/lmcinnes/umap
  • cleanlab — github.com/cleanlab/cleanlab — confident learning for label noise

Blog Posts

  • distill.pub — How to Use t-SNE Effectively (Wattenberg et al., 2016)
  • Sebastian Raschka — Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
  • Andrej Karpathy — A Recipe for Training Neural Networks (covers leakage / overfit debugging)

12. Interview Prep

Q1. Prove that PCA is equivalent to SVD on the centred data matrix.

The covariance matrix is C = (1/N) Xc^T Xc. The SVD of Xc = U S V^T gives Xc^T Xc = V S^2 V^T. So the columns of V are eigenvectors of C (principal axes), and eigenvalues λ_j = s_j²/N. PCA reduces to finding V via svd(Xc) — no explicit C computation needed.

Q2. When should you use AUC-ROC vs PR-AUC?

ROC AUC is threshold-invariant and symmetric w.r.t. classes — appropriate when class balance is roughly even. PR-AUC is better for severe class imbalance (fraud, rare disease) because ROC's x-axis (FPR) can be very small even with many false positives when TN is huge. PR-AUC directly measures the precision-recall tradeoff on the rare class.

Q3. Describe a data-leakage bug you've seen or can invent, and how you'd detect it.

Classic: computing StandardScaler mean/std on the full dataset before splitting, then fitting on train. Detection: final CV score dramatically higher than a simple baseline, or feature importances showing an unexpected feature at the top. Fix: put all preprocessing inside a Pipeline so fit() only sees train folds.

Q4. Why is t-SNE bad for preserving global structure? When is UMAP better?

t-SNE optimises KL divergence of local Gaussian distributions, intentionally ignoring long-range distances. This means cluster positions relative to each other are meaningless. UMAP preserves more global topology because its cost function combines both local (attractive) and global (repulsive) terms, and it models manifold structure with fuzzy simplicial sets. Use UMAP when you need interpretable cluster relationships or need to re-use the embedding on new data (UMAP supports transform(), t-SNE does not).

Q5. What is nested cross-validation and why is it necessary?

Standard k-fold CV with grid search optimistically biases the reported metric because validation folds influenced hyperparameter choice. Nested CV has an outer loop that estimates generalisation error and an inner loop that selects hyperparameters independently for each outer fold. The outer CV score is an unbiased estimate. Cost: k_out × k_in × |grid| model fits.

Q6. K-means converges but not necessarily to the global optimum — why, and how do you mitigate it?

K-means is a coordinate descent on a non-convex WCSS objective — it monotonically decreases WCSS each iteration but can get trapped in local minima depending on initialization. Mitigation: (1) k-means++ initialization spreads centroids, reducing bad local minima dramatically; (2) run n_init=10 random restarts and keep the best WCSS; (3) use GMM/EM or hierarchical clustering as a sanity check.

Q7. What is the Brier score and how does it decompose?

Brier = mean((p̂_i - y_i)²). It decomposes as Reliability (calibration error) - Resolution (how variable predictions are above mean prevalence) + Uncertainty (irreducible noise). A model with high resolution but poor reliability can have good Brier score after calibration. Use it when downstream decisions require calibrated probabilities (not just ranking).