Part I — Foundations

§ 1.1 AI Past & Present: Turing to the LLM Era

Eighty-five years of artificial intelligence in one page — the breakthroughs, the winters, and why "scaling laws + transformer + RLHF" is the current recipe.

1. Overview

The history of AI is not a smooth march; it is a sequence of distinct eras (five in the timeline below) punctuated by two severe funding winters. Every modern Large Language Model rests on artifacts that survived those winters: the formal idea of a computable function (1936), the artificial neuron (1943), back-propagation (1986), GPU general-purpose compute (2007), the transformer's attention mechanism (2017), and scaling laws (2020). Reading this page once gives every later Part of the curriculum the "why now?" context that pure technical material cannot supply.

Three intellectual currents have competed throughout: symbolic AI (rules and search), connectionism (learned weights in neural networks), and statistical machine learning (Bayesian / kernel methods on engineered features). The LLM wave is connectionism scaled to internet-sized data on transformer architectures, but symbolic verifiers and statistical baselines remain part of the modern stack.

2. Timeline of Milestones

Every milestone below either (a) introduced a primitive still used today, or (b) shifted the dominant paradigm. Year-by-year micro-events are omitted; consult the AI Index reports for those. The timeline is grouped by era so the rebrandings are visible.

2.1 Pre-AI Era (1936–1955)

Turing's 1936 paper on computable numbers defines what any future AI must compute. In 1943, neurophysiologist Warren McCulloch and logician Walter Pitts publish the first mathematical model of a neuron — a threshold unit over weighted Boolean inputs — and prove that networks of such units can compute any propositional logic statement. Donald Hebb's 1949 rule ("cells that fire together wire together") becomes the first local learning rule. Claude Shannon's 1948 information theory and Norbert Wiener's cybernetics close out the era. By 1950 Turing has framed the field's evaluation problem in his Imitation Game paper.

2.2 Symbolic AI Golden Age (1956–1974)

The 1956 Dartmouth Summer Research Project on Artificial Intelligence — organized by John McCarthy, Marvin Minsky, Claude Shannon and Nathaniel Rochester — coins the term artificial intelligence. The next 18 years see Newell & Simon's Logic Theorist and General Problem Solver, Weizenbaum's ELIZA (1966), Winograd's SHRDLU (1972) operating in a blocks world, and the rise of LISP as the AI language. Rosenblatt builds the first hardware perceptron in 1958. Minsky and Papert's 1969 Perceptrons book proves single-layer perceptrons cannot learn XOR — correct mathematically, but widely misread as "neural nets are doomed", which freezes connectionist funding for over a decade.

2.3 First Winter and Expert Systems (1974–1993)

The 1973 Lighthill report tells the UK Science Research Council that AI has failed to deliver and recommends pulling general AI funding; DARPA mostly follows. Recovery comes from narrowly scoped expert systems: MYCIN (Stanford, 1976) diagnoses bacterial infections; DENDRAL identifies molecules; Digital Equipment's R1 / XCON is credited with saving DEC roughly $40M/year configuring VAX orders. Japan's Fifth Generation Project (1982) tries to leapfrog with parallel logic-programming (Prolog-style) machines. In 1986 Rumelhart, Hinton and Williams re-popularize back-propagation, which Linnainmaa (1970) and Werbos (1974) had described earlier. The LISP-machine market collapses in 1987 when general-purpose workstations become faster and cheaper; the second AI winter runs through ~1993.

2.4 Statistical ML and NLP (1993–2011)

The field rebrands as machine learning and statistics. IBM's statistical machine translation team (Brown et al.) introduces IBM Models 1–5 in the early 1990s; HMM-based speech recognition dominates. Cortes & Vapnik publish the SVM (1995); Freund & Schapire formalize boosting; Breiman publishes random forests (2001). Hinton's 2006 deep belief net paper rekindles connectionism. NVIDIA ships CUDA 1.0 in 2007, opening GPGPU programming. Fei-Fei Li's group releases ImageNet in 2009 (it eventually grows to ~15M images across 22K classes). IBM's Watson wins Jeopardy! in 2011 using a 90-server hybrid symbolic / IR / ML stack.

2.5 Deep Learning and LLM Era (2012–today)

AlexNet (Krizhevsky / Sutskever / Hinton, 2012) wins the ImageNet challenge by roughly 10 points of top-5 error using two GTX 580 GPUs, showing that GPU-trained deep CNNs beat hand-engineered features at scale. The next five years bring VGG (2014), GoogLeNet (2014), ResNet (2015, residual connections unlock 100+ layers), BatchNorm, Adam, GANs, Seq2Seq with attention (Bahdanau 2014), GloVe, and DeepMind's AlphaGo (2016). In 2017 Vaswani et al. publish "Attention Is All You Need"; the transformer becomes the universal architecture. 2018 brings BERT and GPT-1. 2020 brings GPT-3 (175B parameters) and the Kaplan scaling laws. ChatGPT, released in November 2022, turns LLMs into a consumer product. 2023–2025 add GPT-4, Claude, Gemini, LLaMA, DeepSeek, Mixtral, then reasoning models (o1, DeepSeek-R1) and the agent wave. The compute frontier doubles roughly every 6–10 months.

3. Key Concepts & Inventions

Every primitive in the table below survived at least one winter and is still part of a modern training stack — either literally (e.g. backprop, attention) or as the ancestor of today's version.

Year	Primitive	Authors	Still used as
1943	Artificial neuron	McCulloch, Pitts	linear layer + activation in every NN
1949	Hebbian rule	Hebb	local-update intuition; SAEs, Hopfield nets
1958	Perceptron	Rosenblatt	single linear classifier; logistic regression
1970–1986	Back-propagation	Linnainmaa; Werbos; Rumelhart, Hinton, Williams	autograd in PyTorch / JAX
1989	CNN for digits	LeCun et al.	CV backbones; vision transformer patches
1997	LSTM	Hochreiter, Schmidhuber	RNN baseline before transformers; still in TTS, time-series
2007	CUDA	NVIDIA	every modern GPU kernel
2014	Seq2Seq + attention	Sutskever; Bahdanau	attention became the core sequence-mixing mechanism in transformers
2017	Transformer	Vaswani et al.	backbone of every frontier LLM
2020	Scaling laws	Kaplan et al.; Hoffmann et al. 2022	predicts loss vs (params, tokens, compute)
2022	RLHF for chat	Ouyang et al. (InstructGPT)	DPO, RLAIF, RL-from-verifier descend from this

4. Core Mechanism — The AI Winter Cycle

Background

Two well-documented AI winters (1974–1980 and 1987–1993) and arguably a third mini-winter after the 2001 dot-com crash share an almost identical mechanism. Understanding the cycle matters because the 2023–2025 LLM boom shows several of the same warning signs (overpromising on agents, capability plateaus on certain benchmarks, training-cost inflation), and veteran researchers explicitly invoke this history to argue for caution.

Plan

A demonstration on a toy task is publicized as general progress.
Funders, often a small group (DARPA, MITI, top VCs), commit large budgets.
Researchers and vendors over-promise generalization, glossing over combinatorial and data limits.
A high-profile failure or stagnating benchmark exposes the gap (the Lighthill report, the 1966 ALPAC machine-translation report, the LISP-machine collapse).
Funders cut budgets; the field rebrands ("machine learning", "data science").
An enabling breakthrough — usually cheaper compute or a bigger dataset — restarts the cycle.

Worked Example — The Second Winter (1987–1993)

We walk through the cycle on the second winter, because it is the cleanest case and includes both an over-promise event and a hardware-disruption event.

Phase	Year	State of the world
Demonstration	1980	DEC deploys R1/XCON to configure VAX orders; by the mid-1980s it is credited with saving about $40M/year, and expert systems look like the path to general AI.
Funding flood	1982–1985	Japan launches the Fifth Generation Project (~$400M). DARPA Strategic Computing Initiative funds Symbolics & LMI LISP machines.
Over-promise	1985	Vendors promise expert systems will handle commonsense reasoning; knowledge bases grow but generalization stalls (the "knowledge acquisition bottleneck").
Failure	1987	Sun-3 + UNIX workstations match LISP-machine performance at a fraction of the cost. Symbolics revenue collapses; XCON maintenance costs explode.
Budget cut	1988–1993	DARPA pivots to applications, not AI per se; Symbolics slides into bankruptcy in the early 1990s; AI-branded conferences shrink.
Breakthrough	1993–1998	Rebranding as "machine learning" + statistical methods + cheap clusters; SVM (1995), AdaBoost, statistical MT, HMM speech.

What the reader (a 2025 researcher) should take away: AI winters are caused by an economic mismatch between promised and delivered capability, not by the technology being wrong. The connectionist primitives the 1969 perceptron critique was taken to have killed were the ones that returned in 2012; the rule-based expert systems the 1987 winter buried are arguably returning today as "tool use" and "verifier" layers on top of LLMs. Plan a research career assuming the next winter is roughly one funding cycle away.

5. Minimal Demo — AI History Time Machine

Enter any year between 1936 and 2025. The program prints the dominant paradigm, representative hardware, the seminal paper of that era, and one sample SOTA task. Try 1958, 1975, 2012 and 2024 back-to-back to feel the paradigm shifts.

AI History Time Machine — C Demo

5.1 Compute-vs-capability scaling

Enter two years separated by a space (e.g. 2012 2024) to compute the effective compute doubling time across the deep-learning era. The OpenAI 2018 "AI and Compute" analysis estimated a 3.4-month doubling from 2012; later estimates, such as Epoch AI's 2022 "Compute Trends Across Three Eras of Machine Learning", put the deep-learning-era doubling time closer to ~6 months.

Compute Scaling Curve (2012-2025) — C Demo

6. The Modern LLM Recipe

Every frontier LLM as of 2026 follows the same seven-ingredient recipe. Each ingredient entered the field at a specific date; together they explain why the LLM era began in 2017–2020 and why the recipe does not depend on any single lab: if one lab disappeared, competitors would catch up.

Pedagogical link: ingredient 1 is covered in Part VII (Pre-training data), ingredient 2 in Part VI (Transformer), ingredient 3 in Parts XI–XV (Distributed / GPU / CUDA / Kernels / Networking), ingredient 4 in Part VII (Scaling laws), ingredients 5–7 in Part IX (Post-training). Use this page as the map.

7. Source / Resource Pointers

Stanford AI Index — annual snapshot of compute, capability, and economics. aiindex.stanford.edu
Epoch AI — database of training-compute estimates for every notable model since AlexNet. epochai.org
Karpathy "The Unreasonable Effectiveness of RNNs" — canonical primer that bridges classical seq models to modern attention.
Hugging Face Open LLM Leaderboard — live capability snapshot.
nanoGPT / llm.c (Karpathy) — minimal modern transformer; referenced again in Task T20.

8. References

Papers

Turing, A. 1950. "Computing Machinery and Intelligence." Mind 59(236).
McCulloch, W. & Pitts, W. 1943. "A Logical Calculus of the Ideas Immanent in Nervous Activity." Bulletin of Mathematical Biophysics 5.
Rosenblatt, F. 1958. "The Perceptron." Psychological Review 65(6).
Minsky, M. & Papert, S. 1969. Perceptrons. MIT Press.
Rumelhart, D., Hinton, G., Williams, R. 1986. "Learning Representations by Back-Propagating Errors." Nature 323.
Hochreiter, S. & Schmidhuber, J. 1997. "Long Short-Term Memory." Neural Computation 9(8).
Krizhevsky, A., Sutskever, I., Hinton, G. 2012. "ImageNet Classification with Deep Convolutional Neural Networks" (AlexNet). NeurIPS.
Vaswani, A. et al. 2017. "Attention Is All You Need." arXiv:1706.03762
Devlin, J. et al. 2018. "BERT: Pre-training of Deep Bidirectional Transformers." arXiv:1810.04805
Radford, A. et al. 2018. "Improving Language Understanding by Generative Pre-Training" (GPT-1).
Brown, T. et al. 2020. "Language Models are Few-Shot Learners" (GPT-3). arXiv:2005.14165
Kaplan, J. et al. 2020. "Scaling Laws for Neural Language Models." arXiv:2001.08361
Hoffmann, J. et al. 2022. "Training Compute-Optimal Large Language Models" (Chinchilla). arXiv:2203.15556
Ouyang, L. et al. 2022. "Training Language Models to Follow Instructions with Human Feedback" (InstructGPT). arXiv:2203.02155
OpenAI 2024. "Learning to Reason with LLMs" (o1 announcement post).
DeepSeek-AI 2025. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948

Public Lectures

Stanford CS221 — Artificial Intelligence: Principles and Techniques (Percy Liang).
Stanford CS25 — Transformers United; in particular the "History of Transformers" lecture.
MIT 6.034 — Patrick Winston's legacy AI lectures (free on MIT OCW + YouTube).
MIT 6.S191 — Introduction to Deep Learning, opening "history of deep learning" lecture.
CMU 11-785 — Bhiksha Raj's Deep Learning course; the first lecture surveys the field.
Berkeley CS182 / CS282 — Sergey Levine's deep-learning course opening lecture.
Harvard CS50's Introduction to Artificial Intelligence with Python.
Oxford Machine Learning — legacy Nando de Freitas lectures.
DeepMind × UCL — David Silver's RL lecture series (sets up T50).
OpenAI Spinning Up in Deep RL (companion to Berkeley CS285).
Hugging Face NLP Course — opening "What is NLP" chapter.

Textbooks

Russell, S. & Norvig, P. Artificial Intelligence: A Modern Approach (4th ed., 2020).
Nilsson, N. The Quest for Artificial Intelligence (Cambridge, 2009, free PDF).
Goodfellow, I., Bengio, Y., Courville, A. Deep Learning (MIT Press, 2016).
Jurafsky, D. & Martin, J. Speech and Language Processing (3rd ed. draft).
Murphy, K. Probabilistic Machine Learning: An Introduction & Advanced Topics.

Blog Posts & Surveys

OpenAI — AI and Compute (2018).
Stanford AI Index Annual Reports (2018–2025).
Karpathy — The Unreasonable Effectiveness of RNNs (2015); Software 2.0 (2017); State of GPT (2023).
Sutton, R. — The Bitter Lesson (2019).
Anthropic — Core Views on AI Safety.
Bender, E. et al. 2021. "On the Dangers of Stochastic Parrots." FAccT.
Epoch AI — Compute Trends Across Three Eras of Machine Learning (2022).

Code / Repos

karpathy/nanoGPT — readable modern transformer (revisited in T20).
karpathy/llm.c — transformer in C/CUDA.
karpathy/micrograd — toy autograd used in T09.
huggingface/transformers — reference modern LLM library.
vllm-project/vllm — production-grade inference engine (Part X).

9. Interview Prep

Question	Concise answer
Name one cause of each AI winter and one breakthrough that ended it.	First (1974): Lighthill report + failure of machine translation; revived by expert systems and cheap personal computers. Second (1987): LISP-machine market collapse + brittle expert systems; revived by statistical ML + faster commodity CPUs + the web's data.
Why didn't the 1969 Minsky-Papert perceptron critique kill neural nets?	Their proof was specifically about single-layer perceptrons; multi-layer networks were already known to solve XOR. Funding dried up for ~15 years but core ideas survived in Fukushima's Neocognitron, Hopfield nets, and the eventual rediscovery of backprop in 1986.
What changed between 2012 (AlexNet) and 2017 (Transformer) that enabled the LLM era?	Three concrete shifts: (1) CUDA + framework maturity (Theano → TensorFlow → PyTorch); (2) architectural innovations that scaled depth and parallelism (BatchNorm, residual connections, attention); (3) the realization that loss decreases predictably as a power law in compute, parameters and tokens.
Compare symbolic and connectionist AI on three axes.	Data hunger: symbolic = low / connectionist = high. Interpretability: symbolic = high / connectionist = low. Generalization: symbolic struggles outside the rule set; connectionist generalizes smoothly within distribution but can fail unpredictably out of distribution. Modern stacks combine both (LLM + verifier / tool-use).
Why is "scaling laws + transformer + RLHF" called the recipe of the LLM era?	Transformer gives a parallel-friendly architecture; scaling laws (Kaplan 2020, Chinchilla 2022) tell labs which compute / param / token ratio minimizes loss; RLHF (InstructGPT 2022) converts a raw next-token predictor into an instruction-following assistant. Removing any one ingredient gives a noticeably worse product.
List three frontier US labs and three Chinese labs with one architectural decision each.	OpenAI — reasoning RL on chain-of-thought (o1). Anthropic — Constitutional AI for harmlessness. Google DeepMind — long-context (Gemini 1M+ tokens). DeepSeek — Multi-Head Latent Attention + FP8 training (V3). Moonshot/Kimi — ultra-long-context windows. Qwen (Alibaba) — open MoE (Qwen-MoE) with sliding-window attention.
What is "The Bitter Lesson" and is it true?	Rich Sutton 2019: methods that scale with compute (search + learning) consistently beat methods that bake in human domain knowledge. Strongly supported by AlexNet, AlphaGo, and the LLM era; debated for tasks with limited data or strong physical priors (robotics, scientific ML).