Part I — Foundations

§ 1.2 AI Taxonomy & Industry Landscape

Where LLMs sit in the broader AI family; how the Transformer colonized every sub-field; who the frontier labs are in 2026; and which curriculum parts each role actually needs.

1. Overview

Before studying LLMs in depth, you need a mental map: where they sit in the broader AI family, which research communities created them, who is building them commercially today, and which curriculum parts you need based on the career you are targeting. This page answers all four questions. Read it once now; the job table in §5 will make more sense after you have worked through Parts IV–X.

The key insight from the diagram: LLMs are deep learning applied to text at scale. Reinforcement Learning (RL) overlaps both DL and classical ML; it reappears in post-training (RLHF, GRPO) where it is used to align LLMs to human preferences. Symbolic AI remains relevant as the verifier in neuro-symbolic systems and in formal-methods tooling.

2. AI vs ML vs DL vs LLM

These terms are not synonymous: each inner set is a strict subset of the outer one. The distinctions matter in practice because they determine what kind of data, compute, and expertise each approach requires.

Term	Definition	Key distinguishing property from parent
AI	Any technique that makes a machine exhibit intelligent behaviour	Includes rule-based expert systems and logic solvers — no learning required
ML	AI that improves from data rather than hand-coded rules	Requires a training set; model parameters are optimized by an objective function
DL	ML using multi-layer neural networks that learn hierarchical representations	No manual feature engineering; features are learned end-to-end from raw inputs
LLM	DL model (transformer-based) pre-trained on internet-scale text via next-token prediction	Scale (≥7B params, ≥1T tokens) + emergent capabilities: in-context learning, chain-of-thought, tool use
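The "next-token prediction" objective in the LLM row is ordinary cross-entropy over the vocabulary. The sketch below shows it for a single position, assuming a toy four-token vocabulary and hand-picked logits; both are hypothetical and exist only for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_id):
    """Cross-entropy for one position: -log p(target | context)."""
    probs = softmax(logits)
    return -math.log(probs[target_id])

# Hypothetical toy vocabulary and model scores for the context "the cat sat on the"
vocab = ["mat", "dog", "moon", "the"]
logits = [2.0, 0.5, 0.1, -1.0]

loss_good = next_token_loss(logits, vocab.index("mat"))   # the model favours "mat"
loss_bad = next_token_loss(logits, vocab.index("the"))    # the model disfavours "the"
print(f"loss if the true next token is 'mat': {loss_good:.3f}")
print(f"loss if the true next token is 'the': {loss_bad:.3f}")
```

Pre-training averages this loss over every position in the corpus, so the loss is lowest when the model puts high probability on the token that actually comes next.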

Where does Reinforcement Learning (RL) fit?

RL is a learning paradigm, not a model family. An RL agent learns by trial and error from delayed reward signals rather than from labelled examples. It overlaps classical ML (tabular Q-learning) and DL (deep Q-networks, neural policy gradients). In the LLM context, RL appears in post-training: RLHF (classically with PPO) and GRPO are policy-gradient methods that push the base model toward preferred outputs, making RL an essential ingredient of ChatGPT-class alignment. Parts IX and XVII cover this in depth.
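To make the post-training use of RL concrete, here is a sketch of the step that gives GRPO its name: each sampled answer is scored relative to the other answers drawn for the same prompt, instead of against a learned value function. The function name, the 0/1 rewards, and the use of the population standard deviation are assumptions made for illustration; real implementations differ in details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)         # population std over the group (assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one prompt (1 = correct, 0 = wrong)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat the group average receive a positive advantage and are made more likely by the policy-gradient update; answers below average are made less likely. If every answer in the group earns the same reward, all advantages are zero and the group contributes no gradient.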

Why is "AGI" slippery?

Artificial General Intelligence has no agreed mathematical definition. Common informal definitions differ on the test (Turing test? all human cognitive tasks? ARC-AGI benchmark?), the bar (human-level? expert-level?), and the timescale (stateless test? lifelong learning?). Frontier labs use the term differently: OpenAI defines AGI as "highly autonomous systems that outperform humans at most economically valuable work"; Anthropic focuses on "transformative AI"; Google DeepMind has proposed a levels-based framework that grades systems by performance and generality. For this curriculum, treat AGI as a direction of travel, not a crisp target.

3. Sub-fields & Transformer Adoption

Since 2020, the transformer architecture has colonized every major AI sub-field. The diagram below shows the adoption sequence; each node gives the canonical first transformer-dominant paper in that domain.

Sub-field	Pre-transformer SOTA	Transformer inflection point	State of the art (2026)
NLP / LLM	LSTM seq2seq, ELMo, word2vec	BERT 2018, GPT-1 2018	GPT-4o, Claude 3.5, Gemini 2.5 Pro
Computer Vision	ResNet, EfficientNet, Inception	ViT 2020 (arXiv:2010.11929)	SAM-2, Florence-2, InternViT
Speech / ASR	HMM-DNN, wav2vec 1.0, DeepSpeech	Conformer 2020, Whisper 2022	Whisper-large-v3, SeamlessM4T-v2
Code Generation	Statistical models, tree-LSTM	Codex 2021, GitHub Copilot	DeepSeek-Coder-V2, claude-sonnet-4-6
Reinforcement Learning	DQN, PPO, SAC, MuZero	Decision Transformer 2021	GRPO (DeepSeek-R1), process reward models
Multimodal	CNN + RNN captioning, VQA models	CLIP 2021, Flamingo 2022	GPT-4o, Gemini 2.5 Pro, LLaVA-1.6
Computational Biology	Co-evolutionary stats, AlphaFold-1 (CNN)	AlphaFold-2 2021 (Nature 596)	ESM-3, AlphaFold-3, Evo 2024
Embodied AI	Model-predictive control, SAC robotics	RT-2 2023 (VLM + robot actions)	Pi-0, Octo, RoboFlamingo, GR00T

4. Industry Landscape (2026)

The LLM industry has consolidated around a handful of frontier labs — organizations that can afford the compute to train at the frontier (>$50M per run) and have the talent density to push architecture, post-training, and systems simultaneously.

4.1 United States & Europe

4.2 China

Chinese labs have closed the frontier gap rapidly. DeepSeek-V3 (2024) matched GPT-4-class quality at a reported training cost of <$6M (the final training run only, priced at rental rates; research and ablation runs are excluded); DeepSeek-R1 matched o1-class reasoning. The ecosystem is structurally different from the US: US export controls restrict access to top-end NVIDIA GPUs, which pushes labs toward domestic accelerators (Ascend, Cambricon) and aggressive efficiency research (MLA, FP8 training, mixture-of-experts routing innovations).
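The <$6M figure is simple arithmetic on GPU-hours. The sketch below reproduces it from the stage-by-stage GPU-hour counts and the $2 per GPU-hour rental price stated in the DeepSeek-V3 technical report (arXiv:2412.19437); treat the inputs as reported values, not as an audited cost.

```python
# GPU-hour figures as reported in the DeepSeek-V3 technical report
gpu_hours = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}
RENTAL_USD_PER_GPU_HOUR = 2.0   # rental price assumed by the report, not a measured cost

total_hours = sum(gpu_hours.values())
total_cost = total_hours * RENTAL_USD_PER_GPU_HOUR
print(f"{total_hours:,} H800 GPU-hours -> ${total_cost / 1e6:.3f}M")
```

The number covers only the final training run. Hardware purchase, staff, data, and earlier experiments are outside it, which is why it should not be compared directly with the total budget of a frontier lab.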

4.3 Open-Source Ecosystem

The open-source layer wraps around every frontier lab:

Hugging Face — model hub, transformers library, datasets, Spaces; de-facto model distribution layer
vLLM — PagedAttention-based inference server; de-facto production serving engine (Part X)
SGLang — structured generation with fast prefix-caching; competitor to vLLM at scale
LangChain / LlamaIndex — RAG pipelines and agent orchestration (Parts XVIII, XIX)
Triton — Python DSL for GPU kernels; the TorchInductor backend of torch.compile generates Triton kernels (Part XIV)
PyTorch / JAX — training frameworks; PyTorch dominates research, while JAX is used mainly at Google
Megatron-LM — NVIDIA's reference for 3D-parallel pre-training (Part XI)
DeepSpeed — Microsoft's ZeRO optimizer and pipeline-parallel framework (Part XI)

4.4 Hardware Ecosystem

NVIDIA holds ~80% of AI training compute through the CUDA moat: NCCL collectives, cuBLAS/cuDNN kernel libraries, FlashAttention CUDA extensions, Megatron-LM, DeepSpeed, and all of torch.cuda are optimized for NVIDIA hardware first.

5. Job Map & Curriculum Priority

Different roles hire on very different subsets of this curriculum. The diagram below shows the primary career paths; the table maps each role to the curriculum parts that interviewers will actually quiz you on.

Role	Must Master (★★★)	Important (★★)	Helpful (★)
Research Scientist	II, IV, VI, VII, VIII, IX, XVII, XX	I, III, V, XIV, XVI, XVIII, XXI, XXII	X, XI, XII, XIII, XV, XIX
Research Engineer	IV, VI, VII, IX, XX, XXII	I, II, VIII, X, XI, XIV, XVII, XXI	III, V, XII, XIII, XV, XVI, XVIII, XIX
ML Infra Engineer	VI, VII, X, XI, XII, XIII, XIV, XV, XXII	I, II, IV, VIII, IX, XIX, XXI	III, V, XVI, XVII, XVIII, XX
GPU / Kernel Engineer	VI, XII, XIII, XIV, XXII	II, IV, VII, VIII, X, XI, XV	I, III, V, IX, XVI, XVII, XVIII, XIX, XX, XXI
Applied AI / Product	IX, X, XVI, XVIII, XIX, XXI, XXII	I, III, IV, V, VI, VIII, XVII, XX	II, VII, XI, XII, XIII, XIV, XV
Safety / Alignment	IX, XVII, XX, XXII	I, II, IV, VI, VII, VIII, XXI	X, XI, XVI, XVIII, XIX

6. Minimal Demo — Job-Role Skill Radar

Enter a role number (1–4) on stdin and the program prints a priority bar for all 22 curriculum parts. Try switching roles to see how different the study plans are: a Kernel Engineer (3) needs almost no RAG knowledge, while an Applied AI Engineer (4) barely needs CUDA depth.

Input format: a single integer — 1 Research Scientist 2 ML Infra 3 GPU/Kernel 4 Applied AI
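A minimal Python sketch of the lookup such a demo performs, assuming the star ratings from the table in §5 for the four roles listed above. The function names and the `#` bar rendering are illustrative choices, not the page's own demo code.

```python
PARTS = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI",
         "XII", "XIII", "XIV", "XV", "XVI", "XVII", "XVIII", "XIX", "XX",
         "XXI", "XXII"]

# (name, must master, important) per role, copied from the table in section 5.
# For these four roles every remaining part is rated "helpful".
ROLES = {
    1: ("Research Scientist", "II IV VI VII VIII IX XVII XX",
        "I III V XIV XVI XVIII XXI XXII"),
    2: ("ML Infra Engineer", "VI VII X XI XII XIII XIV XV XXII",
        "I II IV VIII IX XIX XXI"),
    3: ("GPU / Kernel Engineer", "VI XII XIII XIV XXII",
        "II IV VII VIII X XI XV"),
    4: ("Applied AI / Product", "IX X XVI XVIII XIX XXI XXII",
        "I III IV V VI VIII XVII XX"),
}

def priorities(role_id):
    """Return {part: level} with 3 = must master, 2 = important, 1 = helpful."""
    _, must, important = ROLES[role_id]
    must, important = set(must.split()), set(important.split())
    return {p: 3 if p in must else 2 if p in important else 1 for p in PARTS}

def render(role_id):
    """Build one text bar per curriculum part for the chosen role."""
    lines = [f"Role {role_id}: {ROLES[role_id][0]}"]
    for part, level in priorities(role_id).items():
        lines.append(f"{part:>6} {'#' * level}")
    return "\n".join(lines)

print(render(3))   # GPU / Kernel Engineer
```

To read the role from stdin as described above, replace the last line with `print(render(int(input())))`. Comparing `render(3)` with `render(4)` shows the contrast mentioned in the text: Part XVIII is a one-bar item for the Kernel Engineer, while Parts XII to XIV are one-bar items for Applied AI.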

Job-Role Skill Radar — C Demo


7. Production & Source Pointers

Where to find the canonical implementations of the concepts introduced on this page:

PyTorch nn.Transformer — torch/nn/modules/transformer.py. The canonical reference for multi-head attention + encoder/decoder.
Hugging Face transformers — src/transformers/models/ — one sub-directory per architecture (llama, gpt2, mistral, gemma, qwen2, …).
vLLM engine — vllm/engine/llm_engine.py — LLMEngine class; read to understand continuous-batching + PagedAttention glue.
Megatron-LM — megatron/core/tensor_parallel/ — column / row parallel linear layers, introduced in Shoeybi et al. 2019 and scaled to 3D parallelism in Narayanan et al. 2021.
FlashAttention — flash_attn/flash_attn_interface.py; CUDA kernel in csrc/flash_attn/.
nanoGPT — model.py (github.com/karpathy/nanoGPT) — 300-line GPT-2 you can read in one sitting.
DeepSeek-V3 — inference/model.py (github.com/deepseek-ai/DeepSeek-V3) — MLA attention + MoE routing in ~800 lines.

8. References

Papers

Vaswani et al. 2017 — Attention is All You Need. arXiv:1706.03762
Dosovitskiy et al. 2020 — An Image is Worth 16×16 Words (ViT). arXiv:2010.11929
Radford et al. 2021 — Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020
Gulati et al. 2020 — Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv:2005.08100
Chen et al. 2021 — Decision Transformer: Reinforcement Learning via Sequence Modeling. arXiv:2106.01345
Jumper et al. 2021 — Highly accurate protein structure prediction with AlphaFold. Nature 596.
Brown et al. 2020 — Language Models are Few-Shot Learners (GPT-3). arXiv:2005.14165
Ouyang et al. 2022 — Training language models to follow instructions with human feedback (InstructGPT). arXiv:2203.02155
DeepSeek-AI 2024 — DeepSeek-V3 Technical Report. arXiv:2412.19437
DeepSeek-AI 2025 — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL. arXiv:2501.12948
Shazeer et al. 2017 — Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer. arXiv:1701.06538

Lectures

Stanford CS221 (Intro to AI) — Percy Liang — covers AI taxonomy and sub-field survey
Stanford CS224n (NLP with Deep Learning) — unit 1: from classical NLP to transformers
Stanford CS336 (Building LLMs from Scratch) — lecture 1: full LLM landscape overview
MIT 6.S191 (Intro to Deep Learning) — lecture 1: DL ecosystem and industry landscape
Berkeley CS294-196 (LLM Agents) — lecture 1: LLM landscape and agent paradigms
Karpathy "State of GPT" (Microsoft Build 2023) — end-to-end training pipeline walkthrough
Oxford Deep Learning for NLP (Hilary term) — tutorial 1: overview of NLP sub-fields

Textbooks

Russell & Norvig — Artificial Intelligence: A Modern Approach (4th ed.) — Part I: overview; Appendix: AI sub-fields
Goodfellow, Bengio, Courville — Deep Learning (MIT Press 2016) — Chapter 1: Introduction and landscape
Jurafsky & Martin — Speech and Language Processing (3rd ed. draft) — Chapter 1: NLP overview

Blog Posts & Reports

AI Index 2024 Annual Report (Stanford HAI) — compute trends, lab landscape, benchmark progress
Anthropic "Core Views on AI Safety" (anthropic.com) — mission and AGI-risk framing
OpenAI Charter (openai.com/charter) — AGI definition used by OpenAI
Epoch AI "Machine Learning Hardware" tracker — GPU capability and cost over time
Semianalysis "The AI Chip Landscape" — deep dives on NVIDIA / AMD / custom silicon economics

Code & Repos

github.com/huggingface/transformers — all major LLM architectures in one repo
github.com/vllm-project/vllm — PagedAttention inference server
github.com/NVIDIA/Megatron-LM — 3D-parallel pre-training reference
github.com/karpathy/nanoGPT — 300-line GPT-2 reference implementation
github.com/deepseek-ai/DeepSeek-V3 — MLA + MoE architecture in production

9. Interview Prep

These questions appear in ML breadth / system-design screens across frontier labs:

1. Draw the AI ⊇ ML ⊇ DL ⊇ LLM containment diagram. Where does RL sit?

Answer: RL sits inside ML (it is a learning paradigm), and overlaps DL when neural networks are used as value or policy functions. In the LLM context, RL appears outside the base pre-training loop — as a post-training step (RLHF, GRPO) that shifts the model's output distribution toward preferred outputs using a reward signal.
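If a whiteboard is not available, the same containment can be written with sets. The sketch below is only an illustration: the member techniques are hypothetical examples picked to show where RL falls, not an exhaustive taxonomy.

```python
# Hypothetical example techniques, chosen only to illustrate the containment
AI = {"expert system", "SAT solver", "linear regression", "tabular Q-learning",
      "ResNet", "deep Q-network", "GPT-style LLM"}
ML = {"linear regression", "tabular Q-learning", "ResNet", "deep Q-network",
      "GPT-style LLM"}
DL = {"ResNet", "deep Q-network", "GPT-style LLM"}
LLM = {"GPT-style LLM"}
RL = {"tabular Q-learning", "deep Q-network"}   # a paradigm, so it cuts across DL

assert LLM < DL < ML < AI        # each inner set is a strict subset of the outer one

print("RL inside ML: ", RL <= ML)
print("RL within DL: ", RL & DL)
print("RL outside DL:", RL - DL)
```

RL is fully inside ML but only partly inside DL, which is the overlap the answer describes.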

2. Compare symbolic AI, connectionism, and statistical ML on data hunger, interpretability, and generalization.

Answer:
Symbolic AI: no training data (hand-coded rules); fully interpretable (rule traces); brittle generalization (it handles only the cases its rules cover).
Connectionism (DL): data-hungry (millions+ examples); largely uninterpretable (mechanistic interp is open research); strong generalization at scale.
Statistical ML: moderate data need (thousands to millions); feature-level interpretability (coefficients, SHAP); best at small-data, structured regimes.

3. Name one technical innovation each of these labs is primarily known for: OpenAI, Anthropic, DeepSeek, Meta AI.

Answer:
OpenAI: RLHF pipeline (InstructGPT 2022) and inference-time compute scaling (o1, 2024).
Anthropic: Constitutional AI — a self-critique loop followed by RL from AI feedback (CAI 2022), carried forward into later Claude models such as claude-sonnet-4-6.
DeepSeek: Multi-head Latent Attention (MLA, 2024) reducing KV-cache ~93%; MoE training at <$6M reported cost (V3 2024).
Meta AI: RoPE + GQA + SwiGLU + RMSNorm combination in LLaMA — now the industry-default open-weight architecture.

4. Why does NVIDIA hold ~80% of AI training compute despite AMD having competitive hardware specs?

Answer: The CUDA software moat. NCCL collectives, cuBLAS/cuDNN kernels, FlashAttention CUDA extensions, Megatron-LM, DeepSpeed, and all of torch.cuda are optimized for and validated on NVIDIA hardware first. AMD's ROCm has improved but lags by 1–2 years of kernel-level optimization. Switching cost = revalidating every collective primitive, mixed-precision path, and third-party kernel.

5. What hardware advantage does Groq have over NVIDIA H100 for LLM inference, and what is the trade-off?

Answer: Groq's LPU keeps weights in on-chip SRAM only (no HBM), which largely removes the memory-bandwidth bottleneck of autoregressive decode. Per-token latency is correspondingly lower than on an H100 (commonly quoted as <1 ms vs ~10 ms for a 70B model, though such figures depend heavily on batch size and chip count). The trade-off: SRAM capacity is tiny (~220 MB per chip), so large models require many chips in a pipeline, which raises per-token cost and limits batch size compared to an H100 with 80 GB HBM.
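The capacity side of the trade-off is easy to quantify. The sketch below counts how many chips are needed just to hold the weights of the 70B model, using the ~220 MB SRAM figure quoted above; the weight precisions are assumptions, and activations and KV cache are ignored, so real deployments need more.

```python
import math

def chips_needed(n_params, bytes_per_param, sram_bytes_per_chip):
    """Minimum number of chips so that the weights alone fit in on-chip SRAM."""
    return math.ceil(n_params * bytes_per_param / sram_bytes_per_chip)

N_PARAMS = 70e9          # the 70B model from the answer above
SRAM_PER_CHIP = 220e6    # ~220 MB per chip, the figure quoted above

chips_8bit = chips_needed(N_PARAMS, 1, SRAM_PER_CHIP)    # 8-bit weights (assumption)
chips_16bit = chips_needed(N_PARAMS, 2, SRAM_PER_CHIP)   # 16-bit weights (assumption)
print(f"8-bit weights:  {chips_8bit} chips")
print(f"16-bit weights: {chips_16bit} chips")
```

Under these assumptions the model spans several hundred chips, whereas the same 8-bit weights fit in the 80 GB of a single H100. That gap is what drives the per-token cost difference.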

6. Explain the difference between NVLink and InfiniBand. When would you choose each?

Answer: NVLink / NVSwitch is an intra-node interconnect (up to 8 GPUs per node, 900 GB/s per GPU on H100), using a proprietary NVIDIA protocol. InfiniBand NDR is an inter-node fabric (400 Gbps per port, typically one port per GPU, RDMA-capable) connecting thousands of nodes in a training cluster. Use NVLink for tensor-parallel splits within a single node, where every layer needs a collective; use InfiniBand for data-parallel all-reduce and pipeline-parallel traffic across nodes, which tolerate lower bandwidth.
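A back-of-the-envelope comparison makes the choice concrete. The sketch below uses the standard bandwidth term of a ring all-reduce, 2(n-1)/n times the message size divided by link bandwidth, with the two link speeds quoted above. It ignores latency, protocol overhead, and the difference between bidirectional and unidirectional peak rates, and the 7B-parameter gradient size is a hypothetical example.

```python
def ring_allreduce_seconds(message_bytes, n_ranks, link_bytes_per_s):
    """Bandwidth term of a ring all-reduce: each rank sends and receives
    2 * (n - 1) / n of the message. Latency terms are ignored."""
    return 2 * (n_ranks - 1) / n_ranks * message_bytes / link_bytes_per_s

NVLINK = 900e9             # 900 GB/s, figure quoted above (peak, idealized)
INFINIBAND = 400e9 / 8     # 400 Gbps per port = 50 GB/s

GRADIENT_BYTES = 2 * 7e9   # hypothetical: 7B parameters in 16-bit precision

t_nvlink = ring_allreduce_seconds(GRADIENT_BYTES, 8, NVLINK)
t_ib = ring_allreduce_seconds(GRADIENT_BYTES, 8, INFINIBAND)
print(f"NVLink:     {t_nvlink * 1e3:.1f} ms")
print(f"InfiniBand: {t_ib * 1e3:.1f} ms")
print(f"ratio:      {t_ib / t_nvlink:.0f}x")
```

The ratio is simply the bandwidth ratio (900 / 50 = 18), which is why collectives that run on every layer, as in tensor parallelism, are kept on NVLink, while less frequent gradient synchronization can cross InfiniBand.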

7. A candidate says they want to "work on LLMs." What clarifying questions determine which role they are suited for?

Answer (key questions):

Do you write papers or ship systems? → Research Scientist vs Research Engineer.
Have you written CUDA kernels or profiled GPU occupancy? → Kernel Engineer vs ML Engineer.
Have you set up distributed training across multiple nodes? → ML Infra vs Applied AI.
Do you use LLMs as a tool to build products, or do you train/fine-tune them? → Applied vs Research.
What excites you more: a 20% accuracy gain or a 5× throughput gain? → Research vs Systems orientation.