Part I — Foundations

§ 1.2 AI Taxonomy & Industry Landscape

Where LLMs sit in the broader AI family; how the Transformer colonized every sub-field; who the frontier labs are in 2026; and which curriculum parts each role actually needs.

1. Overview

Before studying LLMs in depth, you need a mental map: where they sit in the broader AI family, which research communities created them, who is building them commercially today, and which curriculum parts you need based on the career you are targeting. This page answers all four questions. Read it once now; the job table in §5 will make more sense after you have worked through Parts IV–X.

The key insight from the diagram: LLMs are deep learning applied to text at scale. Reinforcement Learning (RL) overlaps both DL and classical ML; it reappears in post-training (RLHF, GRPO) where it is used to align LLMs to human preferences. Symbolic AI remains relevant as the verifier in neuro-symbolic systems and in formal-methods tooling.

2. AI vs ML vs DL vs LLM

These terms are not synonymous: each inner set is a strict subset of the outer one. The distinctions matter in practice because they determine what kind of data, compute, and expertise each approach requires.

Term | Definition | Key distinguishing property from parent
AI | Any technique that makes a machine exhibit intelligent behaviour | Includes rule-based expert systems and logic solvers; no learning required
ML | AI that improves from data rather than hand-coded rules | Requires a training set; model parameters are optimized by an objective function
DL | ML using multi-layer neural networks that learn hierarchical representations | No manual feature engineering; features are learned end-to-end from raw inputs
LLM | DL model (transformer-based) pre-trained on internet-scale text via next-token prediction | Scale (≥7B params, ≥1T tokens) + emergent capabilities: in-context learning, chain-of-thought, tool use
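The containment relation in the table can be made concrete with plain Python sets. This is only an illustrative sketch: the technique names placed in each set are examples chosen for the illustration, not an exhaustive catalogue, and `narrowest_category` is a hypothetical helper.

```python
# Illustrative members only; the technique names are examples, not a full taxonomy.
LLM = {"GPT-style decoder", "LLaMA-style decoder"}
DL = LLM | {"ResNet", "LSTM seq2seq", "deep Q-network"}
ML = DL | {"linear regression", "random forest", "tabular Q-learning"}
AI = ML | {"rule-based expert system", "logic solver"}

# RL is a paradigm that cuts across classical ML and DL rather than a nested layer.
RL = {"tabular Q-learning", "deep Q-network"}

# "<" on sets is the strict-subset test, so this checks LLM ⊂ DL ⊂ ML ⊂ AI.
assert LLM < DL < ML < AI


def narrowest_category(technique: str) -> str:
    """Return the innermost set in the chain that contains the technique."""
    for name, members in (("LLM", LLM), ("DL", DL), ("ML", ML), ("AI", AI)):
        if technique in members:
            return name
    raise KeyError(technique)


print(narrowest_category("logic solver"))        # AI
print(narrowest_category("tabular Q-learning"))  # ML
print(sorted(RL & DL))                           # the part of RL that is also DL
```

A rule-based system is AI without being ML, which is exactly what "strict subset" means here; RL shows up in both the classical-ML and the DL layers.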

Where does Reinforcement Learning (RL) fit?

RL is a learning paradigm, not a model family. An RL agent learns by trial and error from reward signals that are often delayed, rather than from labelled examples. It overlaps classical ML (tabular Q-learning) and DL (deep Q-networks, neural policy gradients). In the LLM context, RL appears in post-training: RLHF and GRPO use policy-gradient algorithms to push the base model toward preferred outputs, making RL an essential ingredient of ChatGPT-class alignment. Parts IX and XVII cover this in depth.
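To see what "push the model toward preferred outputs" means mechanically, here is a toy policy-gradient sketch. It is not RLHF or GRPO as implemented in any lab's pipeline: the "policy" is a softmax over three fixed candidate responses, the rewards are made-up numbers standing in for a reward model, and the function names are hypothetical. It uses the exact gradient of expected reward rather than sampled estimates.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact gradient-ascent step on expected reward J = sum_i p_i * r_i.

    For a softmax policy, dJ/dlogit_j = p_j * (r_j - J), so outputs with
    above-average reward gain probability and the rest lose it.
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [l + lr * p * (r - expected) for l, p, r in zip(logits, probs, rewards)]


# Hypothetical setup: three candidate responses scored by a made-up reward model.
rewards = [0.1, 0.9, 0.3]
logits = [0.0, 0.0, 0.0]  # the "base model" starts out indifferent

for _ in range(200):
    logits = policy_gradient_step(logits, rewards)

probs = softmax(logits)
print([round(p, 3) for p in probs])  # most of the mass ends up on response 1
```

The `(r_j - J)` term is a baseline-subtracted reward: only responses that beat the current average are reinforced. Real post-training replaces the three fixed responses with sampled token sequences and estimates the same gradient from samples.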

Why is "AGI" slippery?

Artificial General Intelligence has no agreed mathematical definition. Common informal definitions differ on the test (Turing test? all human cognitive tasks? ARC-AGI benchmark?), the subject (human-level? expert-level?), and the timescale (stateless test? lifelong learning?). Frontier labs use the term differently: OpenAI defines AGI as "highly autonomous systems that outperform humans at most economically valuable work"; Anthropic focuses on "transformative AI"; Google DeepMind uses AGI as an internal milestone framework. For this curriculum, treat AGI as a direction of travel, not a crisp target.

3. Sub-fields & Transformer Adoption

Since 2018 in NLP, and from 2020 onward elsewhere, the Transformer architecture has colonized every major AI sub-field. The diagram below shows the adoption sequence; each node gives the canonical first transformer-dominant paper in that domain.

Sub-field | Pre-transformer SOTA | Transformer inflection point | State of art 2026
NLP / LLM | LSTM seq2seq, ELMo, word2vec | BERT 2018, GPT-1 2018 | GPT-4o, Claude 3.5, Gemini 2.5 Pro
Computer Vision | ResNet, EfficientNet, Inception | ViT 2020 (arXiv:2010.11929) | SAM-2, Florence-2, InternViT
Speech / ASR | HMM-DNN, wav2vec 1.0, DeepSpeech | Conformer 2020, Whisper 2022 | Whisper-large-v3, SeamlessM4T-v2
Code Generation | Statistical models, tree-LSTM | Codex 2021, GitHub Copilot | DeepSeek-Coder-V2, claude-sonnet-4-6
Reinforcement Learning | DQN, PPO, SAC, MuZero | Decision Transformer 2021 | GRPO (DeepSeek-R1), process reward models
Multimodal | CNN + RNN captioning, VQA models | CLIP 2021, Flamingo 2022 | GPT-4o, Gemini 2.5 Pro, LLaVA-1.6
Computational Biology | Co-evolutionary stats, RoseTTAFold | AlphaFold-2 2021 (Nature 596) | ESM-3, AlphaFold-3, Evo 2024
Embodied AI | Model-predictive control, SAC robotics | RT-2 2023 (VLM + robot actions) | Pi-0, Octo, RoboFlamingo, GROOT

4. Industry Landscape (2026)

The LLM industry has consolidated around a handful of frontier labs — organizations that can afford the compute to train at the frontier (>$50M per run) and have the talent density to push architecture, post-training, and systems simultaneously.

4.1 United States & Europe

4.2 China

Chinese labs have closed the frontier gap rapidly. DeepSeek-V3 (2024) matched GPT-4-class quality at a reported training cost of <$6M, a figure that covers only the GPU time of the final training run; DeepSeek-R1 (2025) matched o1-class reasoning. The ecosystem is structurally different from the US one: reliance on domestic accelerators (Ascend, Cambricon) and strong pressure from US export controls to economize on compute together drive aggressive efficiency research (MLA, FP8 training, mixture-of-experts routing innovations).
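The sub-$6M figure is simple arithmetic on GPU-hours. The sketch below reproduces it from the numbers given in the DeepSeek-V3 Technical Report (arXiv:2412.19437); the $2 per H800 GPU-hour rental price is the report's own assumption, and the total excludes earlier research, ablation runs, and staff or infrastructure costs.

```python
# GPU-hour figures as stated in the DeepSeek-V3 Technical Report (arXiv:2412.19437).
GPU_HOURS = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}
# Rental price assumed by the report itself; not a market quote.
PRICE_PER_H800_HOUR_USD = 2.0

total_hours = sum(GPU_HOURS.values())
total_cost = total_hours * PRICE_PER_H800_HOUR_USD

for stage, hours in GPU_HOURS.items():
    cost = hours * PRICE_PER_H800_HOUR_USD
    print(f"{stage:<18} {hours:>10,} GPU-h  ${cost:>12,.0f}")
print(f"{'total':<18} {total_hours:>10,} GPU-h  ${total_cost:>12,.0f}")
```

Pre-training dominates the bill; post-training is a rounding error by comparison, which is one reason RL-based post-training recipes spread so quickly once strong base models were available.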

4.3 Open-Source Ecosystem

The open-source layer wraps around every frontier lab:

  • Hugging Face — model hub, transformers library, datasets, Spaces; de-facto model distribution layer
  • vLLM — PagedAttention-based inference server; de-facto production serving engine (Part X)
  • SGLang — structured generation with fast prefix-caching; competitor to vLLM at scale
  • LangChain / LlamaIndex — RAG pipelines and agent orchestration (Parts XVIII, XIX)
  • Triton — Python DSL for GPU kernels; Inductor backend in torch.compile (Part XIV)
  • PyTorch / JAX — training frameworks; PyTorch dominates research, JAX at Google
  • Megatron-LM — NVIDIA's reference for 3D-parallel pre-training (Part XI)
  • DeepSpeed — Microsoft's ZeRO optimizer and pipeline-parallel framework (Part XI)

4.4 Hardware Ecosystem

NVIDIA holds ~80% of AI training compute through the CUDA moat: NCCL collectives, cuBLAS/cuDNN kernel libraries, FlashAttention CUDA extensions, Megatron-LM, DeepSpeed, and all of torch.cuda are optimized for NVIDIA hardware first.

5. Job Map & Curriculum Priority

Different roles hire on very different subsets of this curriculum. The diagram below shows the primary career paths; the table maps each role to the curriculum parts that interviewers will actually quiz you on.

Role | Must Master (★★★) | Important (★★) | Helpful (★)
Research Scientist | II, IV, VI, VII, VIII, IX, XVII, XX | I, III, V, XIV, XVI, XVIII, XXI, XXII | X, XI, XII, XIII, XV, XIX
Research Engineer | IV, VI, VII, IX, XX, XXII | I, II, VIII, X, XI, XIV, XVII, XXI | III, V, XII, XIII, XV, XVI, XVIII, XIX
ML Infra Engineer | VI, VII, X, XI, XII, XIII, XIV, XV, XXII | I, II, IV, VIII, IX, XIX, XXI | III, V, XVI, XVII, XVIII, XX
GPU / Kernel Engineer | VI, XII, XIII, XIV, XXII | II, IV, VII, VIII, X, XI, XV | I, III, V, IX, XVI, XVII, XVIII, XIX, XX, XXI
Applied AI / Product | IX, X, XVI, XVIII, XIX, XXI, XXII | I, III, IV, V, VI, VIII, XVII, XX | II, VII, XI, XII, XIII, XIV, XV
Safety / Alignment | IX, XVII, XX, XXII | I, II, IV, VI, VII, VIII, XXI | X, XI, XVI, XVIII, XIX

6. Minimal Demo — Job-Role Skill Radar

Enter a role number (1–4) on stdin and the program prints a priority bar for all 22 curriculum parts. Try switching roles to see how different the study plans are: a Kernel Engineer (3) needs almost no RAG knowledge, while an Applied AI Engineer (4) barely needs CUDA depth.

Input format: a single integer — 1 Research Scientist  2 ML Infra  3 GPU/Kernel  4 Applied AI
Job-Role Skill Radar — C Demo
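The source of the C demo is not reproduced on this page. As a stand-in, the following is a minimal Python sketch of the same idea, not the original program. The role-to-part priorities are an assumption: they are read off the job table in §5 for the four roles the demo accepts, and `PARTS`, `ROLES`, `priorities`, and `render` are hypothetical names.

```python
PARTS = ("I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI "
         "XVII XVIII XIX XX XXI XXII").split()

# role number -> (name, must-master parts, important parts).
# Parts not listed for a role are treated as "helpful".
ROLES = {
    1: ("Research Scientist",
        "II IV VI VII VIII IX XVII XX", "I III V XIV XVI XVIII XXI XXII"),
    2: ("ML Infra Engineer",
        "VI VII X XI XII XIII XIV XV XXII", "I II IV VIII IX XIX XXI"),
    3: ("GPU / Kernel Engineer",
        "VI XII XIII XIV XXII", "II IV VII VIII X XI XV"),
    4: ("Applied AI / Product",
        "IX X XVI XVIII XIX XXI XXII", "I III IV V VI VIII XVII XX"),
}


def priorities(role: int) -> dict[str, int]:
    """Map each curriculum part to 3 (must master), 2 (important) or 1 (helpful)."""
    _, must, important = ROLES[role]
    must, important = set(must.split()), set(important.split())
    return {p: 3 if p in must else 2 if p in important else 1 for p in PARTS}


def render(role: int) -> str:
    """Build the text 'radar': one bar per part, longer bar = higher priority."""
    lines = [f"Role {role}: {ROLES[role][0]}"]
    lines += [f"Part {part:<6} {'#' * stars}"
              for part, stars in priorities(role).items()]
    return "\n".join(lines)


print(render(3))  # GPU / Kernel Engineer
```

To mirror the stdin interface described above, replace the last line with `print(render(int(input())))`. Comparing `render(3)` with `render(4)` shows the contrast the text mentions: the RAG parts (XVIII, XIX) get a single `#` for the kernel role, and the kernel part (XIV) gets a single `#` for the applied role.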

7. Production & Source Pointers

Where to find the canonical implementations of the concepts introduced on this page:

  • PyTorch nn.Transformer: torch/nn/modules/transformer.py. The canonical reference for multi-head attention + encoder/decoder.
  • Hugging Face transformers: src/transformers/models/. One sub-directory per architecture (llama, gpt2, mistral, gemma, qwen2, …).
  • vLLM engine: vllm/engine/llm_engine.py. The LLMEngine class; read it to understand the continuous-batching + PagedAttention glue.
  • Megatron-LM: megatron/core/tensor_parallel/. Column / row parallel linear layers as described in Narayanan et al. 2021.
  • FlashAttention: flash_attn/flash_attn_interface.py. The CUDA kernel is in csrc/flash_attn/.
  • nanoGPT: model.py (github.com/karpathy/nanoGPT). A 300-line GPT-2 you can read in one sitting.
  • DeepSeek-V3: inference/model.py (github.com/deepseek-ai/DeepSeek-V3). MLA attention + MoE routing in ~800 lines.

8. References

Papers

  • Vaswani et al. 2017 — Attention is All You Need. arXiv:1706.03762
  • Dosovitskiy et al. 2020 — An Image is Worth 16×16 Words (ViT). arXiv:2010.11929
  • Radford et al. 2021 — Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020
  • Gulati et al. 2020 — Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv:2005.08100
  • Chen et al. 2021 — Decision Transformer: Reinforcement Learning via Sequence Modeling. arXiv:2106.01345
  • Jumper et al. 2021 — Highly accurate protein structure prediction with AlphaFold. Nature 596.
  • Brown et al. 2020 — Language Models are Few-Shot Learners (GPT-3). arXiv:2005.14165
  • Ouyang et al. 2022 — Training language models to follow instructions with human feedback (InstructGPT). arXiv:2203.02155
  • DeepSeek-AI 2024 — DeepSeek-V3 Technical Report. arXiv:2412.19437
  • DeepSeek-AI 2025 — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL. arXiv:2501.12948
  • Shazeer et al. 2017 — Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer. arXiv:1701.06538

Lectures

  • Stanford CS221 (Intro to AI) — Percy Liang — covers AI taxonomy and sub-field survey
  • Stanford CS224n (NLP with Deep Learning) — unit 1: from classical NLP to transformers
  • Stanford CS336 (Building LLMs from Scratch) — lecture 1: full LLM landscape overview
  • MIT 6.S191 (Intro to Deep Learning) — lecture 1: DL ecosystem and industry landscape
  • Berkeley CS294-196 (LLM Agents) — lecture 1: LLM landscape and agent paradigms
  • Karpathy "State of GPT" (Microsoft Build 2023) — 90-minute pipeline walkthrough
  • Oxford Deep Learning for NLP (Hilary term) — tutorial 1: overview of NLP sub-fields

Textbooks

  • Russell & Norvig — Artificial Intelligence: A Modern Approach (4th ed.) — Part I: overview; Appendix: AI sub-fields
  • Goodfellow, Bengio, Courville — Deep Learning (MIT Press 2016) — Chapter 1: Introduction and landscape
  • Jurafsky & Martin — Speech and Language Processing (3rd ed. draft) — Chapter 1: NLP overview

Blog Posts & Reports

  • AI Index 2024 Annual Report (Stanford HAI) — compute trends, lab landscape, benchmark progress
  • Anthropic "Core Views on AI Safety" (anthropic.com) — mission and AGI-risk framing
  • OpenAI Charter (openai.com/charter) — AGI definition used by OpenAI
  • Epoch AI "Machine Learning Hardware" tracker — GPU capability and cost over time
  • Semianalysis "The AI Chip Landscape" — deep dives on NVIDIA / AMD / custom silicon economics

Code & Repos

  • github.com/huggingface/transformers — all major LLM architectures in one repo
  • github.com/vllm-project/vllm — PagedAttention inference server
  • github.com/NVIDIA/Megatron-LM — 3D-parallel pre-training reference
  • github.com/karpathy/nanoGPT — 300-line GPT-2 reference implementation
  • github.com/deepseek-ai/DeepSeek-V3 — MLA + MoE architecture in production

9. Interview Prep

These questions appear in ML breadth / system-design screens across frontier labs:

1. Draw the AI ⊇ ML ⊇ DL ⊇ LLM containment diagram. Where does RL sit?

Answer: RL sits inside ML (it is a learning paradigm), and overlaps DL when neural networks are used as value or policy functions. In the LLM context, RL appears outside the base pre-training loop — as a post-training step (RLHF, GRPO) that shifts the model's output distribution toward preferred outputs using a reward signal.

2. Compare symbolic AI, connectionism, and statistical ML on data hunger, interpretability, and generalization.

Answer:
Symbolic AI: no training data (hand-coded rules); fully interpretable (rule traces); brittle generalization (it handles only the cases its rules cover).
Connectionism (DL): data-hungry (millions+ examples); largely uninterpretable (mechanistic interp is open research); strong generalization at scale.
Statistical ML: moderate data need (thousands to millions); feature-level interpretability (coefficients, SHAP); best at small-data, structured regimes.

3. Name one technical innovation each of these labs is primarily known for: OpenAI, Anthropic, DeepSeek, Meta AI.

Answer:
OpenAI: RLHF pipeline (InstructGPT 2022) and inference-time compute scaling (o1, 2024).
Anthropic: Constitutional AI (CAI 2022), a self-critique and revision loop followed by RL from AI feedback, used in training the Claude model family (e.g. claude-sonnet-4-6).
DeepSeek: Multi-head Latent Attention (MLA, 2024) reducing KV-cache ~93%; MoE training at <$6M reported cost (V3 2024).
Meta AI: RoPE + GQA + SwiGLU + RMSNorm combination in LLaMA — now the industry-default open-weight architecture.

4. Why does NVIDIA hold ~80% of AI training compute despite AMD having competitive hardware specs?

Answer: The CUDA software moat. NCCL collectives, cuBLAS/cuDNN kernels, FlashAttention CUDA extensions, Megatron-LM, DeepSpeed, and all of torch.cuda are optimized for and validated on NVIDIA hardware first. AMD's ROCm has improved but lags by 1–2 years of kernel-level optimization. Switching cost = revalidating every collective primitive, mixed-precision path, and third-party kernel.

5. What hardware advantage does Groq have over NVIDIA H100 for LLM inference, and what is the trade-off?

Answer: Groq's LPU uses SRAM only (no HBM), eliminating the memory-bandwidth bottleneck of autoregressive decode. Token latency is <1 ms vs ~10 ms on H100 for a 70B model. The trade-off: SRAM capacity is tiny (~220 MB per chip), so large models require many chips in a pipeline — increasing per-token cost and limiting batch size compared to an H100 with 80 GB HBM.
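Both sides of this trade-off reduce to back-of-envelope arithmetic, sketched below. The 70B parameter count and the ~220 MB SRAM figure come from the answer above; the H100 HBM bandwidth of ~3.35 TB/s and the weight precisions are assumptions added for the illustration, and the function names are hypothetical. The latency number is a bandwidth-only lower bound for a single GPU at batch size 1 that ignores KV-cache reads, compute, and multi-GPU parallelism.

```python
import math


def chips_to_hold_weights(n_params, bytes_per_param, sram_bytes_per_chip):
    """Minimum chips needed just to keep every weight resident in on-chip SRAM."""
    return math.ceil(n_params * bytes_per_param / sram_bytes_per_chip)


def decode_latency_floor_ms(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Per-token lower bound at batch size 1: every weight is read once per token."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s * 1e3


N_PARAMS = 70e9
SRAM_PER_CHIP = 220e6   # ~220 MB per chip, the figure quoted above
H100_HBM_BW = 3.35e12   # assumption: ~3.35 TB/s HBM3 bandwidth on an H100 SXM

for label, bytes_per_param in (("FP16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)):
    chips = chips_to_hold_weights(N_PARAMS, bytes_per_param, SRAM_PER_CHIP)
    floor_ms = decode_latency_floor_ms(N_PARAMS, bytes_per_param, H100_HBM_BW)
    print(f"{label:>5}: >= {chips} SRAM chips; "
          f"HBM-bound floor ~{floor_ms:.1f} ms/token on one H100")
```

Under these assumptions the ~10 ms H100 figure corresponds to roughly 4-bit weights on a single GPU (FP16 weights for a 70B model would not fit in 80 GB at all), while the SRAM-only design needs hundreds of chips before it can serve a single token.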

6. Explain the difference between NVLink and InfiniBand. When would you choose each?

Answer: NVLink / NVSwitch is an intra-node interconnect (up to 8 GPUs per node, 900 GB/s per GPU), using a proprietary NVIDIA protocol. InfiniBand NDR is an inter-node fabric (400 Gbps per GPU port, RDMA-capable) connecting thousands of nodes in a training cluster. Use NVLink for tensor-parallel splits within a single node, where communication happens on every layer; use InfiniBand for data-parallel all-reduce and pipeline-parallel traffic across nodes.
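The gap between the two links is easier to appreciate with numbers. The sketch below takes the bandwidth figures from the answer at face value (in practice the NVLink number is an aggregate and achieved throughput is lower) and applies the standard bandwidth-only cost model for a ring all-reduce, in which each rank transfers 2(N-1)/N of the payload. The 7B-parameter BF16 gradient payload and the function name are hypothetical.

```python
def ring_allreduce_seconds(payload_bytes, n_ranks, link_bytes_per_s):
    """Bandwidth-only estimate for a ring all-reduce.

    Each rank sends (and receives) 2 * (N - 1) / N of the payload;
    latency and protocol overheads are ignored.
    """
    return 2 * (n_ranks - 1) / n_ranks * payload_bytes / link_bytes_per_s


NVLINK_BW = 900e9       # bytes/s per GPU, figure quoted above
IB_NDR_BW = 400e9 / 8   # 400 Gbit/s per port = 50e9 bytes/s

# Hypothetical payload: BF16 gradients of a 7B-parameter model (2 bytes each).
payload = 2 * 7e9

t_nvlink = ring_allreduce_seconds(payload, 8, NVLINK_BW)
t_ib = ring_allreduce_seconds(payload, 8, IB_NDR_BW)

print(f"NVLink:     {t_nvlink * 1e3:7.1f} ms")
print(f"InfiniBand: {t_ib * 1e3:7.1f} ms")
print(f"ratio:      {t_ib / t_nvlink:7.1f}x")
```

The 18x ratio is just 900 GB/s divided by 50 GB/s. It is why tensor parallelism, which communicates inside every layer, is kept within a node, while the less frequent data-parallel and pipeline traffic is what crosses the InfiniBand fabric.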

7. A candidate says they want to "work on LLMs." What clarifying questions determine which role they are suited for?

Answer (key questions):

  • Do you write papers or ship systems? → Research Scientist vs Research Engineer.
  • Have you written CUDA kernels or profiled GPU occupancy? → Kernel Engineer vs ML Engineer.
  • Have you set up distributed training across multiple nodes? → ML Infra vs Applied AI.
  • Do you use LLMs as a tool to build products, or do you train/fine-tune them? → Applied vs Research.
  • What excites you more: a 20% accuracy gain or a 5× throughput gain? → Research vs Systems orientation.