§ 1.2 AI Taxonomy & Industry Landscape
Where LLMs sit in the broader AI family; how the Transformer colonized every sub-field; who the frontier labs are in 2026; and which curriculum parts each role actually needs.
1. Overview
Before studying LLMs in depth, you need a mental map: where they sit in the broader AI family, which research communities created them, who is building them commercially today, and which curriculum parts you need based on the career you are targeting. This page answers all four questions. Read it once now; the job table in §5 will make more sense after you have worked through Parts IV–X.
The key insight from the diagram: LLMs are deep learning applied to text at scale. Reinforcement Learning (RL) overlaps both DL and classical ML; it reappears in post-training (RLHF, GRPO) where it is used to align LLMs to human preferences. Symbolic AI remains relevant as the verifier in neuro-symbolic systems and in formal-methods tooling.
2. AI vs ML vs DL vs LLM
These terms are not synonymous: each inner set is a strict subset of the outer one. The distinctions matter in practice because they determine what kind of data, compute, and expertise each approach requires.
| Term | Definition | Key distinguishing property from parent |
|---|---|---|
| AI | Any technique that makes a machine exhibit intelligent behaviour | Includes rule-based expert systems and logic solvers — no learning required |
| ML | AI that improves from data rather than hand-coded rules | Requires a training set; model parameters are optimized by an objective function |
| DL | ML using multi-layer neural networks that learn hierarchical representations | No manual feature engineering; features are learned end-to-end from raw inputs |
| LLM | DL model (transformer-based) pre-trained on internet-scale text via next-token prediction | Scale (typically billions of parameters trained on hundreds of billions to trillions of tokens) + emergent capabilities: in-context learning, chain-of-thought, tool use |
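The strict-subset relationship in the table can be checked mechanically. The sketch below is only an illustration: the technique names are examples chosen for this snippet, and each level simply adds one or two members that do not belong to the level inside it.

```python
# Illustrative technique names only; membership follows the table above.
LLM = {"GPT-style transformer language model"}
DL = LLM | {"ResNet image classifier"}
ML = DL | {"logistic regression", "tabular Q-learning"}
AI = ML | {"rule-based expert system"}

# On Python sets, `<` tests for a strict (proper) subset.
assert LLM < DL < ML < AI

# What each level adds over the level nested inside it.
print("AI but not ML:", sorted(AI - ML))   # no learning required
print("ML but not DL:", sorted(ML - DL))   # learned, but not deep networks
print("DL but not LLM:", sorted(DL - LLM)) # deep networks outside language modelling
```

Tabular Q-learning is placed in ML but outside DL here, which matches the point that RL overlaps classical ML as well as deep learning.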
Where does Reinforcement Learning (RL) fit?
RL is a learning paradigm, not a model family. An RL agent learns by trial-and-error with delayed reward signals rather than from labelled examples. It overlaps classical ML (tabular Q-learning) and DL (deep Q-network, neural policy gradient). In the LLM context, RL appears in post-training: RLHF and GRPO use policy-gradient algorithms to push the base model toward preferred outputs, making RL an essential ingredient of ChatGPT-class alignment. Parts XVII and IX cover this in depth.
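To make "push the base model toward preferred outputs" concrete, here is a toy sketch, not RLHF or GRPO themselves. It assumes a single prompt with three candidate outputs, hypothetical reward scores, and a softmax policy over three logits, and it applies the exact policy gradient of the expected reward rather than a sampled estimate.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three candidate outputs for one prompt, with hypothetical reward scores.
rewards = [0.1, 0.2, 1.0]
logits = [0.0, 0.0, 0.0]  # the "base model" starts indifferent
lr = 0.5

for _ in range(200):
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))  # J = E[reward]
    # Exact policy gradient for a softmax policy: dJ/dlogit_k = p_k * (r_k - J).
    grads = [p * (r - expected) for p, r in zip(probs, rewards)]
    logits = [x + lr * g for x, g in zip(logits, grads)]  # gradient ascent on J

probs = softmax(logits)
print([round(p, 3) for p in probs])
```

After the updates, most of the probability mass sits on the highest-reward output. Real post-training replaces the three logits with the full language model, estimates the gradient from sampled completions, and adds a penalty that keeps the policy close to the base model.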
Why is "AGI" slippery?
Artificial General Intelligence has no agreed mathematical definition. Common informal definitions differ on the test (Turing test? all human cognitive tasks? ARC-AGI benchmark?), the subject (human-level? expert-level?), and the timescale (stateless test? lifelong learning?). Frontier labs use the term differently: OpenAI defines AGI as "highly autonomous systems that outperform humans at most economically valuable work"; Anthropic focuses on "transformative AI"; Google DeepMind uses AGI as an internal milestone framework. For this curriculum, treat AGI as a direction of travel, not a crisp target.
3. Sub-fields & Transformer Adoption
Starting with NLP in 2018 and spreading to most other sub-fields from 2020 onward, the transformer architecture has colonized every major AI sub-field. The diagram below shows the adoption sequence; each node gives the canonical first transformer-dominant paper in that domain.
| Sub-field | Pre-transformer SOTA | Transformer inflection point | State of art 2026 |
|---|---|---|---|
| NLP / LLM | LSTM seq2seq, ELMo, word2vec | BERT 2018, GPT-1 2018 | GPT-4o, Claude 3.5, Gemini 2.5 Pro |
| Computer Vision | ResNet, EfficientNet, Inception | ViT 2020 (arXiv:2010.11929) | SAM-2, Florence-2, InternViT |
| Speech / ASR | HMM-DNN, wav2vec 1.0, DeepSpeech | Conformer 2020, Whisper 2022 | Whisper-large-v3, SeamlessM4T-v2 |
| Code Generation | Statistical models, tree-LSTM | Codex 2021, GitHub Copilot | DeepSeek-Coder-V2, claude-sonnet-4-6 |
| Reinforcement Learning | DQN, PPO, SAC, MuZero | Decision Transformer 2021 | GRPO (DeepSeek-R1), process reward models |
| Multimodal | CNN + RNN captioning, VQA models | CLIP 2021, Flamingo 2022 | GPT-4o, Gemini 2.5 Pro, LLaVA-1.6 |
| Computational Biology | Co-evolutionary stats, RoseTTAFold | AlphaFold-2 2021 (Nature 596) | ESM-3, AlphaFold-3, Evo 2024 |
| Embodied AI | Model-predictive control, SAC robotics | RT-2 2023 (VLM + robot actions) | Pi-0, Octo, RoboFlamingo, GR00T |
4. Industry Landscape (2026)
The LLM industry has consolidated around a handful of frontier labs — organizations that can afford the compute to train at the frontier (>$50M per run) and have the talent density to push architecture, post-training, and systems simultaneously.
4.1 United States & Europe
4.2 China
Chinese labs have closed the frontier gap rapidly. DeepSeek-V3 (2024) matched GPT-4-class quality at a reported training cost of <$6M; DeepSeek-R1 (2025) matched o1-class reasoning. The ecosystem is structurally different from the US: US export controls restrict access to top-end GPUs, so labs lean on domestic accelerators (Ascend, Cambricon) and pursue aggressive efficiency research (MLA, FP8 training and quantization, mixture-of-experts routing innovations).
4.3 Open-Source Ecosystem
The open-source layer wraps around every frontier lab:
- Hugging Face: model hub, `transformers` library, datasets, Spaces; de-facto model distribution layer
- vLLM: PagedAttention-based inference server; de-facto production serving engine (Part X)
- SGLang: structured generation with fast prefix-caching; competitor to vLLM at scale
- LangChain / LlamaIndex: RAG pipelines and agent orchestration (Parts XVIII, XIX)
- Triton: Python DSL for GPU kernels; Inductor backend in `torch.compile` (Part XIV)
- PyTorch / JAX: training frameworks; PyTorch dominates research, JAX at Google
- Megatron-LM: NVIDIA's reference for 3D-parallel pre-training (Part XI)
- DeepSpeed: Microsoft's ZeRO optimizer and pipeline-parallel framework (Part XI)
4.4 Hardware Ecosystem
NVIDIA holds ~80% of AI training compute through the CUDA moat: NCCL collectives, cuBLAS/cuDNN kernel libraries, FlashAttention CUDA extensions, Megatron-LM, DeepSpeed, and all of torch.cuda are optimized for NVIDIA hardware first.
5. Job Map & Curriculum Priority
Different roles hire on very different subsets of this curriculum. The diagram below shows the primary career paths; the table maps each role to the curriculum parts that interviewers will actually quiz you on.
| Role | Must Master (★★★) | Important (★★) | Helpful (★) |
|---|---|---|---|
| Research Scientist | II, IV, VI, VII, VIII, IX, XVII, XX | I, III, V, XIV, XVI, XVIII, XXI, XXII | X, XI, XII, XIII, XV, XIX |
| Research Engineer | IV, VI, VII, IX, XX, XXII | I, II, VIII, X, XI, XIV, XVII, XXI | III, V, XII, XIII, XV, XVI, XVIII, XIX |
| ML Infra Engineer | VI, VII, X, XI, XII, XIII, XIV, XV, XXII | I, II, IV, VIII, IX, XIX, XXI | III, V, XVI, XVII, XVIII, XX |
| GPU / Kernel Engineer | VI, XII, XIII, XIV, XXII | II, IV, VII, VIII, X, XI, XV | I, III, V, IX, XVI, XVII, XVIII, XIX, XX, XXI |
| Applied AI / Product | IX, X, XVI, XVIII, XIX, XXI, XXII | I, III, IV, V, VI, VIII, XVII, XX | II, VII, XI, XII, XIII, XIV, XV |
| Safety / Alignment | IX, XVII, XX, XXII | I, II, IV, VI, VII, VIII, XXI | X, XI, XVI, XVIII, XIX |
6. Minimal Demo — Job-Role Skill Radar
Enter a role number (1–4) on stdin and the program prints a priority bar for all 22 curriculum parts. Try switching roles to see how different the study plans are: a Kernel Engineer (3) needs almost no RAG knowledge, while an Applied AI Engineer (4) barely needs CUDA depth.
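A minimal sketch of such a program is shown below. It assumes the star ratings from the table in §5 map to bar lengths of 3, 2, and 1, and that the parts are numbered 1 to 22 in place of the Roman numerals I to XXII. The names `ROLES`, `priority`, and `radar`, and the bar format, are illustrative choices rather than the original demo's code.

```python
import sys

PARTS = 22  # curriculum parts I..XXII, numbered 1..22 here

# role -> (name, must-master parts, important parts); all other parts are "helpful".
# Part numbers are transcribed from the table in section 5.
ROLES = {
    1: ("Research Scientist", {2, 4, 6, 7, 8, 9, 17, 20}, {1, 3, 5, 14, 16, 18, 21, 22}),
    2: ("ML Infra", {6, 7, 10, 11, 12, 13, 14, 15, 22}, {1, 2, 4, 8, 9, 19, 21}),
    3: ("GPU/Kernel", {6, 12, 13, 14, 22}, {2, 4, 7, 8, 10, 11, 15}),
    4: ("Applied AI", {9, 10, 16, 18, 19, 21, 22}, {1, 3, 4, 5, 6, 8, 17, 20}),
}

def priority(role: int, part: int) -> int:
    """Return 3 (must master), 2 (important), or 1 (helpful)."""
    _, must, important = ROLES[role]
    return 3 if part in must else 2 if part in important else 1

def radar(role: int) -> str:
    """Render one text bar per curriculum part for the given role."""
    lines = [ROLES[role][0]]
    for part in range(1, PARTS + 1):
        lines.append(f"Part {part:>2} " + "#" * priority(role, part))
    return "\n".join(lines)

if __name__ == "__main__":
    # Read the role number from stdin; fall back to role 1 on empty or invalid input.
    text = sys.stdin.readline().strip() if sys.stdin else ""
    role = int(text) if text in {"1", "2", "3", "4"} else 1
    print(radar(role))
```

Comparing `radar(3)` with `radar(4)` shows the contrast described above: Parts 18 and 19 get a single `#` for the kernel role, while Parts 12 to 14 get a single `#` for the applied role.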
1 Research Scientist 2 ML Infra 3 GPU/Kernel 4 Applied AI
7. Production & Source Pointers
Where to find the canonical implementations of the concepts introduced on this page:
- PyTorch `nn.Transformer`: `torch/nn/modules/transformer.py`. The canonical reference for multi-head attention + encoder/decoder.
- Hugging Face `transformers`: `src/transformers/models/`, one sub-directory per architecture (llama, gpt2, mistral, gemma, qwen2, …).
- vLLM engine: `vllm/engine/llm_engine.py`, the `LLMEngine` class; read to understand continuous-batching + PagedAttention glue.
- Megatron-LM: `megatron/core/tensor_parallel/`, column / row parallel linear layers as described in Narayanan et al. 2021.
- FlashAttention: `flash_attn/flash_attn_interface.py`; CUDA kernel in `csrc/flash_attn/`.
- nanoGPT: `model.py` (github.com/karpathy/nanoGPT), a 300-line GPT-2 you can read in one sitting.
- DeepSeek-V3: `inference/model.py` (github.com/deepseek-ai/DeepSeek-V3), MLA attention + MoE routing in ~800 lines.
8. References
Papers
- Vaswani et al. 2017 — Attention is All You Need. arXiv:1706.03762
- Dosovitskiy et al. 2020 — An Image is Worth 16×16 Words (ViT). arXiv:2010.11929
- Radford et al. 2021 — Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020
- Gulati et al. 2020 — Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv:2005.08100
- Chen et al. 2021 — Decision Transformer: Reinforcement Learning via Sequence Modeling. arXiv:2106.01345
- Jumper et al. 2021 — Highly accurate protein structure prediction with AlphaFold. Nature 596.
- Brown et al. 2020 — Language Models are Few-Shot Learners (GPT-3). arXiv:2005.14165
- Ouyang et al. 2022 — Training language models to follow instructions with human feedback (InstructGPT). arXiv:2203.02155
- DeepSeek-AI 2024 — DeepSeek-V3 Technical Report. arXiv:2412.19437
- DeepSeek-AI 2025 — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL. arXiv:2501.12948
- Shazeer et al. 2017 — Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer. arXiv:1701.06538
Lectures
- Stanford CS221 (Intro to AI) — Percy Liang — covers AI taxonomy and sub-field survey
- Stanford CS224n (NLP with Deep Learning) — unit 1: from classical NLP to transformers
- Stanford CS336 (Building LLMs from Scratch) — lecture 1: full LLM landscape overview
- MIT 6.S191 (Intro to Deep Learning) — lecture 1: DL ecosystem and industry landscape
- Berkeley CS294-196 (LLM Agents) — lecture 1: LLM landscape and agent paradigms
- Karpathy "State of GPT" (Microsoft Build 2023) — 90-minute pipeline walkthrough
- Oxford Deep Learning for NLP (Hilary term) — tutorial 1: overview of NLP sub-fields
Textbooks
- Russell & Norvig — Artificial Intelligence: A Modern Approach (4th ed.) — Part I: overview; Appendix: AI sub-fields
- Goodfellow, Bengio, Courville — Deep Learning (MIT Press 2016) — Chapter 1: Introduction and landscape
- Jurafsky & Martin — Speech and Language Processing (3rd ed. draft) — Chapter 1: NLP overview
Blog Posts & Reports
- AI Index 2024 Annual Report (Stanford HAI) — compute trends, lab landscape, benchmark progress
- Anthropic "Core Views on AI Safety" (anthropic.com) — mission and AGI-risk framing
- OpenAI Charter (openai.com/charter) — AGI definition used by OpenAI
- Epoch AI "Machine Learning Hardware" tracker — GPU capability and cost over time
- Semianalysis "The AI Chip Landscape" — deep dives on NVIDIA / AMD / custom silicon economics
Code & Repos
- github.com/huggingface/transformers — all major LLM architectures in one repo
- github.com/vllm-project/vllm — PagedAttention inference server
- github.com/NVIDIA/Megatron-LM — 3D-parallel pre-training reference
- github.com/karpathy/nanoGPT — 300-line GPT-2 reference implementation
- github.com/deepseek-ai/DeepSeek-V3 — MLA + MoE architecture in production
9. Interview Prep
These questions appear in ML breadth / system-design screens across frontier labs:
1. Draw the AI ⊇ ML ⊇ DL ⊇ LLM containment diagram. Where does RL sit?
Answer: RL sits inside ML (it is a learning paradigm), and overlaps DL when neural networks are used as value or policy functions. In the LLM context, RL appears outside the base pre-training loop — as a post-training step (RLHF, GRPO) that shifts the model's output distribution toward preferred outputs using a reward signal.
2. Compare symbolic AI, connectionism, and statistical ML on data hunger, interpretability, and generalization.
Answer:
Symbolic AI: no training data (hand-coded rules); fully interpretable (rule traces); brittle generalization (it handles only the cases its rules cover).
Connectionism (DL): data-hungry (millions+ examples); largely uninterpretable (mechanistic interp is open research); strong generalization at scale.
Statistical ML: moderate data need (thousands to millions); feature-level interpretability (coefficients, SHAP); best at small-data, structured regimes.
3. Name one architectural innovation each of these labs is primarily known for: OpenAI, Anthropic, DeepSeek, Meta AI.
Answer:
OpenAI: RLHF pipeline (InstructGPT 2022) and inference-time compute scaling (o1, 2024).
Anthropic: Constitutional AI (CAI, 2022), a self-critique and revision loop followed by RL from AI feedback (RLAIF); used in training the Claude model family (e.g. claude-sonnet-4-6).
DeepSeek: Multi-head Latent Attention (MLA, 2024) reducing KV-cache ~93%; MoE training at <$6M reported cost (V3 2024).
Meta AI: RoPE + GQA + SwiGLU + RMSNorm combination in LLaMA — now the industry-default open-weight architecture.
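To see why a ~93% KV-cache reduction matters, a back-of-the-envelope calculation helps. The sketch below applies the quoted figure to a hypothetical dense multi-head attention configuration; the layer count, head count, head dimension, and context length are illustrative assumptions, not DeepSeek's actual hyperparameters.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache size for one sequence under standard multi-head attention.

    The leading factor 2 counts the key tensor and the value tensor.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dense configuration in 16-bit precision (NOT DeepSeek's real settings).
baseline = kv_cache_bytes(n_layers=60, n_kv_heads=64, head_dim=128, seq_len=32_768)
reduced = baseline * (1 - 0.93)  # apply the ~93% reduction quoted above

GIB = 1024 ** 3
print(f"baseline:             {baseline / GIB:.1f} GiB per sequence")
print(f"after ~93% reduction: {reduced / GIB:.1f} GiB per sequence")
```

Under these assumptions a single 32K-token sequence drops from about 60 GiB of cache to about 4 GiB, which directly raises the number of concurrent sequences a serving node can hold.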
4. Why does NVIDIA hold ~80% of AI training compute despite AMD having competitive hardware specs?
Answer: The CUDA software moat. NCCL collectives, cuBLAS/cuDNN kernels, FlashAttention CUDA extensions, Megatron-LM, DeepSpeed, and all of torch.cuda are optimized for and validated on NVIDIA hardware first. AMD's ROCm has improved but lags by 1–2 years of kernel-level optimization. Switching cost = revalidating every collective primitive, mixed-precision path, and third-party kernel.
5. What hardware advantage does Groq have over NVIDIA H100 for LLM inference, and what is the trade-off?
Answer: Groq's LPU uses SRAM only (no HBM), eliminating the memory-bandwidth bottleneck of autoregressive decode. Token latency is <1 ms vs ~10 ms on H100 for a 70B model. The trade-off: SRAM capacity is tiny (~220 MB per chip), so large models require many chips in a pipeline — increasing per-token cost and limiting batch size compared to an H100 with 80 GB HBM.
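The capacity trade-off can be made concrete with simple arithmetic. The sketch below counts only the chips needed to hold the weights of a 70B-parameter model, using the ~220 MB and 80 GB figures quoted above; it ignores KV cache, activations, and any replication, so real deployments would need more. The helper name `chips_needed` is illustrative.

```python
import math

def chips_needed(n_params, bytes_per_param, memory_bytes_per_chip):
    """Minimum number of chips required just to hold the model weights."""
    return math.ceil(n_params * bytes_per_param / memory_bytes_per_chip)

SRAM_PER_CHIP = 220e6  # ~220 MB of SRAM per chip, the figure quoted above
H100_HBM = 80e9        # 80 GB of HBM per H100
N_PARAMS = 70e9        # 70B-parameter model

for bytes_per_param, label in [(2, "16-bit"), (1, "8-bit")]:
    weights_gb = N_PARAMS * bytes_per_param / 1e9
    sram_chips = chips_needed(N_PARAMS, bytes_per_param, SRAM_PER_CHIP)
    gpus = chips_needed(N_PARAMS, bytes_per_param, H100_HBM)
    print(f"{label}: {weights_gb:.0f} GB of weights -> {sram_chips} SRAM chips vs {gpus} H100(s)")
```

Hundreds of SRAM-only chips against one or two GPUs is the source of the higher per-token cost mentioned in the answer.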
6. Explain the difference between NVLink and InfiniBand. When would you choose each?
Answer: NVLink / NVSwitch is NVIDIA's proprietary intra-node interconnect (up to 8 GPUs per node, 900 GB/s per GPU on H100-generation hardware). InfiniBand NDR is an inter-node fabric (400 Gbps per port, RDMA-capable) connecting thousands of nodes in a training cluster. Use NVLink for tensor-parallel splits within a single node, where collectives run on every layer and need the highest bandwidth; use InfiniBand for data-parallel all-reduce and pipeline-parallel traffic across nodes.
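The bandwidth gap is easier to remember once the units are aligned, since NVLink is quoted in gigabytes per second and InfiniBand in gigabits per second. The sketch below is an idealized estimate that ignores latency, protocol overhead, and collective-algorithm effects; the 1 GB payload is a hypothetical example.

```python
def transfer_ms(payload_bytes, bandwidth_bytes_per_s):
    """Idealized time in milliseconds to move a payload over a link."""
    return payload_bytes / bandwidth_bytes_per_s * 1e3

NVLINK = 900e9          # 900 GB/s, the NVLink figure quoted above
INFINIBAND = 400e9 / 8  # 400 Gbps converted from bits to bytes: 50 GB/s

payload = 1e9  # hypothetical 1 GB tensor shard
print(f"NVLink:     {transfer_ms(payload, NVLINK):.2f} ms")
print(f"InfiniBand: {transfer_ms(payload, INFINIBAND):.2f} ms")
print(f"ratio:      {NVLINK / INFINIBAND:.0f}x")
```

The roughly 18x gap is why the most communication-heavy parallelism dimension is kept inside a node.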
7. A candidate says they want to "work on LLMs." What clarifying questions determine which role they are suited for?
Answer (key questions):
- Do you write papers or ship systems? → Research Scientist vs Research Engineer.
- Have you written CUDA kernels or profiled GPU occupancy? → Kernel Engineer vs ML Engineer.
- Have you set up distributed training across multiple nodes? → ML Infra vs Applied AI.
- Do you use LLMs as a tool to build products, or do you train/fine-tune them? → Applied vs Research.
- What excites you more: a 20% accuracy gain or a 5× throughput gain? → Research vs Systems orientation.