Part XXIII - AI Training Networks

31. AI Training Fabrics

NVLink and NVSwitch domains, GPUDirect paths, rail-optimized RoCE, EFA/SRD, parallelism traffic, and modern datacenter congestion control.

1. Fabric Layers in a Training Cluster

AI training networks are a hierarchy. NVLink solves the intra-node tensor-parallel hot path, PCIe and NIC affinity decide whether data reaches the wire efficiently, and the scale-out Ethernet or InfiniBand-like fabric carries data-parallel, expert-parallel, and pipeline traffic across nodes.

3. GPUDirect RDMA and Storage

GPUDirect removes host bounce buffers from the data path. With GPUDirect RDMA, the RNIC DMA-reads and DMA-writes GPU memory through peer mappings; with GPUDirect Storage, NVMe data moves through the cuFile path directly into GPU buffers. Both depend on PCIe topology, IOMMU configuration, and kernel modules that allow peer-to-peer transactions.

  • nvidia-peermem exposes GPU memory registration to RDMA-capable NICs.
  • ibv_reg_mr() can register a GPU buffer when the peer memory path is working.
  • cuFileRead() and cuFileWrite() drive GPUDirect Storage.
  • NUMA placement still matters: a GPU behind one PCIe root complex and a NIC behind another can lose much of the expected gain.

Minimal C Demo - GPUDirect Copy Savings

4. Rail-Optimized RoCE and Spectrum-X

In a rail-optimized Clos, each GPU rank uses a dedicated NIC and every same-rank NIC across nodes lands on the same rail. Data-parallel AllReduce for GPU0 can stay on rail 0, GPU1 on rail 1, and so on. Expert-parallel AllToAll breaks that neat shape and drives cross-rail bisection demand.

Spectrum-X combines Spectrum switches, ConnectX NICs, RoCE congestion control, adaptive routing, and SHARP-style aggregation. The core idea is to stop large training flows from getting stuck on unlucky ECMP hashes.

Background - Why Rails Match Rings

NCCL commonly builds rings by local GPU index. If every node's GPU0 NIC is wired to the same rail, the ring across GPU0 ranks can consume one leaf group without bouncing through unrelated rails. That saves spine bandwidth for traffic patterns that truly need all-to-all reach.

Minimal C Demo - Rail Traffic Split

5. Meta RoCE Backend and AWS EFA

Large Ethernet AI fabrics split into two broad philosophies. RoCE deployments such as Meta-style backend fabrics use RDMA verbs, DCQCN, ECN, careful PFC design, and switch OS integration to keep queues controlled. AWS EFA uses SRD, a reliable datagram transport that stripes packets across many paths, resequences them, and retransmits loss without requiring lossless Ethernet.

Fabric | Transport idea | Operational trade-off
RoCE v2 backend | RDMA over UDP/IP with ECN/DCQCN and often PFC. | Very fast, but lossless behavior and pause storms must be engineered carefully.
Meta-style RoCE | Custom NCCL net plugin over verbs, switch telemetry, FBOSS-driven operations. | Open Ethernet control, but congestion management is a first-class platform problem.
AWS EFA / SRD | Reliable datagram with multipath, out-of-order delivery, resequencing, and retransmit. | Avoids PFC dependency, but requires EFA-aware MPI/NCCL plugins.

6. AI Traffic Patterns: DP, TP, PP, and EP

The parallelism strategy determines the fabric bottleneck. Tensor parallelism is a per-layer, latency-sensitive exchange and belongs inside NVLink. Data parallelism produces large gradient AllReduce traffic across nodes. Pipeline parallelism sends activations between stages. Expert parallelism creates bursty AllToAll token exchange.

FSDP changes the communication schedule. DDP keeps full parameters on every rank and synchronizes gradient buckets; FSDP shards parameters, gradients, and optimizer state, then performs layer-wise AllGather and ReduceScatter, trading extra network traffic for much lower memory pressure.

Minimal C Demo - Parallelism Strategy Explorer

Minimal C Demo - AllReduce Bandwidth Demand

7. HPCC, Swift, and Parameter Disaggregation

DCQCN infers congestion from ECN marks; HPCC uses INT telemetry so the sender can compute the bottleneck state almost directly; Swift compares measured fabric RTT against a target delay to keep queues shallow without per-packet telemetry. All three exist because synchronized training turns small congestion-control errors into visible step-time tail latency.

Control loop | Feedback | Queue behavior | PFC dependency
DCQCN | ECN marks plus CNP feedback. | Good for RoCE, but queues can still build during incast. | Often deployed with PFC.
HPCC | INT queue, rate, and timestamp telemetry. | Targets near-zero queues with precise rate control. | No inherent PFC requirement.
Swift | End-to-end fabric delay. | Keeps RTT near a configured target delay. | No PFC requirement.

Parameter disaggregation separates where parameters live from where GPU compute runs. It can help heterogeneous clusters, but it shifts the burden to high-bisection RDMA paths between workers and parameter storage or servers.

8. Source Pointers

  • NCCL_ALGO=NVLS - restrict NCCL to the NVLink SHARP algorithm on supported HGX systems; NCCL_DEBUG=INFO output helps confirm that NVLS is actually in use.
  • nvidia-smi topo -m - inspect GPU, NIC, PCIe, NUMA, and NVLink locality.
  • ibv_devinfo and rdma link - confirm RDMA devices and link state.
  • aws-ofi-nccl - AWS EFA NCCL transport plugin for SRD-backed collectives.
  • cuFile samples - validate GPUDirect Storage path and fallback behavior.

9. Interview Prep

Question | Concise answer
What does NVSwitch add beyond NVLink? | A non-blocking GPU fabric inside the node, plus SHARP/NVLS reductions on supported generations.
What copy does GPUDirect RDMA eliminate? | The GPU-to-host bounce copy; the NIC can DMA directly to or from GPU memory.
Why does rail optimization help AllReduce? | Same-index GPU rings stay on same-index NIC rails, reducing cross-rail spine traffic.
How is EFA/SRD different from RoCE? | SRD allows multipath out-of-order delivery with resequencing and retransmission, so it does not need a lossless PFC fabric.
Where should TP, PP, DP, and EP be placed? | TP inside NVLink, PP across nearby stages, DP across the scale-out fabric, and EP where all-to-all bisection is strongest.
How does HPCC avoid large queues? | Switches add INT telemetry and the sender adjusts rate from measured bottleneck state instead of waiting for coarse loss or delay signals.