Tech Notes

1. Fabric Layers in a Training Cluster

AI training networks are a hierarchy. NVLink solves the intra-node tensor-parallel hot path, PCIe and NIC affinity decide whether data reaches the wire efficiently, and the scale-out Ethernet or InfiniBand-like fabric carries data-parallel, expert-parallel, and pipeline traffic across nodes.

2. NVLink and NVSwitch

NVLink is the GPU-to-GPU high-bandwidth path; NVSwitch turns those links into a non-blocking all-to-all fabric inside DGX/HGX systems. Hopper-class systems expose roughly 900 GB/s bidirectional bandwidth per GPU across the NVLink domain.

Generation	GPU era	Per-GPU bidirectional bandwidth	Fabric point
NVLink 1	Pascal / P100	160 GB/s	Direct GPU peer links.
NVLink 2	Volta / V100	300 GB/s	NVSwitch appears in DGX-2 class systems.
NVLink 3	Ampere / A100	600 GB/s	Common 8-GPU NVSwitch baseboard topology.
NVLink 4	Hopper / H100	900 GB/s	NVLink SHARP enables in-switch reductions.

NVLink SHARP, selected by NCCL as NVLS on supported systems, lets the switch pipeline perform reductions before broadcasting results back to GPUs. This attacks the largest AllReduce cost: repeated movement of the same partial sums.

Minimal C Demo - NVSwitch AllReduce Model

3. GPUDirect RDMA and Storage

GPUDirect removes bounce buffers. With RDMA, the RNIC DMA-reads or writes GPU memory through peer mappings; with Storage, NVMe data moves through the cuFile path into GPU buffers. The critical dependency is that PCIe topology, IOMMU configuration, and kernel modules allow peer-to-peer transactions.

nvidia-peermem exposes GPU memory registration to RDMA-capable NICs.
ibv_reg_mr() can register a GPU buffer when the peer memory path is working.
cuFileRead() and cuFileWrite() drive GPUDirect Storage.
NUMA placement still matters: a GPU behind one PCIe root complex and a NIC behind another can lose much of the expected gain.

Minimal C Demo - GPUDirect Copy Savings

4. Rail-Optimized RoCE and Spectrum-X

In a rail-optimized Clos, each GPU has a dedicated NIC, and the NICs with the same local index on every node connect to the same rail (the same leaf switch group). Data-parallel AllReduce among the GPU0s can stay on rail 0, among the GPU1s on rail 1, and so on. Expert-parallel AllToAll breaks that neat shape because any GPU may send to any other, which drives cross-rail bisection demand.

Spectrum-X combines Spectrum switches, BlueField or ConnectX SuperNICs, RoCE congestion control, and per-packet adaptive routing with reordering at the receiving NIC. The core idea is to stop large training flows from getting stuck on unlucky ECMP hashes.

Background - Why Rails Match Rings

NCCL commonly builds rings by local GPU index. If every node's GPU0 NIC is wired to the same rail, the ring across GPU0 ranks can consume one leaf group without bouncing through unrelated rails. That saves spine bandwidth for traffic patterns that truly need all-to-all reach.

Minimal C Demo - Rail Traffic Split

5. Meta RoCE Backend and AWS EFA

Large Ethernet AI fabrics follow two broad philosophies. RoCE deployments such as Meta-style backend fabrics use RDMA verbs, DCQCN, ECN, careful PFC design, and switch OS integration to keep queues controlled. AWS EFA uses SRD (Scalable Reliable Datagram), a reliable datagram transport that stripes packets across many paths, resequences them, and retransmits lost packets without requiring lossless Ethernet.

Fabric	Transport idea	Operational trade-off
RoCE v2 backend	RDMA over UDP/IP with ECN/DCQCN and often PFC.	Very fast, but lossless behavior and pause storms must be engineered carefully.
Meta-style RoCE	Custom NCCL net plugin over verbs, switch telemetry, FBOSS-driven operations.	Open Ethernet control, but congestion management is a first-class platform problem.
AWS EFA / SRD	Reliable datagram with multipath, out-of-order delivery, resequencing, and retransmit.	Avoids PFC dependency, but requires EFA-aware MPI/NCCL plugins.

6. AI Traffic Patterns: DP, TP, PP, and EP

The parallelism strategy determines the fabric bottleneck. Tensor parallelism is a per-layer, latency-sensitive exchange and belongs inside NVLink. Data parallelism produces large gradient AllReduce traffic across nodes. Pipeline parallelism sends activations between stages. Expert parallelism creates bursty AllToAll token exchange.

FSDP changes the communication schedule. DDP keeps full parameters on every rank and synchronizes gradient buckets with AllReduce; FSDP shards parameters, gradients, and optimizer state, then performs layer-wise AllGather and ReduceScatter, accepting more network traffic in exchange for much lower memory pressure.

Minimal C Demo - Parallelism Strategy Explorer

Minimal C Demo - AllReduce Bandwidth Demand

7. HPCC, Swift, and Parameter Disaggregation

DCQCN infers congestion from ECN marks that the receiver echoes back as CNPs; HPCC uses in-band network telemetry (INT) so the sender can compute the bottleneck state almost directly; Swift compares measured delay against a target delay to keep queues shallow without per-packet switch telemetry. All three exist because synchronized training turns small congestion-control errors into visible step-time tail latency.

Control loop	Feedback	Queue behavior	PFC dependency
DCQCN	ECN marks plus CNP feedback.	Good for RoCE, but queues can still build during incast.	Often deployed with PFC.
HPCC	INT queue, rate, and timestamp telemetry.	Targets near-zero queues with precise rate control.	No inherent PFC requirement.
Swift	End-to-end fabric delay.	Keeps RTT near a configured target delay.	No PFC requirement.

Parameter disaggregation separates where parameters live from where GPU compute runs. It can help heterogeneous clusters, but it shifts the burden to high-bisection RDMA paths between workers and parameter storage or servers.

8. Source Pointers

NCCL_ALGO=NVLS - request the NVLink SHARP algorithm on supported HGX systems; NCCL_DEBUG=INFO logging helps confirm whether NVLS is actually in use.
nvidia-smi topo -m - inspect GPU, NIC, PCIe, NUMA, and NVLink locality.
ibv_devinfo and rdma link - confirm RDMA devices and link state.
aws-ofi-nccl - AWS EFA NCCL transport plugin for SRD-backed collectives.
cuFile samples - validate GPUDirect Storage path and fallback behavior.

9. Interview Prep

Question	Concise answer
What does NVSwitch add beyond NVLink?	A non-blocking GPU fabric inside the node, plus SHARP/NVLS reductions on supported generations.
What copy does GPUDirect RDMA eliminate?	The GPU-to-host bounce copy; the NIC can DMA directly to or from GPU memory.
Why does rail optimization help AllReduce?	Same-index GPU rings stay on same-index NIC rails, reducing cross-rail spine traffic.
How is EFA/SRD different from RoCE?	SRD allows multipath out-of-order delivery with resequencing and retransmission, so it does not need a lossless PFC fabric.
Where should TP, PP, DP, and EP be placed?	TP inside NVLink, PP across nearby stages, DP across the scale-out fabric, and EP where all-to-all bisection is strongest.
How does HPCC avoid large queues?	Switches add INT telemetry and the sender adjusts rate from measured bottleneck state instead of waiting for coarse loss or delay signals.

31. AI Training Fabrics
