Part XXIII - AI Training Networks

30. AI Collectives & NCCL

Collective primitives, ring AllReduce math, NCCL channels and transports, LL protocols, and library choices.

1. Where Collectives Sit in Training

Distributed training turns every backward pass into a network problem. Each GPU computes local gradients, buckets them, and calls a collective so all ranks can apply the same optimizer update or exchange tensor shards.

2. Collective Communication Primitives

A collective is defined by ownership before and after the operation: one rank may own the input, every rank may own a shard, or every rank may need the final reduced tensor.

AllReduce is the default gradient synchronization primitive. In a ring implementation each GPU sends 2*(N-1)/N*S bytes for an S-byte tensor.

Ring AllReduce is easier to understand as two collectives: ReduceScatter first creates one reduced shard per rank, then AllGather circulates those reduced shards until every rank has the whole tensor.

AllToAll is the communication shape behind mixture-of-experts token dispatch: rank isends a distinct payload to rank j, creating an N-by-N traffic matrix.

Minimal C Demo - Collective Cost Visualizer

Collective Primitive Cost Model — C Demo
stdin (optional)

3. Ring AllReduce Algorithm

The ring arranges ranks so GPU i always sends to (i+1)%N and receives from (i-1+N)%N. The tensor is split into N chunks so every hop carries one chunk per step.

During ReduceScatter, each received chunk is accumulated with the local copy and forwarded on later steps. After N-1 steps, every rank owns exactly one fully reduced shard.

During AllGather, the reduced shards circulate without further arithmetic. After another N-1 steps, each rank has all shards.

Ring AllReduce is bandwidth-efficient for large tensors because per-GPU transfer approaches 2S, but its 2(N-1) latency steps make trees better for small messages.

Background - Why Ring Wins Large Gradients

Gradient buckets are usually large enough that link bandwidth, not per-hop latency, dominates runtime. The ring keeps every rank sending and receiving in every step, so no single root becomes a bandwidth bottleneck.

Plan

  1. Split the tensor into one chunk per rank.
  2. Run N-1 ReduceScatter steps, accumulating received chunks.
  3. Run N-1 AllGather steps, forwarding reduced shards.
  4. Choose ring for bandwidth-bound large buckets and tree for latency-bound small buckets.

Minimal C Demo - Ring AllReduce Steps

Ring AllReduce Step Model — C Demo
stdin (optional)

Minimal C Demo - Ring vs Tree Crossover

Ring vs Tree Latency Model — C Demo
stdin (optional)

4. NCCL Architecture

NCCL builds a communicator for a fixed set of ranks, discovers topology, creates multiple channels, and assigns a ring, tree, or CollNet/NVLS path per channel. Channels let independent chunks move in parallel.

A call such as ncclAllReduce is asynchronous with respect to the host. NCCL enqueues CUDA work and transport operations onto a stream; completion is observed when that stream is synchronized.

ControlPurposeTypical use
NCCL_ALGOForce ring, tree, CollNet, or NVLS choices.Debug algorithm selection or test small-message tree behavior.
NCCL_PROTOChoose LL, LL128, or Simple protocol.Isolate latency or bandwidth protocol effects.
NCCL_NCHANNELSControl parallel channel count.Tune bandwidth utilization or reduce GPU scheduling overhead.
NCCL_DEBUG=INFOPrint topology, rings, trees, and transport selection.First tool when collectives are slower than expected.

Minimal C Demo - NCCL Channel Model

NCCL Channel Bandwidth Model — C Demo
stdin (optional)

5. NCCL Protocols: LL, LL128, Simple

NCCL protocol selection trades startup latency against payload efficiency. LL uses tiny flag-data units for fast detection; LL128 packs flags into cache lines; Simple moves larger chunks for peak bandwidth.

The low-latency protocols rely on polling visible flags rather than interrupting the receiver. That is why they can be very fast for small tensors but waste more bandwidth than Simple.

6. MSCCL, RCCL, Gloo, and UCX

NCCL is the usual GPU collective backend on NVIDIA systems, but the ecosystem splits by hardware and flexibility: MSCCL programs custom schedules, RCCL targets AMD ROCm, Gloo covers CPU tensors, and UCX abstracts transports.

LibraryBest fitImportant detail
NCCLNVIDIA GPU collectivesTopology-aware rings, trees, channels, NVLink, RoCE, InfiniBand, TCP fallback.
MSCCLHand-optimized algorithmsCustom XML schedules compiled/interpreted into GPU communication kernels.
RCCLAMD GPU collectivesNCCL-compatible API over ROCm, xGMI/Infinity Fabric, and RoCE.
GlooCPU tensors and fallbackUseful when NCCL is unavailable; not the fast path for GPU gradient buckets.
UCXTransport abstractionCommon in MPI and network plugins; exposes SHM, TCP, InfiniBand, RoCE capabilities.

7. Source Pointers

  • src/collectives/device/all_reduce.cu - NCCL device-side AllReduce kernels.
  • src/enqueue.cc - NCCL operation enqueue and grouping path.
  • src/graph/topo.cc - topology graph construction and path scoring.
  • src/transport/ - P2P, shared-memory, and network transport implementations.
  • torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp - PyTorch's NCCL process group integration.

8. Interview Prep

QuestionConcise answer
Why is AllReduce central to data-parallel training?Each rank computes local gradients; AllReduce produces the same global gradient on every rank before the optimizer step.
Why does ring AllReduce send about 2S bytes per GPU?ReduceScatter sends (N-1)/N*S and AllGather sends the same amount, totaling 2*(N-1)/N*S.
When does tree beat ring?Small messages or large rank counts where latency steps dominate; tree uses about 2log2(N) steps instead of 2(N-1).
What are NCCL channels?Parallel communication lanes; each channel owns part of the tensor and can use a separate ring/tree to improve link utilization.
What is the difference between LL and Simple?LL uses flag-data polling units for low latency and lower efficiency; Simple uses larger chunks for high peak bandwidth.