30. AI Collectives & NCCL
Collective primitives, ring AllReduce math, NCCL channels and transports, LL protocols, and library choices.
1. Where Collectives Sit in Training
Distributed training turns every backward pass into a network problem. Each GPU computes local gradients, buckets them, and calls a collective so all ranks can apply the same optimizer update or exchange tensor shards.
2. Collective Communication Primitives
A collective is defined by ownership before and after the operation: one rank may own the input, every rank may own a shard, or every rank may need the final reduced tensor.
AllReduce is the default gradient synchronization primitive. In a ring implementation each GPU sends 2*(N-1)/N*S bytes for an S-byte tensor.
Ring AllReduce is easier to understand as two collectives: ReduceScatter first creates one reduced shard per rank, then AllGather circulates those reduced shards until every rank has the whole tensor.
AllToAll is the communication shape behind mixture-of-experts token dispatch: rank isends a distinct payload to rank j, creating an N-by-N traffic matrix.
Minimal C Demo - Collective Cost Visualizer
3. Ring AllReduce Algorithm
The ring arranges ranks so GPU i always sends to (i+1)%N and receives from (i-1+N)%N. The tensor is split into N chunks so every hop carries one chunk per step.
During ReduceScatter, each received chunk is accumulated with the local copy and forwarded on later steps. After N-1 steps, every rank owns exactly one fully reduced shard.
During AllGather, the reduced shards circulate without further arithmetic. After another N-1 steps, each rank has all shards.
Ring AllReduce is bandwidth-efficient for large tensors because per-GPU transfer approaches 2S, but its 2(N-1) latency steps make trees better for small messages.
Background - Why Ring Wins Large Gradients
Gradient buckets are usually large enough that link bandwidth, not per-hop latency, dominates runtime. The ring keeps every rank sending and receiving in every step, so no single root becomes a bandwidth bottleneck.
Plan
- Split the tensor into one chunk per rank.
- Run
N-1ReduceScatter steps, accumulating received chunks. - Run
N-1AllGather steps, forwarding reduced shards. - Choose ring for bandwidth-bound large buckets and tree for latency-bound small buckets.
Minimal C Demo - Ring AllReduce Steps
Minimal C Demo - Ring vs Tree Crossover
4. NCCL Architecture
NCCL builds a communicator for a fixed set of ranks, discovers topology, creates multiple channels, and assigns a ring, tree, or CollNet/NVLS path per channel. Channels let independent chunks move in parallel.
A call such as ncclAllReduce is asynchronous with respect to the host. NCCL enqueues CUDA work and transport operations onto a stream; completion is observed when that stream is synchronized.
| Control | Purpose | Typical use |
|---|---|---|
NCCL_ALGO | Force ring, tree, CollNet, or NVLS choices. | Debug algorithm selection or test small-message tree behavior. |
NCCL_PROTO | Choose LL, LL128, or Simple protocol. | Isolate latency or bandwidth protocol effects. |
NCCL_NCHANNELS | Control parallel channel count. | Tune bandwidth utilization or reduce GPU scheduling overhead. |
NCCL_DEBUG=INFO | Print topology, rings, trees, and transport selection. | First tool when collectives are slower than expected. |
Minimal C Demo - NCCL Channel Model
5. NCCL Protocols: LL, LL128, Simple
NCCL protocol selection trades startup latency against payload efficiency. LL uses tiny flag-data units for fast detection; LL128 packs flags into cache lines; Simple moves larger chunks for peak bandwidth.
The low-latency protocols rely on polling visible flags rather than interrupting the receiver. That is why they can be very fast for small tensors but waste more bandwidth than Simple.
6. MSCCL, RCCL, Gloo, and UCX
NCCL is the usual GPU collective backend on NVIDIA systems, but the ecosystem splits by hardware and flexibility: MSCCL programs custom schedules, RCCL targets AMD ROCm, Gloo covers CPU tensors, and UCX abstracts transports.
| Library | Best fit | Important detail |
|---|---|---|
| NCCL | NVIDIA GPU collectives | Topology-aware rings, trees, channels, NVLink, RoCE, InfiniBand, TCP fallback. |
| MSCCL | Hand-optimized algorithms | Custom XML schedules compiled/interpreted into GPU communication kernels. |
| RCCL | AMD GPU collectives | NCCL-compatible API over ROCm, xGMI/Infinity Fabric, and RoCE. |
| Gloo | CPU tensors and fallback | Useful when NCCL is unavailable; not the fast path for GPU gradient buckets. |
| UCX | Transport abstraction | Common in MPI and network plugins; exposes SHM, TCP, InfiniBand, RoCE capabilities. |
7. Source Pointers
src/collectives/device/all_reduce.cu- NCCL device-side AllReduce kernels.src/enqueue.cc- NCCL operation enqueue and grouping path.src/graph/topo.cc- topology graph construction and path scoring.src/transport/- P2P, shared-memory, and network transport implementations.torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp- PyTorch's NCCL process group integration.
8. Interview Prep
| Question | Concise answer |
|---|---|
| Why is AllReduce central to data-parallel training? | Each rank computes local gradients; AllReduce produces the same global gradient on every rank before the optimizer step. |
| Why does ring AllReduce send about 2S bytes per GPU? | ReduceScatter sends (N-1)/N*S and AllGather sends the same amount, totaling 2*(N-1)/N*S. |
| When does tree beat ring? | Small messages or large rank counts where latency steps dominate; tree uses about 2log2(N) steps instead of 2(N-1). |
| What are NCCL channels? | Parallel communication lanes; each channel owns part of the tensor and can use a separate ring/tree to improve link utilization. |
| What is the difference between LL and Simple? | LL uses flag-data polling units for low latency and lower efficiency; Simple uses larger chunks for high peak bandwidth. |