Tech Notes

1. Where Collectives Sit in Training

Distributed training turns every backward pass into a network problem. Each GPU computes local gradients, buckets them, and calls a collective so all ranks can apply the same optimizer update or exchange tensor shards.

2. Collective Communication Primitives

A collective is defined by ownership before and after the operation: one rank may own the input, every rank may own a shard, or every rank may need the final reduced tensor.

AllReduce is the default gradient synchronization primitive. In a ring implementation over N ranks, each GPU sends 2*(N-1)/N*S bytes for an S-byte tensor, which approaches 2S as N grows.

Ring AllReduce is easier to understand as two collectives: ReduceScatter first creates one reduced shard per rank, then AllGather circulates those reduced shards until every rank has the whole tensor.

AllToAll is the communication shape behind mixture-of-experts token dispatch: rank i sends a distinct payload to rank j, creating an N-by-N traffic matrix.

Minimal C Demo - Collective Cost Visualizer

3. Ring AllReduce Algorithm

The ring arranges ranks so GPU i always sends to (i+1)%N and receives from (i-1+N)%N. The tensor is split into N chunks so every hop carries one chunk per step.

During ReduceScatter, each received chunk is accumulated with the local copy and forwarded on later steps. After N-1 steps, every rank owns exactly one fully reduced shard.

During AllGather, the reduced shards circulate without further arithmetic. After another N-1 steps, each rank has all shards.

Ring AllReduce is bandwidth-efficient for large tensors because per-GPU transfer approaches 2S, but its 2(N-1) latency steps make trees better for small messages.

Background - Why Ring Wins Large Gradients

Gradient buckets are usually large enough that link bandwidth, not per-hop latency, dominates runtime. The ring keeps every rank sending and receiving in every step, so no single root becomes a bandwidth bottleneck.

Plan

Split the tensor into one chunk per rank.
Run N-1 ReduceScatter steps, accumulating received chunks.
Run N-1 AllGather steps, forwarding reduced shards.
Choose ring for bandwidth-bound large buckets and tree for latency-bound small buckets.

Minimal C Demo - Ring AllReduce Steps

Minimal C Demo - Ring vs Tree Crossover

4. NCCL Architecture

NCCL builds a communicator for a fixed set of ranks, discovers topology, creates multiple channels, and assigns a ring, tree, or CollNet/NVLS path per channel. Channels let independent chunks move in parallel.

A call such as ncclAllReduce is asynchronous with respect to the host. NCCL enqueues CUDA work and transport operations onto a stream; the host observes completion only after synchronizing that stream, for example with cudaStreamSynchronize.

Control	Purpose	Typical use
`NCCL_ALGO`	Force ring, tree, CollNet, or NVLS choices.	Debug algorithm selection or test small-message tree behavior.
`NCCL_PROTO`	Choose LL, LL128, or Simple protocol.	Isolate latency or bandwidth protocol effects.
`NCCL_MIN_NCHANNELS` / `NCCL_MAX_NCHANNELS`	Bound the parallel channel count.	Tune bandwidth utilization or reduce GPU scheduling overhead.
`NCCL_DEBUG=INFO`	Print topology, rings, trees, and transport selection.	First tool when collectives are slower than expected.

Minimal C Demo - NCCL Channel Model

5. NCCL Protocols: LL, LL128, Simple

NCCL protocol selection trades startup latency against payload efficiency. LL uses tiny flag-data units for fast detection; LL128 packs flags into cache lines; Simple moves larger chunks for peak bandwidth.

The low-latency protocols rely on polling visible flags rather than interrupting the receiver. That is why they can be very fast for small tensors but waste more bandwidth than Simple.

6. MSCCL, RCCL, Gloo, and UCX

NCCL is the usual GPU collective backend on NVIDIA systems, but the ecosystem splits by hardware and flexibility: MSCCL programs custom schedules, RCCL targets AMD ROCm, Gloo covers CPU tensors, and UCX abstracts transports.

Library	Best fit	Important detail
NCCL	NVIDIA GPU collectives	Topology-aware rings, trees, channels, NVLink, RoCE, InfiniBand, TCP fallback.
MSCCL	Hand-optimized algorithms	Custom XML schedules compiled/interpreted into GPU communication kernels.
RCCL	AMD GPU collectives	NCCL-compatible API over ROCm, xGMI/Infinity Fabric, and RoCE.
Gloo	CPU tensors and fallback	Useful when NCCL is unavailable; not the fast path for GPU gradient buckets.
UCX	Transport abstraction	Common in MPI and network plugins; exposes SHM, TCP, InfiniBand, RoCE capabilities.

7. Source Pointers

src/collectives/device/all_reduce.cu - NCCL device-side AllReduce kernels.
src/enqueue.cc - NCCL operation enqueue and grouping path.
src/graph/topo.cc - topology graph construction and path scoring.
src/transport/ - P2P, shared-memory, and network transport implementations.
torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp - PyTorch's NCCL process group integration.

8. Interview Prep

Question	Concise answer
Why is AllReduce central to data-parallel training?	Each rank computes local gradients; AllReduce produces the same global gradient on every rank before the optimizer step.
Why does ring AllReduce send about 2S bytes per GPU?	ReduceScatter sends `(N-1)/N*S` and AllGather sends the same amount, totaling `2*(N-1)/N*S`.
When does tree beat ring?	Small messages or large rank counts where latency steps dominate; tree uses about `2*log2(N)` steps instead of `2*(N-1)`.
What are NCCL channels?	Parallel communication lanes; each channel owns part of the tensor and can use a separate ring/tree to improve link utilization.
What is the difference between LL and Simple?	LL uses flag-data polling units for low latency and lower efficiency; Simple uses larger chunks for high peak bandwidth.

30. AI Collectives & NCCL
