Part XXII - RDMA

29. RoCE and Lossless Data Centers

RoCE framing, iWARP tradeoffs, PFC headroom, DCQCN congestion control, and RDMA storage flows.

1. RoCE v1 and RoCE v2 Architecture

RoCE carries InfiniBand transport semantics over Ethernet. RoCE v1 uses its own EtherType (0x8915) directly over Ethernet, so it stays inside one L2 broadcast domain; RoCE v2 encapsulates the same transport headers in IP and UDP (destination port 4791), so packets can cross a routed spine-leaf fabric.

| Field | Meaning | Why operators care |
|---|---|---|
| BTH | Opcode, QPN, PSN, solicited event bit. | Identifies QP and orders/retries packets for RC. |
| RETH | Remote virtual address, rkey, length. | Present on WRITE/READ and authorizes remote memory access. |
| UDP source port | Hash-derived entropy for RoCE v2. | Lets ECMP spread flows while destination stays 4791. |
| IP ECN bits | ECT and CE marks. | Switches signal congestion to DCQCN without dropping packets. |

Minimal C Demo - RoCE Frame Builder


2. iWARP: RDMA over TCP

iWARP keeps the verbs model but runs over TCP. RDMAP expresses RDMA operations, DDP places data directly into buffers, and MPA preserves message framing over TCP byte streams.

| Choice | Fabric requirement | Latency/CPU profile | Typical fit |
|---|---|---|---|
| RoCE v2 | Lossless or near-lossless Ethernet with PFC and ECN/DCQCN. | Lowest latency, hardware transport, operationally sensitive. | AI/HPC, NVMe-oF, high-performance DC fabrics. |
| iWARP | Lossy Ethernet is acceptable because TCP recovers loss. | Higher overhead and latency, simpler fabric. | Commodity Ethernet and SMB Direct deployments where PFC is unavailable. |

3. Priority Flow Control

PFC is per-priority PAUSE. When a lossless queue crosses the xoff threshold, the switch asks its upstream neighbor to stop only that priority while best-effort traffic continues.

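Headroom is the buffer reserved above the xoff threshold to absorb packets that are already in flight, or still being sent, between the moment the PAUSE frame is generated and the moment the upstream sender actually stops. If headroom is too small, the lossless queue overflows anyway, packets are dropped, and RoCE falls back to retransmission.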

PFC can also create a circular wait. If every switch in a cycle pauses the next one on the same lossless priority, no queue drains and a watchdog must break the deadlock.

Minimal C Demo - PFC Threshold Calculator


Minimal C Demo - PFC Deadlock Visualizer


4. ECN and DCQCN

DCQCN is the RoCE congestion-control loop. The switch marks ECN CE on packets in a congested queue, the receiving NIC returns a Congestion Notification Packet (CNP) to the sender, and the sender's NIC reduces its hardware transmit rate before queues grow into PFC territory.

The sender keeps an alpha estimate of congestion severity. CNPs raise alpha and cut rate; quiet intervals decay alpha and allow additive or hyper increase.

Background - Why PFC Alone Is Not Enough

PFC prevents drops by stopping traffic, but stopping traffic can spread congestion backward through the fabric. DCQCN tries to keep queues below xoff by reducing senders earlier using ECN marks.

Minimal C Demo - DCQCN Rate Controller


5. Headroom, DCB, and Lossless Priority Planning

A production RoCE fabric combines PFC, ETS, DCBX, ECN thresholds, and explicit priority mapping. The usual design is one lossless RDMA priority and one or more lossy best-effort priorities.

Buffer planning separates reserved lossless headroom from shared lossy buffer. A PAUSE on priority 3 must not freeze ordinary TCP/IP traffic on priority 0.

Minimal C Demo - Headroom Budget


6. NVMe-oF RDMA, NFS/RDMA, and SMB Direct

RDMA storage keeps control messages as SEND/RECV capsules and moves bulk data with one-sided READ or WRITE. That split keeps command processing explicit while the large payload avoids the kernel socket data path.

Compared with NVMe-oF TCP, the RDMA transport removes much of the socket, copy, and interrupt work from the data movement path. The tradeoff is stricter NIC, memory-registration, and fabric configuration requirements.

| Protocol | RDMA role | Kernel/user components |
|---|---|---|
| NVMe-oF RDMA | SEND capsules, RDMA READ/WRITE for data. | nvme-rdma, nvmet-rdma. |
| NFS/RDMA | RPC-over-RDMA uses DDP for NFS payloads. | xprtrdma, svcrdma, sunrpc. |
| SMB Direct | SMB 3.x over RDMA NICs for low-copy file and VM migration traffic. | Windows SMB multichannel with RoCE or iWARP NICs. |

Minimal C Demo - NVMe-oF RDMA I/O Flow


7. Kernel Source Pointers

  • drivers/infiniband/core/cma.c - RDMA CM address and route resolution.
  • drivers/infiniband/hw/mlx5/ - mlx5 RoCE-capable provider implementation.
  • drivers/nvme/host/rdma.c - NVMe-oF RDMA initiator transport.
  • drivers/nvme/target/rdma.c - NVMe-oF RDMA target transport.
  • net/sunrpc/xprtrdma/ and net/sunrpc/svcrdma/ - NFS/RDMA client and server transports.

8. Interview Prep

| Question | Concise answer |
|---|---|
| Why does RoCE v2 use UDP? | UDP/IP makes RoCE routable and gives ECMP entropy through the source port while preserving IB transport headers above UDP. |
| What does PFC protect? | It protects a selected priority queue from drops by pausing that priority upstream; it does not provide congestion control by itself. |
| What is PFC headroom? | Reserved buffer for packets already in flight after a PAUSE frame is sent until the upstream sender actually stops. |
| How does DCQCN differ from DCTCP? | DCTCP uses TCP ACK ECN feedback; DCQCN uses RoCE CNP packets and NIC rate limiting because there are no TCP ACKs. |
| NVMe-oF RDMA read path? | Host SENDs a command; target reads media; target RDMA_WRITEs host buffer; target SENDs completion. |