29. RoCE and Lossless Data Centers
RoCE framing, iWARP tradeoffs, PFC headroom, DCQCN congestion control, and RDMA storage flows.
1. RoCE v1 and RoCE v2 Architecture
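RoCE carries InfiniBand transport semantics over Ethernet. RoCE v1 places the InfiniBand headers directly behind a dedicated EtherType (0x8915), so it stays inside one L2 broadcast domain; RoCE v2 wraps the same transport headers in IP and UDP with destination port 4791, so packets can cross a routed spine-leaf fabric.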
| Field | Meaning | Why operators care |
|---|---|---|
| BTH | Opcode, destination QPN, PSN, P_Key, solicited event bit. | Identifies the destination QP and lets RC order and retransmit packets. |
| RETH | Remote virtual address, rkey, DMA length. | Carried on RDMA WRITE (first/only) and RDMA READ request packets; the rkey authorizes remote memory access. |
| UDP source port | Hash-derived entropy for RoCE v2. | Lets ECMP spread flows while the destination port stays 4791. |
| IP ECN bits | ECT and CE marks. | Switches signal congestion to DCQCN without dropping packets. |
Minimal C Demo - RoCE Frame Builder
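A minimal sketch of how the BTH and RETH from the table above could be serialized for an RC RDMA WRITE Only packet. The function names (`build_bth`, `build_reth`, `roce_udp_sport`) are hypothetical, and the source-port hash is an illustrative assumption, not the hash any particular NIC or driver uses. Ethernet, IP, and UDP headers, the payload, and the ICRC are left out.

```c
#include <stddef.h>
#include <stdint.h>

enum { BTH_LEN = 12, RETH_LEN = 16, ROCE_V2_UDP_DPORT = 4791 };
enum { OP_RC_RDMA_WRITE_ONLY = 0x0A };   /* RC transport, RDMA WRITE Only */

/* Store the low n bytes of v at p in network (big-endian) byte order. */
static void put_be(uint8_t *p, uint64_t v, int n)
{
    for (int i = n - 1; i >= 0; i--) {
        p[i] = (uint8_t)(v & 0xFF);
        v >>= 8;
    }
}

/* Base Transport Header: opcode, flags, P_Key, destination QPN, PSN. */
size_t build_bth(uint8_t *out, uint8_t opcode, int solicited, uint16_t pkey,
                 uint32_t dest_qpn, int ack_req, uint32_t psn)
{
    out[0] = opcode;
    out[1] = solicited ? 0x80 : 0x00;        /* SE bit; MigReq, PadCnt, TVer = 0 */
    put_be(out + 2, pkey, 2);
    out[4] = 0;                              /* left zero in this sketch */
    put_be(out + 5, dest_qpn & 0xFFFFFF, 3); /* 24-bit destination QPN */
    out[8] = ack_req ? 0x80 : 0x00;          /* AckReq bit, then 7 reserved bits */
    put_be(out + 9, psn & 0xFFFFFF, 3);      /* 24-bit packet sequence number */
    return BTH_LEN;
}

/* RDMA Extended Transport Header: remote address, rkey, DMA length. */
size_t build_reth(uint8_t *out, uint64_t remote_va, uint32_t rkey,
                  uint32_t dma_len)
{
    put_be(out, remote_va, 8);
    put_be(out + 8, rkey, 4);
    put_be(out + 12, dma_len, 4);
    return RETH_LEN;
}

/* Illustrative per-flow entropy: mix both QPNs and keep the result in
 * the dynamic port range 0xC000-0xFFFF. The mixing steps are an assumption. */
uint16_t roce_udp_sport(uint32_t src_qpn, uint32_t dst_qpn)
{
    uint32_t h = (src_qpn * 31u) ^ dst_qpn;
    h ^= h >> 16;
    h ^= h >> 8;
    return (uint16_t)(0xC000u | (h & 0x3FFFu));
}
```

Calling `build_bth` and then `build_reth` on the same buffer yields the 28 bytes that follow the UDP header in a RoCE v2 WRITE Only packet. Because `roce_udp_sport` depends only on the QP pair, every packet of a flow hashes to the same ECMP path, which preserves ordering for RC.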
2. iWARP: RDMA over TCP
iWARP keeps the verbs model but runs over TCP. RDMAP expresses RDMA operations, DDP places data directly into buffers, and MPA preserves message framing over TCP byte streams.
| Choice | Fabric requirement | Latency/CPU profile | Typical fit |
|---|---|---|---|
| RoCE v2 | Lossless or near-lossless Ethernet with PFC and ECN/DCQCN. | Lowest latency, hardware transport, operationally sensitive. | AI/HPC, NVMe-oF, high-performance DC fabrics. |
| iWARP | Lossy Ethernet is acceptable because TCP recovers loss. | Higher overhead and latency, simpler fabric. | Commodity Ethernet and SMB Direct deployments where PFC is unavailable. |
3. Priority Flow Control
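PFC (IEEE 802.1Qbb) is PAUSE applied per priority. When a lossless queue crosses the xoff threshold, the switch asks its upstream neighbor to stop only that priority while best-effort traffic continues.
Headroom is the buffer needed to absorb packets already in flight after the PAUSE frame is generated: data on the wire, data the upstream port sends before it reacts, and any frame it had already started serializing. If headroom is too small, the queue overflows anyway, packets drop, and the RC transport falls back to costly retransmission.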
PFC can also create a circular wait. If every switch in a cycle pauses the next one on the same lossless priority, no queue drains and a watchdog must break the deadlock.
Minimal C Demo - PFC Threshold Calculator
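A minimal sketch of the headroom and xoff arithmetic described above. It assumes roughly 5 ns of propagation delay per metre of cable, a caller-supplied upstream response time, and two maximum-size frames of slack (one the upstream port had already begun sending, one that can delay the PAUSE frame on the reverse link). All function names and constants are hypothetical; real switches document their own formulas and cell sizes.

```c
#include <stdint.h>

enum { PROP_NS_PER_M = 5 };   /* assumed propagation delay per metre */

/* Bytes that can still arrive after the switch decides to send PAUSE. */
uint64_t pfc_headroom_bytes(uint32_t rate_gbps, uint32_t cable_m,
                            uint32_t response_ns, uint32_t mtu)
{
    uint64_t rtt_ns = 2ULL * cable_m * PROP_NS_PER_M;
    /* 1 Gb/s is 1 bit per ns, so Gb/s * ns / 8 gives bytes. */
    uint64_t in_flight = (uint64_t)rate_gbps * (rtt_ns + response_ns) / 8;
    return in_flight + 2ULL * mtu;
}

/* xoff must leave the full headroom free above it in the queue. */
uint64_t pfc_xoff_threshold(uint64_t queue_bytes, uint64_t headroom)
{
    return queue_bytes > headroom ? queue_bytes - headroom : 0;
}

/* Hysteresis: pause at or above xoff, resume at or below xon,
 * otherwise keep the previous state. */
int pfc_next_paused(uint64_t occupancy, uint64_t xoff, uint64_t xon,
                    int paused)
{
    if (occupancy >= xoff)
        return 1;
    if (occupancy <= xon)
        return 0;
    return paused;
}
```

With these assumptions, a 100 Gb/s port on a 100 m cable with a 1000 ns response time and a 9216-byte MTU needs 25000 + 18432 = 43432 bytes of headroom, so a 200000-byte lossless queue would place xoff at 156568 bytes.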
Minimal C Demo - PFC Deadlock Visualizer
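The circular wait can be modeled as a cycle in a "waits on" graph: an edge from switch i to switch j means i cannot drain its lossless queue until j lifts its PAUSE. The sketch below finds one such cycle with a depth-first search. The `pause_graph` structure, the eight-switch limit, and the function names are hypothetical simplifications, not a model of any real watchdog.

```c
#define MAX_SW 8

struct pause_graph {
    int n;                                   /* number of switches in use */
    unsigned char waits_on[MAX_SW][MAX_SW];  /* [i][j] != 0: i is paused by j */
};

/* color: 0 = unvisited, 1 = on the current DFS path, 2 = finished.
 * Returns the cycle length and fills cycle[], or 0 if none is reachable. */
static int dfs(const struct pause_graph *g, int u, int *color,
               int *stack, int depth, int *cycle)
{
    color[u] = 1;
    stack[depth] = u;
    for (int v = 0; v < g->n; v++) {
        if (!g->waits_on[u][v])
            continue;
        if (color[v] == 1) {
            /* Back edge: the path from v to u closes a pause cycle. */
            int start = 0, len = 0;
            while (stack[start] != v)
                start++;
            for (int i = start; i <= depth; i++)
                cycle[len++] = stack[i];
            return len;
        }
        if (color[v] == 0) {
            int len = dfs(g, v, color, stack, depth + 1, cycle);
            if (len)
                return len;
        }
    }
    color[u] = 2;
    return 0;
}

/* Returns the number of switches in the first cycle found, 0 if none. */
int pfc_find_cycle(const struct pause_graph *g, int cycle[MAX_SW])
{
    int color[MAX_SW] = {0};
    int stack[MAX_SW];

    for (int s = 0; s < g->n; s++) {
        if (color[s] == 0) {
            int len = dfs(g, s, color, stack, 0, cycle);
            if (len)
                return len;
        }
    }
    return 0;
}
```

To visualize a result, a caller can print the returned switch indices in order and repeat the first one at the end, for example `S0 -> S1 -> S2 -> S0`. Clearing any single edge of that cycle, which is roughly what a watchdog does when it drops or unpauses a stuck queue, makes the next call return 0.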
4. ECN and DCQCN
DCQCN is the RoCE congestion-control loop. The switch marks ECN CE, the receiver sends a CNP, and the sender reduces its hardware transmit rate before queues grow into PFC territory.
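The sender keeps an alpha estimate of congestion severity. Each CNP raises alpha and cuts the current rate in proportion to it; intervals without CNPs decay alpha and let the rate recover, first by fast recovery toward the pre-cut rate and then by additive or hyper increase.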
Background - Why PFC Alone Is Not Enough
PFC prevents drops by stopping traffic, but stopping traffic can spread congestion backward through the fabric. DCQCN tries to keep queues below xoff by reducing senders earlier using ECN marks.
Minimal C Demo - DCQCN Rate Controller
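A simplified sketch of the sender-side loop described above. It follows the update rules from the published DCQCN design (cut the rate by alpha/2 on a CNP, decay alpha when no CNP arrives, then recover toward the target rate), but it collapses the separate timer and byte counters into a single stage counter and uses a fixed hyper-increase step. The structure, function names, and constant values are illustrative assumptions, not NIC defaults.

```c
struct dcqcn {
    double rc;         /* current sending rate, Gb/s */
    double rt;         /* target rate remembered from before the last cut */
    double alpha;      /* congestion severity estimate, 0..1 */
    double line_rate;  /* Gb/s */
    int    stage;      /* increase events since the last CNP */
};

#define DCQCN_G        (1.0 / 256.0)  /* alpha gain (assumed) */
#define DCQCN_F        5              /* fast-recovery stages (assumed) */
#define DCQCN_R_AI     0.04           /* additive step, Gb/s (assumed) */
#define DCQCN_R_HAI    0.4            /* hyper step, Gb/s (assumed) */
#define DCQCN_MIN_RATE 0.01           /* rate floor, Gb/s (assumed) */

void dcqcn_init(struct dcqcn *s, double line_rate_gbps)
{
    s->rc = s->rt = s->line_rate = line_rate_gbps;
    s->alpha = 1.0;
    s->stage = 0;
}

/* CNP received: remember the old rate, cut, raise alpha, restart recovery. */
void dcqcn_on_cnp(struct dcqcn *s)
{
    s->rt = s->rc;
    s->rc *= 1.0 - s->alpha / 2.0;
    if (s->rc < DCQCN_MIN_RATE)
        s->rc = DCQCN_MIN_RATE;
    s->alpha = (1.0 - DCQCN_G) * s->alpha + DCQCN_G;
    s->stage = 0;
}

/* Alpha timer expired with no CNP in the interval: decay alpha. */
void dcqcn_on_alpha_timer(struct dcqcn *s)
{
    s->alpha *= 1.0 - DCQCN_G;
}

/* Increase event: fast recovery first, then additive, then hyper increase. */
void dcqcn_on_increase_timer(struct dcqcn *s)
{
    s->stage++;
    if (s->stage > 2 * DCQCN_F)
        s->rt += DCQCN_R_HAI;
    else if (s->stage > DCQCN_F)
        s->rt += DCQCN_R_AI;
    if (s->rt > s->line_rate)
        s->rt = s->line_rate;
    s->rc = (s->rt + s->rc) / 2.0;   /* move halfway toward the target */
}
```

Starting at 100 Gb/s with alpha at 1, the first CNP halves the rate to 50 Gb/s and the next increase event moves it halfway back to 75 Gb/s. Later CNPs cut less once quiet intervals have decayed alpha, which is what lets the loop settle instead of oscillating between line rate and half rate.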
5. Headroom, DCB, and Lossless Priority Planning
A production RoCE fabric combines PFC, ETS, DCBX, ECN thresholds, and explicit priority mapping. The usual design is one lossless RDMA priority and one or more lossy best-effort priorities.
Buffer planning separates reserved lossless headroom from shared lossy buffer. A PAUSE on priority 3 must not freeze ordinary TCP/IP traffic on priority 0.
Minimal C Demo - Headroom Budget
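A minimal sketch of the split between reserved lossless headroom and the shared pool. It assumes one flat packet buffer and the same headroom reservation for every lossless queue on every port; the `buffer_plan` structure and function names are hypothetical, and real ASICs add per-queue minimums, cell rounding, and ingress/egress pools that are ignored here.

```c
#include <stdint.h>

struct buffer_plan {
    uint64_t total_bytes;         /* whole packet buffer */
    uint32_t ports;               /* ports carrying lossless traffic */
    uint32_t lossless_prios;      /* lossless priorities per port */
    uint64_t headroom_per_queue;  /* bytes reserved per (port, priority) */
};

/* Headroom is reserved per port and per lossless priority. */
uint64_t plan_reserved_bytes(const struct buffer_plan *p)
{
    return (uint64_t)p->ports * p->lossless_prios * p->headroom_per_queue;
}

/* Writes the bytes left for the shared pool and returns 0, or returns -1
 * when the reservations alone exceed the buffer. */
int plan_shared_pool(const struct buffer_plan *p, uint64_t *shared_out)
{
    uint64_t reserved = plan_reserved_bytes(p);

    if (reserved > p->total_bytes)
        return -1;
    *shared_out = p->total_bytes - reserved;
    return 0;
}
```

For a 32 MiB buffer, 32 ports, one lossless priority, and 43432 bytes of headroom per queue, the reservation is 1389824 bytes and 32164608 bytes remain shared. The same buffer cannot cover 64 ports with 8 lossless priorities at 100000 bytes each, which is one reason designs usually keep a single lossless priority.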
6. NVMe-oF RDMA, NFS/RDMA, and SMB Direct
RDMA storage keeps control messages as SEND/RECV capsules and moves bulk data with one-sided READ or WRITE. That split keeps command processing explicit while the large payload avoids the kernel socket data path.
Compared with NVMe-oF TCP, the RDMA transport removes much of the socket, copy, and interrupt work from the data movement path. The tradeoff is stricter NIC, memory-registration, and fabric configuration requirements.
| Protocol | RDMA role | Kernel/user components |
|---|---|---|
| NVMe-oF RDMA | SEND capsules, RDMA READ/WRITE for data. | nvme-rdma, nvmet-rdma. |
| NFS/RDMA | RPC-over-RDMA moves large NFS payloads by direct data placement, using RDMA READ/WRITE chunks. | xprtrdma, svcrdma, sunrpc. |
| SMB Direct | SMB 3.x over RDMA NICs for low-copy file and VM migration traffic. | Windows SMB multichannel with RoCE or iWARP NICs. |
Minimal C Demo - NVMe-oF RDMA I/O Flow
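The control/data split can be sketched as an ordered list of steps per I/O direction. The read sequence follows the flow described in this section; the write sequence assumes the data is not carried in-capsule, so the target pulls it with RDMA READ. The enum and function names are hypothetical and do not correspond to symbols in nvme-rdma or nvmet-rdma.

```c
#include <stddef.h>

enum nvmf_step {
    HOST_SEND_CMD,       /* host SENDs the command capsule */
    TARGET_RDMA_READ,    /* target pulls write data from the host buffer */
    TARGET_MEDIA_WRITE,  /* target writes the data to media */
    TARGET_MEDIA_READ,   /* target reads the data from media */
    TARGET_RDMA_WRITE,   /* target pushes read data into the host buffer */
    TARGET_SEND_CQE      /* target SENDs the completion capsule */
};

/* Fills steps[] with the ordered flow for one I/O and returns its length. */
size_t nvmf_rdma_flow(int is_write, enum nvmf_step steps[4])
{
    size_t n = 0;

    steps[n++] = HOST_SEND_CMD;
    if (is_write) {
        steps[n++] = TARGET_RDMA_READ;
        steps[n++] = TARGET_MEDIA_WRITE;
    } else {
        steps[n++] = TARGET_MEDIA_READ;
        steps[n++] = TARGET_RDMA_WRITE;
    }
    steps[n++] = TARGET_SEND_CQE;
    return n;
}

/* Human-readable label for tracing a flow. */
const char *nvmf_step_name(enum nvmf_step s)
{
    switch (s) {
    case HOST_SEND_CMD:      return "host SEND command capsule";
    case TARGET_RDMA_READ:   return "target RDMA READ from host buffer";
    case TARGET_MEDIA_WRITE: return "target write to media";
    case TARGET_MEDIA_READ:  return "target read from media";
    case TARGET_RDMA_WRITE:  return "target RDMA WRITE to host buffer";
    case TARGET_SEND_CQE:    return "target SEND completion";
    }
    return "unknown";
}
```

In both directions the host posts only the command SEND and a receive for the completion; the target drives every one-sided operation against memory the host registered in advance, which is why the host CPU stays out of the bulk data path.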
7. Kernel Source Pointers
- drivers/infiniband/core/cma.c - RDMA CM address and route resolution.
- drivers/infiniband/hw/mlx5/ - mlx5 RoCE-capable provider implementation.
- drivers/nvme/host/rdma.c - NVMe-oF RDMA initiator transport.
- drivers/nvme/target/rdma.c - NVMe-oF RDMA target transport.
- net/sunrpc/xprtrdma/ and net/sunrpc/svcrdma/ - NFS/RDMA client and server transports.
8. Interview Prep
| Question | Concise answer |
|---|---|
| Why does RoCE v2 use UDP? | UDP/IP makes RoCE routable and gives ECMP entropy through the source port while preserving IB transport headers above UDP. |
| What does PFC protect? | It protects a selected priority queue from drops by pausing that priority upstream; it does not provide congestion control by itself. |
| What is PFC headroom? | Reserved buffer for packets already in flight after a PAUSE frame is sent until the upstream sender actually stops. |
| How does DCQCN differ from DCTCP? | DCTCP uses TCP ACK ECN feedback; DCQCN uses RoCE CNP packets and NIC rate limiting because there are no TCP ACKs. |
| NVMe-oF RDMA read path? | Host SENDs a command capsule; target reads media; target RDMA_WRITEs the data into the host buffer; target SENDs the completion. |