Part XXII - RDMA

28. RDMA Fundamentals

Verbs objects, queue-pair setup, memory registration, one-sided operations, and transport scaling.

1. Why RDMA Exists

RDMA moves the hot data path from kernel sockets into the HCA. The application registers memory once, posts work requests from user space, and the NIC performs DMA directly into local or remote application memory.

The hardware contract is strict: memory must be pinned and protected, the HCA must know the translation, and remote access must present the correct key before any DMA is accepted.
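
The contract above can be pictured as a check applied to every incoming one-sided request. The sketch below is a plain-C model of that check, not libibverbs code: `struct model_mr`, the `MODEL_ACCESS_*` flags, and `model_remote_access_ok` are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical access flags; real code uses the IBV_ACCESS_* values. */
enum {
    MODEL_ACCESS_REMOTE_READ  = 1 << 0,
    MODEL_ACCESS_REMOTE_WRITE = 1 << 1,
};

/* Hypothetical view of what the HCA remembers about one registered region. */
struct model_mr {
    uint64_t base;    /* first registered virtual address */
    uint64_t length;  /* registered length in bytes */
    uint32_t rkey;    /* key the peer must present */
    unsigned access;  /* MODEL_ACCESS_* bits granted at registration */
};

/* Returns true only if the key matches, the permission was granted,
 * and [addr, addr + len) lies entirely inside the registered range. */
static bool model_remote_access_ok(const struct model_mr *mr, uint32_t rkey,
                                   uint64_t addr, uint64_t len, unsigned need)
{
    if (rkey != mr->rkey)
        return false;
    if ((mr->access & need) != need)
        return false;
    if (addr < mr->base || len > mr->length)
        return false;
    /* Written as a subtraction so that addr + len cannot overflow. */
    return addr - mr->base <= mr->length - len;
}
```

A request that fails any of the three tests is rejected before any DMA takes place, which is the property the registration step exists to guarantee.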

2. Verbs Objects: PD, MR, CQ, QP, WR

The verbs API is the user-space control plane for the HCA. A protection domain owns the resources; memory regions expose buffers; queue pairs hold work queues; completion queues report finished work requests.

Object | Created by | What it protects or stores
PD | ibv_alloc_pd | Namespace that prevents unrelated QPs and MRs from mixing.
MR | ibv_reg_mr | Pinned buffer, access flags, local key, remote key.
CQ | ibv_create_cq | Completion entries for signaled sends and receives.
QP | ibv_create_qp | Send queue, receive queue, QPN, transport type, and state.
WR / WC | ibv_post_send / poll CQ | Work request submitted by the app; work completion returned by HCA.

MR registration maps a virtual range into an HCA-visible page list. The returned lkey authorizes local DMA; the rkey is the capability a peer must present for READ, WRITE, or ATOMIC access.

Minimal C Demo - Verbs Object Creation
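
A minimal sketch of the creation order, written as a plain-C model so that it runs without an RDMA device. The `model_*` structs and functions are hypothetical stand-ins, not the libibverbs API; the comments name the real call each step corresponds to, and the key values are invented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for verbs objects; these are not libibverbs structs. */
struct model_pd { bool allocated; };
struct model_cq { int depth; };
struct model_mr { const struct model_pd *pd; void *addr; size_t len; uint32_t lkey, rkey; };
struct model_qp { const struct model_pd *pd; const struct model_cq *cq; };

/* Step 1 (ibv_alloc_pd in real code): the PD comes first. */
static void model_alloc_pd(struct model_pd *pd) { pd->allocated = true; }

/* Step 2 (ibv_reg_mr): needs a PD and a buffer. The key values are invented. */
static bool model_reg_mr(struct model_mr *mr, const struct model_pd *pd,
                         void *addr, size_t len)
{
    if (!pd->allocated || addr == NULL || len == 0)
        return false;
    mr->pd = pd;
    mr->addr = addr;
    mr->len = len;
    mr->lkey = 0x1001;
    mr->rkey = 0x2001;
    return true;
}

/* Step 3 (ibv_create_cq): a CQ only needs a positive depth in this model. */
static bool model_create_cq(struct model_cq *cq, int depth)
{
    if (depth <= 0)
        return false;
    cq->depth = depth;
    return true;
}

/* Step 4 (ibv_create_qp): needs both a PD and a CQ. */
static bool model_create_qp(struct model_qp *qp, const struct model_pd *pd,
                            const struct model_cq *cq)
{
    if (!pd->allocated || cq->depth <= 0)
        return false;
    qp->pd = pd;
    qp->cq = cq;
    return true;
}

/* A QP may use an MR's lkey only when both belong to the same PD. */
static bool model_qp_can_use_mr(const struct model_qp *qp, const struct model_mr *mr)
{
    return qp->pd == mr->pd;
}
```

Calling the steps out of order fails in the model, and an MR registered under a different PD is rejected by `model_qp_can_use_mr`, which mirrors the isolation role of the PD described in the table.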


3. Queue Pair State Machine

A QP is created in the RESET state and cannot send or receive. The application calls ibv_modify_qp to install local port data, remote address data, packet sequence numbers, and retry policy before the HCA may transmit.

The remote QPN, LID or GID, PSN, address, and rkey usually arrive over a sideband channel such as TCP, RDMA CM, gRPC, MPI bootstrap, or an application-specific control plane.
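
One way to carry these values over the sideband is a fixed-size record in network byte order, so that peers with different endianness agree on the layout. The sketch below is an assumption about how such a record could look: `struct model_endpoint`, its field names, and the 38-byte wire layout are hypothetical and not part of any verbs API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical record exchanged over the sideband; field names are illustrative. */
struct model_endpoint {
    uint32_t qpn;      /* queue pair number */
    uint32_t psn;      /* initial packet sequence number */
    uint16_t lid;      /* LID, or 0 when the GID is used instead */
    uint8_t  gid[16];  /* GID */
    uint64_t addr;     /* virtual address of the exposed buffer */
    uint32_t rkey;     /* remote key for that buffer */
};

enum { MODEL_ENDPOINT_WIRE_SIZE = 4 + 4 + 2 + 16 + 8 + 4 };

/* Store the low n bytes of v in big-endian order and return the next slot. */
static uint8_t *put_be(uint8_t *p, uint64_t v, int n)
{
    for (int i = n - 1; i >= 0; i--) {
        p[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
    return p + n;
}

/* Read n big-endian bytes into *v and return the next slot. */
static const uint8_t *get_be(const uint8_t *p, uint64_t *v, int n)
{
    *v = 0;
    for (int i = 0; i < n; i++)
        *v = (*v << 8) | p[i];
    return p + n;
}

static void model_endpoint_pack(const struct model_endpoint *e,
                                uint8_t out[MODEL_ENDPOINT_WIRE_SIZE])
{
    uint8_t *p = out;
    p = put_be(p, e->qpn, 4);
    p = put_be(p, e->psn, 4);
    p = put_be(p, e->lid, 2);
    memcpy(p, e->gid, 16);
    p += 16;
    p = put_be(p, e->addr, 8);
    put_be(p, e->rkey, 4);
}

static void model_endpoint_unpack(struct model_endpoint *e,
                                  const uint8_t in[MODEL_ENDPOINT_WIRE_SIZE])
{
    const uint8_t *p = in;
    uint64_t v;
    p = get_be(p, &v, 4); e->qpn = (uint32_t)v;
    p = get_be(p, &v, 4); e->psn = (uint32_t)v;
    p = get_be(p, &v, 2); e->lid = (uint16_t)v;
    memcpy(e->gid, p, 16);
    p += 16;
    p = get_be(p, &v, 8); e->addr = v;
    get_be(p, &v, 4);     e->rkey = (uint32_t)v;
}
```

Each side would pack its own record, send the bytes over the sideband, and unpack the peer's record before modifying its QP. Packing field by field avoids depending on struct padding.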

Background - Why RTR Before RTS

RDMA hardware can transmit as soon as a QP is ready to send, so the peer must first be ready to receive or accept one-sided access. The connection setup therefore installs receive-side state before send-side state.

Plan

  1. Exchange endpoint identifiers and initial PSNs over the sideband.
  2. Move each QP to INIT with local port and access flags.
  3. Move each QP to RTR with the remote destination and receive PSN.
  4. Move each QP to RTS with local send PSN, timeout, and retry controls.

Minimal C Demo - QP Connection Setup
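
A minimal sketch of the plan above as a plain-C state machine that runs without an RDMA device. The `model_*` names are hypothetical; real code drives the same transitions with ibv_modify_qp and the IBV_QPS_RESET, IBV_QPS_INIT, IBV_QPS_RTR, and IBV_QPS_RTS states. The timeout and retry values passed in `model_connect` are arbitrary example numbers.

```c
#include <stdbool.h>
#include <stdint.h>

enum model_qp_state { MODEL_RESET, MODEL_INIT, MODEL_RTR, MODEL_RTS };

struct model_qp {
    enum model_qp_state state;
    uint8_t  port;       /* INIT: local port */
    unsigned access;     /* INIT: remote access flags */
    uint32_t dest_qpn;   /* RTR: peer QPN */
    uint32_t rq_psn;     /* RTR: first PSN expected from the peer */
    uint32_t sq_psn;     /* RTS: first PSN this side sends */
    uint8_t  timeout;    /* RTS: ACK timeout setting */
    uint8_t  retry_cnt;  /* RTS: transport retry count */
};

/* RESET -> INIT. Verbs port numbers start at 1, so 0 is rejected. */
static bool model_to_init(struct model_qp *qp, uint8_t port, unsigned access)
{
    if (qp->state != MODEL_RESET || port == 0)
        return false;
    qp->port = port;
    qp->access = access;
    qp->state = MODEL_INIT;
    return true;
}

/* INIT -> RTR: install the receive side (who the peer is, what PSN to expect). */
static bool model_to_rtr(struct model_qp *qp, uint32_t dest_qpn, uint32_t rq_psn)
{
    if (qp->state != MODEL_INIT)
        return false;
    qp->dest_qpn = dest_qpn;
    qp->rq_psn = rq_psn;
    qp->state = MODEL_RTR;
    return true;
}

/* RTR -> RTS: install the send side (own starting PSN and retry policy). */
static bool model_to_rts(struct model_qp *qp, uint32_t sq_psn,
                         uint8_t timeout, uint8_t retry_cnt)
{
    if (qp->state != MODEL_RTR)
        return false;
    qp->sq_psn = sq_psn;
    qp->timeout = timeout;
    qp->retry_cnt = retry_cnt;
    qp->state = MODEL_RTS;
    return true;
}

/* Connect two QPs in lockstep. Each side's receive PSN is the peer's send PSN.
 * Real peers run these steps separately and synchronize over the sideband. */
static bool model_connect(struct model_qp *a, uint32_t a_qpn, uint32_t a_psn,
                          struct model_qp *b, uint32_t b_qpn, uint32_t b_psn)
{
    return model_to_init(a, 1, 0) && model_to_init(b, 1, 0)
        && model_to_rtr(a, b_qpn, b_psn) && model_to_rtr(b, a_qpn, a_psn)
        && model_to_rts(a, a_psn, 14, 7) && model_to_rts(b, b_psn, 14, 7);
}
```

The model rejects any attempt to skip a state, for example moving from INIT straight to RTS, which is the ordering constraint the background note motivates.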


4. RDMA Operations

RDMA verbs split into two-sided messaging and one-sided memory operations. SEND consumes a posted receive on the peer; WRITE, READ, and ATOMIC operate on a registered remote address using an rkey.

WRITE is the common bulk-transfer primitive because the remote CPU is not involved and no receive buffer is consumed. WRITE_WITH_IMM is the exception: it consumes a posted receive on the peer and delivers a 32-bit immediate value in that receive's completion.

READ reverses the direction of data movement. The initiator asks the remote HCA to fetch from a registered memory range, and the work request completes only after the local destination buffer has been filled.

Minimal C Demo - Operation Comparison
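
A minimal sketch that encodes the comparison from this section as a lookup table. The `model_op` enum and `model_op_traits` struct are hypothetical names for illustration and do not correspond to the real opcode constants.

```c
#include <stdbool.h>

enum model_op {
    MODEL_SEND,
    MODEL_WRITE,
    MODEL_WRITE_WITH_IMM,
    MODEL_READ,
    MODEL_ATOMIC
};

struct model_op_traits {
    const char *name;
    bool consumes_remote_recv;      /* peer must have posted a receive */
    bool needs_rkey;                /* initiator supplies remote address and rkey */
    bool returns_data_to_initiator; /* payload or old value comes back */
};

/* Properties of each operation as described in the text above. */
static struct model_op_traits model_traits(enum model_op op)
{
    switch (op) {
    case MODEL_SEND:
        return (struct model_op_traits){ "SEND", true, false, false };
    case MODEL_WRITE:
        return (struct model_op_traits){ "WRITE", false, true, false };
    case MODEL_WRITE_WITH_IMM:
        return (struct model_op_traits){ "WRITE_WITH_IMM", true, true, false };
    case MODEL_READ:
        return (struct model_op_traits){ "READ", false, true, true };
    case MODEL_ATOMIC:
        return (struct model_op_traits){ "ATOMIC", false, true, true };
    }
    return (struct model_op_traits){ "UNKNOWN", false, false, false };
}
```

Reading the table by column: only SEND and WRITE_WITH_IMM depend on a posted receive, every operation except SEND needs an rkey, and only READ and ATOMIC bring data back to the initiator.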


Minimal C Demo - WRITE Size Tradeoff
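
A minimal sketch of a first-order throughput model, assuming each WRITE pays a fixed per-operation overhead plus wire time and that operations are not pipelined. The function names and the overhead and line-rate parameters are assumptions chosen for illustration, not measurements of any HCA.

```c
/* Effective throughput in Gb/s for back-to-back, non-pipelined WRITEs.
 * overhead_ns: assumed fixed cost per operation.
 * link_gbps:   assumed line rate; 1 Gb/s moves 1 bit per nanosecond. */
static double model_write_gbps(double size_bytes, double overhead_ns, double link_gbps)
{
    double bits = size_bytes * 8.0;
    double wire_ns = bits / link_gbps;
    return bits / (overhead_ns + wire_ns);
}

/* Message size in bytes at which the model reaches the given fraction
 * (0 < fraction < 1) of line rate. From wire / (overhead + wire) = fraction. */
static double model_size_for_efficiency(double fraction, double overhead_ns,
                                        double link_gbps)
{
    double wire_ns = fraction * overhead_ns / (1.0 - fraction);
    return wire_ns * link_gbps / 8.0;
}
```

With an assumed 1000 ns overhead on a 100 Gb/s link, the model gives 50 Gb/s at 12500 bytes, because wire time and overhead are then equal. Small messages are dominated by the fixed cost, which is why the tradeoff favors larger or batched WRITEs. Keeping many WRITEs in flight changes the picture and is not captured here.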


5. Memory Region Deep Dive

Registration is a control-plane cost and is best kept off the per-I/O path. The kernel pins the pages, sets up the DMA (and, where enabled, IOMMU) mappings, and tells the HCA which virtual range may be accessed with the new keys.

Strategy | Best fit | Tradeoff
Register once | Static pools, MPI buffers, storage queues | Simple and fast, but pins memory for the process lifetime.
MR cache | Frequent reuse of similar buffers | Avoids repeated registration but needs eviction and reuse tracking.
ODP | Sparse or huge virtual ranges | Lower upfront pinning, but first-touch page faults add latency.
Fast registration | Per-I/O storage mappings | Rekeys subranges faster than full registration, with more complex lifetime rules.

Minimal C Demo - Registration Cost
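
A minimal sketch of a registration cost model, assuming the cost is a fixed control-plane part plus a per-page part for pinning and mapping. The function names and the microsecond figures used with them are assumptions for illustration, not measured values.

```c
#include <stdint.h>

/* Number of pages touched by [addr, addr + len); len must be greater than 0. */
static uint64_t model_pages_spanned(uint64_t addr, uint64_t len, uint64_t page_size)
{
    uint64_t first = addr / page_size;
    uint64_t last = (addr + len - 1) / page_size;
    return last - first + 1;
}

/* Assumed cost model: fixed control-plane cost plus a per-page pin and map cost. */
static double model_reg_cost_us(uint64_t pages, double fixed_us, double per_page_us)
{
    return fixed_us + (double)pages * per_page_us;
}

/* Registration cost charged to each I/O when one registration serves ios transfers. */
static double model_amortized_us(double reg_cost_us, uint64_t ios)
{
    return reg_cost_us / (double)ios;
}
```

For a 1 MiB page-aligned buffer with 4 KiB pages the model counts 256 pages. With an assumed 10 us fixed cost and 0.25 us per page that is 74 us, which is paid in full when registering per I/O but falls to 0.074 us per transfer when the registration is reused for 1000 transfers. An unaligned buffer can span one more page than its length suggests.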


6. QP Transports: RC, UC, UD, XRC

QP type determines reliability, supported operations, and scaling behavior. RC is the default for one-sided verbs; UD is the scalable datagram mode; XRC reduces receive-side state for many-to-one services.

XRC matters when many initiators connect to a target process. Instead of allocating a receive queue per remote peer, the target can share receive work requests through a shared receive queue.

Minimal C Demo - QP Scaling
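
A minimal sketch of the QP count arithmetic, assuming a fully connected job in which every process holds one RC QP per other process, compared with one UD QP per process. The function names and the per-QP state size are assumptions for illustration.

```c
#include <stdint.h>

/* Fully connected RC: every process holds one QP per other process. */
static uint64_t model_rc_qps_per_process(uint64_t nodes, uint64_t procs_per_node)
{
    return nodes * procs_per_node - 1;
}

/* Total RC QPs resident on one node's HCA. */
static uint64_t model_rc_qps_per_node(uint64_t nodes, uint64_t procs_per_node)
{
    return procs_per_node * model_rc_qps_per_process(nodes, procs_per_node);
}

/* UD: one QP per process can address every peer. */
static uint64_t model_ud_qps_per_node(uint64_t procs_per_node)
{
    return procs_per_node;
}

/* Rough connection-state footprint, given an assumed bytes-per-QP figure. */
static uint64_t model_qp_state_bytes(uint64_t qps, uint64_t bytes_per_qp)
{
    return qps * bytes_per_qp;
}
```

For 100 nodes with 32 processes each, the model gives 3199 RC QPs per process and 102368 per node, against 32 UD QPs per node. The per-node RC count grows with the product of process counts, which is the roughly quadratic growth that motivates UD and XRC.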


7. Kernel Source Pointers

  • drivers/infiniband/core/uverbs_main.c - user verbs device file and command dispatch.
  • drivers/infiniband/core/uverbs_cmd.c - create PD, MR, CQ, QP, and modify QP command paths.
  • drivers/infiniband/core/umem.c - user memory pinning and DMA mapping for registered memory.
  • drivers/infiniband/hw/mlx5/ - mlx5 provider implementation for ConnectX HCAs.
  • include/rdma/ib_verbs.h - kernel-level verbs structs and enums.

8. Interview Prep

Question | Concise answer
Why does RDMA need memory registration? | The HCA needs pinned pages, DMA translation, access flags, and keys; otherwise remote DMA would be unsafe and pages could move or swap.
What is the difference between lkey and rkey? | lkey authorizes local HCA DMA to and from a process buffer; rkey is shared with a peer to authorize remote READ/WRITE/ATOMIC access.
Why pre-post receives for SEND? | SEND is two-sided; the receiver HCA needs an RQ buffer before the incoming payload arrives, otherwise the sender gets a receiver-not-ready (RNR) response and must retry.
Why is RDMA WRITE often faster than READ? | WRITE is a push: data flows one way and, on RC, the initiator completes once the responder HCA acknowledges it. READ needs a request and a data-carrying response before the initiator can complete.
Why can RC QPs become a scaling problem? | All-to-all RC requires per-peer connected state, retries, PSNs, and queues, which grows roughly as N squared.