28. RDMA Fundamentals
Verbs objects, queue-pair setup, memory registration, one-sided operations, and transport scaling.
1. Why RDMA Exists
RDMA moves the hot data path from kernel sockets into the HCA. The application registers memory once, posts work requests from user space, and the NIC performs DMA directly into local or remote application memory.
The hardware contract is strict: memory must be pinned and protected, the HCA must know the translation, and remote access must present the correct key before any DMA is accepted.
2. Verbs Objects: PD, MR, CQ, QP, WR
The verbs API is the user-space control plane for the HCA. A protection domain owns the resources; memory regions expose buffers; queue pairs hold work queues; completion queues report finished work requests.
| Object | Created by | What it protects or stores |
|---|---|---|
| PD | ibv_alloc_pd | Namespace that prevents unrelated QPs and MRs from mixing. |
| MR | ibv_reg_mr | Pinned buffer, access flags, local key, remote key. |
| CQ | ibv_create_cq | Completion entries for signaled sends and receives. |
| QP | ibv_create_qp | Send queue, receive queue, QPN, transport type, and state. |
| WR / WC | ibv_post_send, ibv_post_recv / ibv_poll_cq | Work request submitted by the app; work completion returned by HCA. |
MR registration maps a virtual range into an HCA-visible page list. The returned lkey authorizes local DMA; the rkey is the capability a peer must present for READ, WRITE, or ATOMIC access.
Minimal C Demo - Verbs Object Creation
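A minimal sketch in plain C, with no libibverbs dependency, of the relationships described above: an MR and a QP each belong to a PD, and a one-sided request is accepted only if the PD, rkey, access flags, and address range all match. The names `toy_mr`, `toy_qp`, `toy_access`, and `toy_remote_access_ok` are hypothetical, and the flag values are not the real `ibv_access_flags`; an actual HCA performs this validation in hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical access bits for this toy model; not the ibv_access_flags values. */
enum toy_access {
    TOY_LOCAL_WRITE  = 1u << 0,
    TOY_REMOTE_READ  = 1u << 1,
    TOY_REMOTE_WRITE = 1u << 2
};

/* Toy stand-in for a registered memory region. */
struct toy_mr {
    uint32_t pd_id;   /* protection domain that owns this MR */
    uint64_t addr;    /* start of the registered virtual range */
    size_t   length;  /* length of the registered range in bytes */
    unsigned access;  /* bitmask of toy_access flags */
    uint32_t rkey;    /* capability a peer must present */
};

/* Toy stand-in for the responder-side queue pair. */
struct toy_qp {
    uint32_t pd_id;   /* protection domain that owns this QP */
    uint32_t qpn;     /* queue pair number */
};

/* Returns true if a one-sided request arriving on qp may touch
 * [addr, addr + len) inside mr with the requested access. */
bool toy_remote_access_ok(const struct toy_qp *qp, const struct toy_mr *mr,
                          uint32_t presented_rkey, uint64_t addr, size_t len,
                          unsigned needed_access)
{
    if (qp->pd_id != mr->pd_id)
        return false;                 /* QP and MR live in different PDs */
    if (presented_rkey != mr->rkey)
        return false;                 /* wrong capability */
    if ((mr->access & needed_access) != needed_access)
        return false;                 /* MR was not registered for this access */
    if (addr < mr->addr)
        return false;                 /* starts before the region */

    /* Overflow-safe bounds check: compare against the remaining length. */
    uint64_t offset = addr - mr->addr;
    if (offset > mr->length)
        return false;
    if (len > mr->length - offset)
        return false;
    return true;
}
```

In this model a WRITE to an MR registered with only `TOY_LOCAL_WRITE | TOY_REMOTE_WRITE` passes, while a READ against the same MR, or any request carrying a stale rkey, is rejected.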
3. Queue Pair State Machine
A QP is created in the RESET state and cannot carry traffic. The application uses ibv_modify_qp to install local port data, remote address data, packet sequence numbers, and retry policy before the HCA may transmit.
The remote QPN, LID or GID, PSN, address, and rkey usually arrive over a sideband channel such as TCP, RDMA CM, gRPC, MPI bootstrap, or an application-specific control plane.
Background - Why RTR Before RTS
RDMA hardware can transmit as soon as a QP is ready to send, so the peer must first be ready to receive or accept one-sided access. The connection setup therefore installs receive-side state before send-side state.
Plan
- Exchange endpoint identifiers and initial PSNs over the sideband.
- Move each QP to INIT with local port and access flags.
- Move each QP to RTR with the remote destination and receive PSN.
- Move each QP to RTS with local send PSN, timeout, and retry controls.
Minimal C Demo - QP Connection Setup
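The plan above can be sketched as a small state machine in plain C. This is a model of the ordering only, not a libibverbs program: `conn_qp`, `conn_endpoint`, `conn_modify`, and `conn_setup` are hypothetical names, and the error and drain states of a real QP are omitted. The point it illustrates is that each step is legal only from the previous state, so receive-side state (RTR) is always installed before send-side state (RTS).

```c
#include <stdbool.h>
#include <stdint.h>

/* Setup-path states only; error and drain states are left out of this sketch. */
enum conn_state { CONN_RESET, CONN_INIT, CONN_RTR, CONN_RTS };

/* Identifiers each side sends to its peer over the sideband. */
struct conn_endpoint {
    uint32_t qpn;   /* queue pair number */
    uint32_t psn;   /* first PSN this side will send with */
};

struct conn_qp {
    enum conn_state state;
    uint8_t  port;      /* local port, installed at INIT */
    uint32_t dest_qpn;  /* remote QPN, installed at RTR */
    uint32_t rq_psn;    /* PSN expected from the peer, installed at RTR */
    uint32_t sq_psn;    /* first PSN to send, installed at RTS */
};

/* Allows only the next step on the setup path; leaves state unchanged on failure. */
bool conn_modify(struct conn_qp *qp, enum conn_state next)
{
    bool ok = (qp->state == CONN_RESET && next == CONN_INIT) ||
              (qp->state == CONN_INIT  && next == CONN_RTR)  ||
              (qp->state == CONN_RTR   && next == CONN_RTS);
    if (ok)
        qp->state = next;
    return ok;
}

/* Incoming packets are processed from RTR onward. */
bool conn_can_receive(const struct conn_qp *qp)
{
    return qp->state == CONN_RTR || qp->state == CONN_RTS;
}

/* Outgoing work requests are processed only in RTS. */
bool conn_can_send(const struct conn_qp *qp)
{
    return qp->state == CONN_RTS;
}

/* Walks RESET -> INIT -> RTR -> RTS using the values exchanged over the sideband. */
bool conn_setup(struct conn_qp *qp, uint8_t port,
                struct conn_endpoint local, struct conn_endpoint remote)
{
    if (!conn_modify(qp, CONN_INIT))
        return false;
    qp->port = port;

    if (!conn_modify(qp, CONN_RTR))
        return false;
    qp->dest_qpn = remote.qpn;
    qp->rq_psn = remote.psn;   /* we expect what the peer will send first */

    if (!conn_modify(qp, CONN_RTS))
        return false;
    qp->sq_psn = local.psn;    /* the peer was told to expect this value */
    return true;
}
```

Calling `conn_setup` on a freshly reset `conn_qp` leaves it in `CONN_RTS`; trying to jump straight from `CONN_RESET` to `CONN_RTS` with `conn_modify` is refused.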
4. RDMA Operations
RDMA verbs split into two-sided messaging and one-sided memory operations. SEND consumes a posted receive on the peer; WRITE, READ, and ATOMIC operate on a registered remote address using an rkey.
WRITE is the common bulk transfer primitive because the remote CPU is not scheduled and no receive buffer is required. WRITE_WITH_IMM is the exception: it also delivers a small immediate value to a posted receive completion.
READ reverses the direction of data movement. The initiator asks the remote HCA to fetch from a registered memory range, and the work request completes only after the local destination buffer is filled.
Minimal C Demo - Operation Comparison
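The comparison above can be captured in a small lookup table. This is a plain-C sketch with hypothetical names (`rdma_op_kind`, `rdma_op_traits`, `rdma_op_lookup`, `peer_recvs_needed`), not part of any verbs library. It treats an operation as one-sided when it needs an rkey and consumes no posted receive on the peer, which is this sketch's working definition.

```c
#include <stdbool.h>
#include <stddef.h>

enum rdma_op_kind {
    OP_SEND, OP_WRITE, OP_WRITE_WITH_IMM, OP_READ, OP_ATOMIC, OP_COUNT
};

struct rdma_op_traits {
    const char *name;
    bool consumes_peer_recv;        /* peer must have a receive posted */
    bool needs_rkey;                /* targets a registered remote address */
    bool returns_data_to_initiator; /* result lands in a local buffer */
};

static const struct rdma_op_traits op_table[OP_COUNT] = {
    [OP_SEND]           = { "SEND",           true,  false, false },
    [OP_WRITE]          = { "WRITE",          false, true,  false },
    [OP_WRITE_WITH_IMM] = { "WRITE_WITH_IMM", true,  true,  false },
    [OP_READ]           = { "READ",           false, true,  true  },
    [OP_ATOMIC]         = { "ATOMIC",         false, true,  true  },
};

/* Returns the traits for an operation, or NULL for an out-of-range kind. */
const struct rdma_op_traits *rdma_op_lookup(enum rdma_op_kind kind)
{
    if ((int)kind < 0 || kind >= OP_COUNT)
        return NULL;
    return &op_table[kind];
}

/* One-sided here means: uses an rkey and needs no receive on the peer. */
bool rdma_op_is_one_sided(enum rdma_op_kind kind)
{
    const struct rdma_op_traits *t = rdma_op_lookup(kind);
    return t != NULL && t->needs_rkey && !t->consumes_peer_recv;
}

/* Counts how many receives the peer must pre-post for a batch of operations. */
size_t peer_recvs_needed(const enum rdma_op_kind *ops, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        const struct rdma_op_traits *t = rdma_op_lookup(ops[i]);
        if (t != NULL && t->consumes_peer_recv)
            count++;
    }
    return count;
}
```

For a batch of SEND, WRITE, WRITE_WITH_IMM, and READ, `peer_recvs_needed` reports two receives: one for the SEND and one for the immediate value.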
Minimal C Demo - WRITE Size Tradeoff
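One way to reason about WRITE size is a linear cost model: each WRITE pays a fixed per-operation cost plus wire time proportional to its length. The sketch below assumes that model; it is not a measurement, and the function names and any parameter values passed in are hypothetical. It requires a positive fixed cost and a positive bandwidth.

```c
/* Time for one WRITE under the assumed linear model:
 * fixed per-operation cost plus wire time. Units: microseconds. */
double write_time_us(double fixed_us, double bytes_per_us, double bytes)
{
    return fixed_us + bytes / bytes_per_us;
}

/* Fraction of peak bandwidth achieved at this message size (between 0 and 1).
 * Small messages are dominated by the fixed cost; large ones approach 1. */
double write_efficiency(double fixed_us, double bytes_per_us, double bytes)
{
    double wire_us = bytes / bytes_per_us;
    return wire_us / (fixed_us + wire_us);
}

/* Message size at which wire time equals the fixed cost,
 * so that exactly half of peak bandwidth is reached. */
double write_half_bandwidth_bytes(double fixed_us, double bytes_per_us)
{
    return fixed_us * bytes_per_us;
}
```

With an assumed fixed cost of 2 microseconds and 12500 bytes per microsecond (about 100 Gb/s), the model puts the half-bandwidth point at 25000 bytes: a 64-byte WRITE uses well under 1% of peak, while a 1 MiB WRITE is above 95%. The tradeoff is that batching into larger WRITEs improves throughput at the cost of per-message latency.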
5. Memory Region Deep Dive
Registration is a control-plane cost, not a per-I/O operation. The kernel pins pages, programs IOMMU translation, and tells the HCA which virtual range can be accessed with the new keys.
| Strategy | Best fit | Tradeoff |
|---|---|---|
| Register once | Static pools, MPI buffers, storage queues | Simple and fast, but pins memory for the process lifetime. |
| MR cache | Frequent reuse of similar buffers | Avoids repeated registration but needs eviction and reuse tracking. |
| ODP | Sparse or huge virtual ranges | Lower upfront pinning, but first-touch page faults add latency. |
| Fast registration | Per-I/O storage mappings | Rekeys subranges faster than full registration, with more complex lifetime rules. |
Minimal C Demo - Registration Cost
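The strategies in the table can be compared with a simple amortization model. The sketch below is plain C with hypothetical names and hypothetical costs supplied by the caller. It computes total time for register-once and register-per-I/O, and simulates a small MR cache to count how many registrations a buffer access pattern would trigger. The cache uses FIFO eviction purely to keep the example short; that policy is an assumption, not something the strategies above prescribe.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MR_CACHE_MAX 16  /* upper bound on cache slots in this sketch */

/* One registration up front, then n_ios transfers. */
uint64_t cost_register_once_ns(uint64_t reg_ns, uint64_t io_ns, uint64_t n_ios)
{
    return reg_ns + n_ios * io_ns;
}

/* A fresh registration before every transfer. */
uint64_t cost_register_per_io_ns(uint64_t reg_ns, uint64_t io_ns, uint64_t n_ios)
{
    return n_ios * (reg_ns + io_ns);
}

/* Registration is paid only on cache misses. */
uint64_t cost_mr_cache_ns(uint64_t reg_ns, uint64_t io_ns, uint64_t n_ios,
                          uint64_t misses)
{
    return misses * reg_ns + n_ios * io_ns;
}

/* Counts registrations (misses) for a sequence of buffer ids with a
 * FIFO-evicting cache of the given capacity (clamped to MR_CACHE_MAX). */
size_t mr_cache_misses(const uint32_t *buf_ids, size_t n, size_t capacity)
{
    uint32_t slots[MR_CACHE_MAX];
    size_t used = 0, next_victim = 0, misses = 0;

    if (capacity > MR_CACHE_MAX)
        capacity = MR_CACHE_MAX;

    for (size_t i = 0; i < n; i++) {
        bool hit = false;
        for (size_t j = 0; j < used; j++) {
            if (slots[j] == buf_ids[i]) {
                hit = true;
                break;
            }
        }
        if (hit)
            continue;

        misses++;
        if (capacity == 0)
            continue;                       /* nothing can be cached */
        if (used < capacity) {
            slots[used++] = buf_ids[i];     /* fill an empty slot */
        } else {
            slots[next_victim] = buf_ids[i];            /* evict oldest */
            next_victim = (next_victim + 1) % capacity;
        }
    }
    return misses;
}
```

For the access pattern 1, 2, 1, 2, 3, 1 a two-slot cache registers four times and a three-slot cache three times, which shows why cache sizing and eviction tracking matter for the MR cache row.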
6. QP Transports: RC, UC, UD, XRC
QP type determines reliability, supported operations, and scaling behavior. RC is the default for one-sided verbs; UD is the scalable datagram mode; XRC reduces receive-side state for many-to-one services.
XRC matters when many initiators connect to a target process. Instead of allocating a receive queue per remote peer, the target can share receive work requests through a shared receive queue.
Minimal C Demo - QP Scaling
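A back-of-the-envelope sketch of the scaling difference between connected and datagram transports, in plain C with hypothetical function names. It assumes a full mesh of `n` processes in which every process talks to every other, one RC QP per peer, and a single UD QP per process that can address any peer. XRC is left out because its count depends on how processes are grouped per node.

```c
#include <stdint.h>

/* RC: one connected QP per remote peer. */
uint64_t rc_qps_per_process(uint64_t n)
{
    return n > 0 ? n - 1 : 0;
}

/* RC: QP endpoints across the whole mesh, n * (n - 1). */
uint64_t rc_qps_cluster(uint64_t n)
{
    return n * rc_qps_per_process(n);
}

/* UD: a single QP can send to any peer. */
uint64_t ud_qps_per_process(uint64_t n)
{
    return n > 0 ? 1 : 0;
}

/* UD: one QP per process across the mesh. */
uint64_t ud_qps_cluster(uint64_t n)
{
    return n * ud_qps_per_process(n);
}
```

At 1024 processes the model gives 1023 RC QPs per process and 1,047,552 across the mesh, against 1024 UD QPs in total. The UD saving comes with datagram semantics, so it does not replace RC where one-sided verbs are needed.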
7. Kernel Source Pointers
- drivers/infiniband/core/uverbs_main.c - user verbs device file and command dispatch.
- drivers/infiniband/core/uverbs_cmd.c - create PD, MR, CQ, QP, and modify QP command paths.
- drivers/infiniband/core/umem.c - user memory pinning and DMA mapping for registered memory.
- drivers/infiniband/hw/mlx5/ - mlx5 provider implementation for ConnectX HCAs.
- include/rdma/ib_verbs.h - kernel-level verbs structs and enums.
8. Interview Prep
| Question | Concise answer |
|---|---|
| Why does RDMA need memory registration? | The HCA needs pinned pages, DMA translation, access flags, and keys; otherwise remote DMA would be unsafe and pages could move or swap. |
| What is the difference between lkey and rkey? | lkey authorizes local HCA DMA from a process buffer; rkey is shared with a peer to authorize remote READ/WRITE/ATOMIC access. |
| Why pre-post receives for SEND? | SEND is two-sided; the receiver HCA needs an RQ buffer before the payload arrives. Without one, an RC sender gets receiver-not-ready (RNR) NAKs and can fail the work request once its RNR retry budget is exhausted. |
| Why is RDMA WRITE often faster than READ? | WRITE is a push path: the initiator streams the data and, on RC, completes once the responder acknowledges it. READ must send a request and wait for the response data to arrive before it completes. |
| Why can RC QPs become a scaling problem? | All-to-all RC requires per-peer connected state, retries, PSNs, and queues, which grows roughly as N squared. |