Tech Notes

28. RDMA Fundamentals

1. Why RDMA Exists

RDMA moves the hot data path out of kernel sockets and into the host channel adapter (HCA). The application registers memory once, posts work requests from user space, and the HCA performs DMA directly into local or remote application memory.

The hardware contract is strict: memory must be pinned and protected, the HCA must know the virtual-to-physical translation, and a remote peer must present the correct key before any DMA is accepted.

2. Verbs Objects: PD, MR, CQ, QP, WR

The verbs API is the user-space control plane for the HCA. A protection domain owns the resources; memory regions expose buffers; queue pairs hold work queues; completion queues report finished work requests.

Object	Created by	What it protects or stores
PD	`ibv_alloc_pd`	Namespace that prevents unrelated QPs and MRs from mixing.
MR	`ibv_reg_mr`	Pinned buffer, access flags, local key, remote key.
CQ	`ibv_create_cq`	Completion entries for signaled sends and receives.
QP	`ibv_create_qp`	Send queue, receive queue, QPN, transport type, and state.
WR / WC	`ibv_post_send`, `ibv_post_recv` / `ibv_poll_cq`	Work request submitted by the app; work completion returned by the HCA.

MR registration maps a virtual range into an HCA-visible page list. The returned lkey authorizes local DMA; the rkey is the capability a peer must present for READ or WRITE.
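
The rkey check described above can be sketched as a toy model in plain C. This is an illustration under assumptions, not libibverbs code: `toy_region`, `toy_remote_access_ok`, and the `TOY_REMOTE_*` flags are hypothetical names, and a real HCA performs the equivalent check in hardware against its own translation tables.

```c
#include <stdint.h>

/* Hypothetical access bits for this sketch (not the libibverbs values). */
enum { TOY_REMOTE_READ = 1u, TOY_REMOTE_WRITE = 2u };

/* What a toy HCA remembers about one registered memory region. */
struct toy_region {
    uint64_t base;    /* first registered virtual address */
    uint64_t len;     /* registered length in bytes */
    uint32_t rkey;    /* capability handed to the peer */
    unsigned access;  /* bitmask of TOY_REMOTE_* rights */
};

/* Return 1 if a remote request may touch [addr, addr + len), else 0. */
int toy_remote_access_ok(const struct toy_region *r, uint32_t rkey,
                         uint64_t addr, uint64_t len, unsigned op)
{
    if (rkey != r->rkey)               return 0; /* wrong capability */
    if ((r->access & op) != op)        return 0; /* right not granted */
    if (addr < r->base)                return 0; /* starts below the region */
    if (len > r->len)                  return 0; /* longer than the region */
    if (addr - r->base > r->len - len) return 0; /* runs past the end */
    return 1;
}
```

The bounds test is written with subtractions rather than `addr + len` so that a hostile length cannot wrap around and slip past the check.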

Minimal C Demo - Verbs Object Creation
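
A minimal sketch of the creation order, assuming a toy model rather than a real device. It does not call libibverbs; every `toy_*` name is hypothetical and only mirrors the roles in the table above (PD first, then MR and CQ, then the QP that ties them together).

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for verbs objects; these are NOT the libibverbs structs. */
struct toy_pd { uint32_t handle; };
struct toy_cq { int depth; };
struct toy_mr { uint32_t pd_handle; void *addr; size_t len; uint32_t lkey, rkey; };
struct toy_qp { uint32_t pd_handle; uint32_t qpn; const struct toy_cq *send_cq, *recv_cq; };

static uint32_t toy_next_id = 1; /* shared counter for handles, keys, QPNs */

/* Step 1: the PD comes first because MRs and QPs are created inside it. */
struct toy_pd toy_alloc_pd(void)
{
    struct toy_pd pd = { toy_next_id++ };
    return pd;
}

/* Step 2: registering a buffer yields an MR that carries both keys. */
struct toy_mr toy_reg_mr(const struct toy_pd *pd, void *buf, size_t len)
{
    struct toy_mr mr = { pd->handle, buf, len, 0, 0 };
    mr.lkey = 0x1000u + toy_next_id++;
    mr.rkey = 0x2000u + toy_next_id++;
    return mr;
}

/* Step 3: the CQ exists before the QP, which reports completions into it. */
struct toy_cq toy_create_cq(int depth)
{
    struct toy_cq cq = { depth };
    return cq;
}

/* Step 4: the QP binds a PD to its send and receive CQs. */
struct toy_qp toy_create_qp(const struct toy_pd *pd,
                            const struct toy_cq *scq, const struct toy_cq *rcq)
{
    struct toy_qp qp = { pd->handle, toy_next_id++, scq, rcq };
    return qp;
}

/* A QP may only use an MR that was registered in the same PD. */
int toy_same_pd(const struct toy_qp *qp, const struct toy_mr *mr)
{
    return qp->pd_handle == mr->pd_handle;
}
```

Creating two PDs and checking `toy_same_pd` for a QP from each shows the isolation role of the PD: the MR is usable only from the QP that shares its domain.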

3. Queue Pair State Machine

A QP starts in RESET and cannot carry traffic. The application uses `ibv_modify_qp` to install local port data, remote address data, packet sequence numbers, and retry policy before the HCA may transmit.

The remote QPN, LID or GID, PSN, address, and rkey usually arrive over a sideband channel such as TCP, RDMA CM, gRPC, MPI bootstrap, or an application-specific control plane.

Background - Why RTR Before RTS

RDMA hardware can transmit as soon as a QP is ready to send, so the peer must first be ready to receive or accept one-sided access. The connection setup therefore installs receive-side state before send-side state.

Plan

Exchange endpoint identifiers and initial PSNs over the sideband.
Move each QP to INIT with local port and access flags.
Move each QP to RTR with the remote destination and receive PSN.
Move each QP to RTS with local send PSN, timeout, and retry controls.

Minimal C Demo - QP Connection Setup
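
The plan above can be sketched as a small state machine. This is a toy model under assumptions: it does not call `ibv_modify_qp`, the `toy_*` names and the field layout are hypothetical, and it only enforces the forward path RESET, INIT, RTR, RTS.

```c
#include <stdint.h>

/* Toy QP states for the RESET -> INIT -> RTR -> RTS path. */
enum toy_qp_state { TOY_RESET, TOY_INIT, TOY_RTR, TOY_RTS };

struct toy_qp_sm {
    enum toy_qp_state state;
    uint8_t  port;      /* local port, set at INIT */
    unsigned access;    /* access flags, set at INIT */
    uint32_t dest_qpn;  /* remote QPN, set at RTR */
    uint32_t rq_psn;    /* first PSN expected from the peer, set at RTR */
    uint32_t sq_psn;    /* first PSN this side will send, set at RTS */
    unsigned timeout;   /* set at RTS */
    unsigned retry_cnt; /* set at RTS */
};

/* Each helper returns 0 on success and -1 if called from the wrong state. */
int toy_to_init(struct toy_qp_sm *qp, uint8_t port, unsigned access)
{
    if (qp->state != TOY_RESET) return -1;
    qp->port = port;
    qp->access = access;
    qp->state = TOY_INIT;
    return 0;
}

int toy_to_rtr(struct toy_qp_sm *qp, uint32_t dest_qpn, uint32_t rq_psn)
{
    if (qp->state != TOY_INIT) return -1;
    qp->dest_qpn = dest_qpn;
    qp->rq_psn = rq_psn;
    qp->state = TOY_RTR;
    return 0;
}

int toy_to_rts(struct toy_qp_sm *qp, uint32_t sq_psn,
               unsigned timeout, unsigned retry_cnt)
{
    if (qp->state != TOY_RTR) return -1;
    qp->sq_psn = sq_psn;
    qp->timeout = timeout;
    qp->retry_cnt = retry_cnt;
    qp->state = TOY_RTS;
    return 0;
}

/* Incoming packets are accepted from RTR onward. */
int toy_can_receive(const struct toy_qp_sm *qp) { return qp->state >= TOY_RTR; }

/* Transmission is allowed only once the QP has reached RTS. */
int toy_can_send(const struct toy_qp_sm *qp) { return qp->state == TOY_RTS; }
```

Because `toy_to_rts` refuses to run before `toy_to_rtr`, a QP in this model is always able to receive before it is able to send, which is the ordering the background note argues for.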

4. RDMA Operations

RDMA verbs split into two-sided messaging and one-sided memory operations. SEND consumes a posted receive on the peer; WRITE, READ, and ATOMIC operate on a registered remote address using an rkey.

WRITE is the common bulk transfer primitive because the remote CPU is not involved in the transfer and no receive buffer is required. WRITE_WITH_IMM is the exception: it also delivers a small immediate value to a posted receive completion.

READ reverses the direction of data movement. The initiator asks the remote HCA to fetch from a registered memory range, and the request completes only after the local destination buffer has been filled.

Minimal C Demo - Operation Comparison
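
The distinctions above can be captured in a small lookup table. This is a sketch: the struct and function names are hypothetical teaching aids, and the flags encode only what the prose states, plus the fact that atomics return the previous remote value to the initiator.

```c
#include <string.h>

/* Per-operation properties; a hypothetical summary, not a verbs API. */
struct toy_op_info {
    const char *name;
    int consumes_peer_recv; /* needs a receive posted on the peer */
    int needs_rkey;         /* targets a registered remote address */
    int returns_data;       /* remote data comes back to the initiator */
};

static const struct toy_op_info toy_ops[] = {
    { "SEND",           1, 0, 0 },
    { "WRITE",          0, 1, 0 },
    { "WRITE_WITH_IMM", 1, 1, 0 },
    { "READ",           0, 1, 1 },
    { "ATOMIC",         0, 1, 1 },
};

/* Find an operation by name; returns NULL for unknown names. */
const struct toy_op_info *toy_op_lookup(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof toy_ops / sizeof toy_ops[0]; i++)
        if (strcmp(toy_ops[i].name, name) == 0)
            return &toy_ops[i];
    return NULL;
}

/* In this sketch, one-sided means the peer posts no receive for the operation. */
int toy_is_one_sided(const struct toy_op_info *op)
{
    return !op->consumes_peer_recv;
}
```

Under this definition WRITE_WITH_IMM is not counted as one-sided, since it consumes a posted receive even though its payload lands through an rkey.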

Minimal C Demo - WRITE Size Tradeoff
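
A rough way to reason about WRITE size is a two-term cost model: a fixed per-operation overhead plus the time the payload spends on the wire. The sketch below assumes exactly that, with one WRITE outstanding at a time and no pipelining; the function names and any overhead or link-rate values passed in are illustrative assumptions, not measurements.

```c
/* Time for one WRITE in nanoseconds: fixed overhead plus wire time.
 * link_gbps is in gigabits per second, so one byte takes 8 / link_gbps ns. */
double toy_write_time_ns(double bytes, double overhead_ns, double link_gbps)
{
    return overhead_ns + bytes * 8.0 / link_gbps;
}

/* Goodput in Gb/s when WRITEs of this size are issued one at a time. */
double toy_write_goodput_gbps(double bytes, double overhead_ns, double link_gbps)
{
    return bytes * 8.0 / toy_write_time_ns(bytes, overhead_ns, link_gbps);
}

/* Message size at which wire time equals the overhead, so goodput
 * reaches half of the link rate. */
double toy_half_rate_bytes(double overhead_ns, double link_gbps)
{
    return overhead_ns * link_gbps / 8.0;
}
```

With an assumed 1000 ns overhead on a 100 Gb/s link, `toy_half_rate_bytes(1000.0, 100.0)` gives 12500 bytes: smaller WRITEs are dominated by overhead, larger ones by wire time. Keeping several WRITEs in flight hides part of the overhead, which this model deliberately ignores.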

5. Memory Region Deep Dive

Registration is a control-plane cost, not a per-I/O operation. The kernel pins pages, programs IOMMU translation, and tells the HCA which virtual range can be accessed with the new keys.

Strategy	Best fit	Tradeoff
Register once	Static pools, MPI buffers, storage queues	Simple and fast, but pins memory for the process lifetime.
MR cache	Frequent reuse of similar buffers	Avoids repeated registration but needs eviction and reuse tracking.
ODP	Sparse or huge virtual ranges	Lower upfront pinning, but first-touch page faults add latency.
Fast registration	Per-I/O storage mappings	Rekeys subranges faster than full registration, with more complex lifetime rules.

Minimal C Demo - Registration Cost
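
Registration cost can be approximated as a fixed part plus a part that scales with the number of pages pinned, then spread over however many I/Os reuse the region. The sketch below assumes that simple linear model; the function names are hypothetical and the cost constants are caller-supplied assumptions, not measured values.

```c
#include <stdint.h>

/* Number of pages touched by the byte range [addr, addr + len). */
uint64_t toy_pages_spanned(uint64_t addr, uint64_t len, uint64_t page_size)
{
    uint64_t first, last;
    if (len == 0)
        return 0;
    first = addr / page_size;
    last  = (addr + len - 1) / page_size;
    return last - first + 1;
}

/* Modeled one-time registration cost: a fixed part plus a per-page part. */
double toy_reg_cost_ns(uint64_t pages, double fixed_ns, double per_page_ns)
{
    return fixed_ns + (double)pages * per_page_ns;
}

/* Registration cost charged to each I/O when the MR is reused `reuses` times.
 * With zero reuses the full cost is returned unchanged. */
double toy_amortized_ns(double reg_cost_ns, uint64_t reuses)
{
    return reuses ? reg_cost_ns / (double)reuses : reg_cost_ns;
}
```

A 1 MiB buffer on 4 KiB pages spans 256 pages. With assumed costs of 2000 ns fixed and 100 ns per page the registration totals 27600 ns, which drops to under 28 ns per I/O once the region is reused 1000 times. That amortization is the argument for the register-once and MR-cache strategies in the table.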

6. QP Transports: RC, UC, UD, XRC

QP type determines reliability, supported operations, and scaling behavior. RC is the default for one-sided verbs; UD is the scalable datagram mode; XRC reduces receive-side state for many-to-one services.

XRC matters when many initiators connect to a target process. Instead of allocating a receive queue per remote peer, the target can share receive work requests through a shared receive queue.

Minimal C Demo - QP Scaling
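
The scaling pressure is easy to see with a few counting functions. This is a sketch under assumptions: a full all-to-all mesh in which every process holds one RC QP per remote process, hypothetical function names, and a per-QP memory footprint supplied by the caller rather than taken from any particular HCA.

```c
#include <stdint.h>

/* QPs one process needs to reach every other process over RC. */
uint64_t toy_rc_qps_per_process(uint64_t nodes, uint64_t procs_per_node)
{
    uint64_t total = nodes * procs_per_node;
    return total ? total - 1 : 0;
}

/* QP endpoints across the whole job for an all-to-all RC mesh. */
uint64_t toy_rc_qps_total(uint64_t nodes, uint64_t procs_per_node)
{
    uint64_t total = nodes * procs_per_node;
    return total * toy_rc_qps_per_process(nodes, procs_per_node);
}

/* UD is connectionless: one QP can address every peer. */
uint64_t toy_ud_qps_per_process(void)
{
    return 1;
}

/* Memory held by QP state, given an assumed per-QP footprint in bytes. */
uint64_t toy_qp_bytes(uint64_t qps, uint64_t bytes_per_qp)
{
    return qps * bytes_per_qp;
}

/* Receive WRs a target keeps posted with one receive queue per peer.
 * With a shared receive queue the count is just the SRQ depth. */
uint64_t toy_recv_wrs_per_peer_rq(uint64_t peers, uint64_t depth)
{
    return peers * depth;
}
```

For 100 nodes with 10 processes each, every process holds 999 RC QPs and the job holds 999000 QP endpoints in total, against a single QP per process for UD. The last helper shows why sharing receive work requests matters: 999 peers at an assumed depth of 64 means 63936 posted receives, where a shared receive queue needs only its own depth.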

7. Kernel Source Pointers

drivers/infiniband/core/uverbs_main.c - user verbs device file and command dispatch.
drivers/infiniband/core/uverbs_cmd.c - create PD, MR, CQ, QP, and modify QP command paths.
drivers/infiniband/core/umem.c - user memory pinning and DMA mapping for registered memory.
drivers/infiniband/hw/mlx5/ - mlx5 provider implementation for ConnectX HCAs.
include/rdma/ib_verbs.h - kernel-level verbs structs and enums.

8. Interview Prep

Question	Concise answer
Why does RDMA need memory registration?	The HCA needs pinned pages, DMA translation, access flags, and keys; otherwise remote DMA would be unsafe and pages could move or swap.
What is the difference between lkey and rkey?	lkey authorizes local HCA DMA from a process buffer; rkey is shared with a peer to authorize remote READ/WRITE/ATOMIC access.
Why pre-post receives for SEND?	SEND is two-sided; the receiver HCA needs an RQ buffer before the incoming payload arrives, otherwise the sender gets receiver-not-ready (RNR) NAKs and must retry.
Why is RDMA WRITE often faster than READ?	WRITE is a push path: data travels one way and the initiator needs only a transport-level acknowledgment; READ needs a request and a full data response before it completes.
Why can RC QPs become a scaling problem?	All-to-all RC requires per-peer connected state, retries, PSNs, and queues, which grows roughly as N squared.
