Part XXIV - SPDK

32. SPDK Kernel-Bypass Storage

Reactor threads, userspace NVMe queues, BDEV composition, NVMe-oF targets, BlobStore, vhost-user, and the DPDK shared runtime.

1. Why SPDK Exists

Modern NVMe media can complete small I/O in single-digit or low-double-digit microseconds. At that scale, syscall crossings, per-I/O allocations, request-queue scheduling, interrupts, and wakeups become part of the latency budget. SPDK moves the NVMe driver into userspace, owns the PCI device through vfio-pci, maps BAR registers and DMA queues, and polls completions instead of sleeping.

Minimal C Demo - Latency Budget
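
The budget argument above can be made concrete with a small sketch. The stage names follow the costs listed in this section, but every nanosecond value is an invented placeholder chosen for illustration, not a measurement, and the helper names (`path_overhead_ns`, `overhead_percent`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One stage of the per-I/O software path and its cost in nanoseconds.
 * All costs below are made-up placeholders, not measurements. */
struct stage {
    const char *name;
    unsigned    cost_ns;
};

/* Hypothetical interrupt-driven kernel path. */
static const struct stage kernel_path[] = {
    { "syscall entry/exit",      500 },
    { "bio/request allocation",  300 },
    { "block-layer scheduling",  400 },
    { "interrupt handling",     1000 },
    { "thread wakeup",          1500 },
};

/* Hypothetical polled userspace path. */
static const struct stage polled_path[] = {
    { "fill SQ entry",            100 },
    { "doorbell MMIO write",      100 },
    { "poll CQ and run callback", 200 },
};

#define STAGE_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Sum the software overhead of one path. */
static unsigned path_overhead_ns(const struct stage *s, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += s[i].cost_ns;
    return total;
}

/* Software overhead as a percentage of end-to-end latency
 * (media time plus software overhead), rounded down. */
static unsigned overhead_percent(unsigned media_ns, unsigned overhead_ns)
{
    assert(media_ns + overhead_ns > 0);
    return (100u * overhead_ns) / (media_ns + overhead_ns);
}
```

With these placeholder numbers and an assumed 10 microsecond media time, the kernel-style path spends roughly a quarter of each I/O in software, while the polled path spends a few percent. The point is the shape of the comparison, not the specific values.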

2. Reactor Model

An SPDK reactor is an OS thread pinned to an lcore. It repeatedly polls NVMe completions, network transports, timers, BlobStore work, and its message queue. Because each hot resource is owned by one reactor, the common path avoids locks; cross-core coordination uses spdk_thread_send_msg().

Message passing is the escape hatch when reactor A needs reactor B to mutate B-owned state. The sender enqueues work and immediately returns; the receiver runs the function in its own loop, where the invariants are local again.

Minimal C Demo - Reactor Iteration
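
One reactor iteration, including the message-passing path described above, can be sketched as a toy loop. This is a single-threaded model under simplifying assumptions: the `toy_` names are hypothetical, the ring is not multi-producer safe as a real cross-thread ring would need to be, and nothing here is SPDK's actual implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POLLERS 8
#define MSG_RING    16

typedef int  (*poller_fn)(void *ctx);   /* returns number of events handled */
typedef void (*msg_fn)(void *ctx);

struct msg { msg_fn fn; void *ctx; };

/* Toy reactor: a list of pollers plus a ring of queued messages. */
struct toy_reactor {
    poller_fn  pollers[MAX_POLLERS];
    void      *poller_ctx[MAX_POLLERS];
    size_t     npollers;
    struct msg ring[MSG_RING];
    size_t     head, tail;   /* head: next to run, tail: next free slot */
};

static int toy_poller_register(struct toy_reactor *r, poller_fn fn, void *ctx)
{
    if (r->npollers == MAX_POLLERS)
        return -1;
    r->pollers[r->npollers] = fn;
    r->poller_ctx[r->npollers] = ctx;
    r->npollers++;
    return 0;
}

/* Sender side: enqueue work for the reactor and return immediately. */
static int toy_send_msg(struct toy_reactor *r, msg_fn fn, void *ctx)
{
    if (r->tail - r->head == MSG_RING)
        return -1;                       /* ring full */
    r->ring[r->tail % MSG_RING] = (struct msg){ fn, ctx };
    r->tail++;
    return 0;
}

/* One loop iteration: drain queued messages, then run every poller once.
 * Returns the amount of work done so the caller can detect an idle pass. */
static int toy_reactor_run_once(struct toy_reactor *r)
{
    int work = 0;
    while (r->head != r->tail) {
        struct msg m = r->ring[r->head % MSG_RING];
        r->head++;
        m.fn(m.ctx);                     /* runs on the owning reactor */
        work++;
    }
    for (size_t i = 0; i < r->npollers; i++)
        work += r->pollers[i](r->poller_ctx[i]);
    return work;
}

/* Example callbacks: a message that mutates reactor-owned state, and a
 * poller that completes one pending item per call. */
static void add_ten(void *ctx) { *(int *)ctx += 10; }

static int drain_one(void *ctx)
{
    int *pending = ctx;
    if (*pending == 0)
        return 0;
    (*pending)--;
    return 1;
}
```

Sending `add_ten` changes nothing until the next `toy_reactor_run_once()` call, which mirrors the rule that the receiver, not the sender, executes the function. A callback that never returns would stall both the message ring and every poller in the loop.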

3. Userspace NVMe Driver

SPDK creates admin and I/O queue pairs directly in DMA memory. A command is just a ring entry plus an MMIO doorbell: fill the SQ slot, write the SQ tail doorbell, poll the CQ phase tag, consume the CQE, and write the CQ head doorbell.

The phase tag is what lets a busy-loop distinguish a newly written completion from a stale ring entry after wraparound. No I/O interrupt is needed on the hot path.

Object | Role | Interview detail
Admin queue pair | Identify, feature setup, queue creation. | Used for controller management, not bulk data I/O.
I/O queue pair | Per-reactor SQ/CQ pair for reads and writes. | One qpair per thread avoids shared queue locks.
PRP/SGL | Physical region pages or scatter/gather list. | Describes DMA buffers without copying user data.
Doorbell | MMIO register in the NVMe BAR. | Host writes tails/heads to notify controller progress.

Minimal C Demo - Queue Pair Walkthrough
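
The submit, poll, and phase-tag steps above can be walked through with a simulated queue pair. This is a sketch under simplifying assumptions: the controller is a plain function, doorbells are ordinary struct fields rather than MMIO registers, the entry layouts are reduced to a few fields, and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define QDEPTH 4   /* tiny ring so wraparound happens quickly */

struct sqe { uint16_t cid; uint8_t opcode; };
struct cqe { uint16_t cid; uint16_t sq_head; uint8_t phase; };

struct toy_qpair {
    struct sqe sq[QDEPTH];
    struct cqe cq[QDEPTH];
    /* Host-side state. */
    uint16_t sq_tail, sq_head, cq_head;
    uint8_t  host_phase;
    /* Doorbell "registers" written by the host. */
    uint16_t db_sq_tail, db_cq_head;
    /* Controller-side state. */
    uint16_t ctrl_sq_head, ctrl_cq_tail;
    uint8_t  ctrl_phase;
};

static void toy_qpair_init(struct toy_qpair *q)
{
    /* CQ memory starts zeroed (phase 0), so the first pass uses phase 1. */
    *q = (struct toy_qpair){ .host_phase = 1, .ctrl_phase = 1 };
}

/* Host: fill an SQ slot, then write the SQ tail doorbell. */
static int toy_submit(struct toy_qpair *q, uint16_t cid, uint8_t opcode)
{
    uint16_t next = (uint16_t)((q->sq_tail + 1) % QDEPTH);
    if (next == q->sq_head)
        return -1;                        /* SQ full (one slot stays unused) */
    q->sq[q->sq_tail] = (struct sqe){ cid, opcode };
    q->sq_tail = next;
    q->db_sq_tail = q->sq_tail;
    return 0;
}

/* Simulated controller: consume SQEs up to the doorbell, post one CQE each. */
static void toy_controller_run(struct toy_qpair *q)
{
    while (q->ctrl_sq_head != q->db_sq_tail) {
        uint16_t next_cq = (uint16_t)((q->ctrl_cq_tail + 1) % QDEPTH);
        if (next_cq == q->db_cq_head)
            break;                        /* CQ full: wait for the host */
        struct sqe cmd = q->sq[q->ctrl_sq_head];
        q->ctrl_sq_head = (uint16_t)((q->ctrl_sq_head + 1) % QDEPTH);
        q->cq[q->ctrl_cq_tail] =
            (struct cqe){ cmd.cid, q->ctrl_sq_head, q->ctrl_phase };
        q->ctrl_cq_tail = next_cq;
        if (q->ctrl_cq_tail == 0)
            q->ctrl_phase ^= 1;           /* flip phase on every wrap */
    }
}

/* Host: consume CQEs whose phase matches, then write the CQ head doorbell.
 * Stores up to max completed command IDs in cids and returns the count. */
static int toy_process_completions(struct toy_qpair *q, uint16_t *cids, int max)
{
    int n = 0;
    assert(q && cids);
    while (n < max && q->cq[q->cq_head].phase == q->host_phase) {
        cids[n++] = q->cq[q->cq_head].cid;
        q->sq_head = q->cq[q->cq_head].sq_head;  /* frees SQ slots */
        q->cq_head = (uint16_t)((q->cq_head + 1) % QDEPTH);
        if (q->cq_head == 0)
            q->host_phase ^= 1;           /* expect the other phase after wrap */
    }
    if (n > 0)
        q->db_cq_head = q->cq_head;
    return n;
}
```

After the ring wraps, slots still holding completions from the previous pass carry the old phase value, so the polling loop stops at them instead of consuming stale entries.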

4. BDEV Layer

BDEV is SPDK's block-device abstraction. Applications call spdk_bdev_read(), spdk_bdev_write(), spdk_bdev_flush(), or spdk_bdev_unmap(); modules translate those requests to NVMe, malloc, AIO, uring, RBD, RAID, crypto, compression, logical volumes, or QoS.

The important scaling primitive is the I/O channel: each thread gets a per-BDEV channel that maps to underlying per-thread state, such as an NVMe qpair. That is why stacked modules can stay asynchronous without central locks.
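
The per-thread channel idea can be sketched with a toy lookup table. The names (`chan_bdev`, `toy_get_io_channel`, `toy_put_io_channel`) are hypothetical, the thread is identified by an explicit integer argument purely for illustration, and the qpair is reduced to an ID.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 4

/* Per-thread state for one bdev; for NVMe this would wrap a qpair. */
struct toy_channel {
    int      in_use;
    int      refs;        /* times this thread has taken the channel */
    int      qpair_id;    /* stand-in for a per-thread I/O qpair */
    unsigned submitted;   /* I/Os issued through this channel */
};

struct chan_bdev {
    struct toy_channel ch[MAX_THREADS];  /* one slot per thread, never shared */
};

/* Return the calling thread's channel, creating it on first use. */
static struct toy_channel *toy_get_io_channel(struct chan_bdev *b, int thread_id)
{
    if (thread_id < 0 || thread_id >= MAX_THREADS)
        return NULL;
    struct toy_channel *c = &b->ch[thread_id];
    if (!c->in_use) {
        c->in_use = 1;
        c->qpair_id = thread_id + 1;     /* qpair 0 is the admin queue */
        c->submitted = 0;
    }
    c->refs++;
    return c;
}

/* Drop one reference; the last put releases the per-thread state. */
static void toy_put_io_channel(struct toy_channel *c)
{
    assert(c->refs > 0);
    if (--c->refs == 0)
        c->in_use = 0;                   /* a real module would free the qpair */
}

/* Submitting touches only the caller's channel, so no lock is taken. */
static void toy_channel_submit(struct toy_channel *c)
{
    c->submitted++;
}
```

Two threads that call `toy_get_io_channel()` on the same bdev receive different channels, so their submissions never touch the same counters or queue state. That separation is the property the lock-free hot path depends on.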

Minimal C Demo - BDEV Stack Builder
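
Module stacking can be sketched as a chain of structs, each forwarding requests to the one below it. This toy is synchronous and uses hypothetical names throughout; SPDK's real bdev calls are asynchronous and complete through callbacks. The RAM-backed leaf only loosely resembles the malloc module, and the "window" module is an invented example of a virtual bdev that remaps LBAs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16
#define NUM_BLOCKS 8

struct toy_bdev;
typedef int (*rw_fn)(struct toy_bdev *b, int is_write, uint64_t lba, void *buf);

/* A bdev is a name, a request handler, and optionally a base bdev. */
struct toy_bdev {
    const char      *name;
    rw_fn            rw;
    struct toy_bdev *base;        /* NULL for a leaf device */
    void            *ctx;         /* module-private state */
    uint64_t         num_blocks;
};

/* Leaf module: RAM-backed device. */
static int ram_rw(struct toy_bdev *b, int is_write, uint64_t lba, void *buf)
{
    uint8_t *mem = b->ctx;
    if (lba >= b->num_blocks)
        return -1;
    if (is_write)
        memcpy(mem + lba * BLOCK_SIZE, buf, BLOCK_SIZE);
    else
        memcpy(buf, mem + lba * BLOCK_SIZE, BLOCK_SIZE);
    return 0;
}

/* Virtual module: exposes a window of the base bdev starting at an offset. */
struct window { uint64_t offset; };

static int window_rw(struct toy_bdev *b, int is_write, uint64_t lba, void *buf)
{
    struct window *w = b->ctx;
    if (lba >= b->num_blocks)
        return -1;                       /* outside the window */
    assert(b->base != NULL);
    return b->base->rw(b->base, is_write, lba + w->offset, buf);
}

/* Entry points an application would call on the top of the stack. */
static int toy_bdev_write(struct toy_bdev *b, uint64_t lba, void *buf)
{
    return b->rw(b, 1, lba, buf);
}

static int toy_bdev_read(struct toy_bdev *b, uint64_t lba, void *buf)
{
    return b->rw(b, 0, lba, buf);
}
```

The caller only ever sees the top bdev. Writing block 1 through a window with offset 4 lands in block 5 of the RAM device, and the application code would not change if the leaf were swapped for a different backend.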

5. NVMe-oF Target

nvmf_tgt exposes BDEVs as NVMe namespaces over TCP, RDMA, or Fibre Channel. A subsystem has an NQN, host access rules, namespaces, and listeners; each listener binds a transport address such as TCP port 4420 or an RDMA address.

RDMA transport keeps command and response capsules on SEND/RECV and moves data with one-sided verbs. For a host read, the target RDMA WRITEs data into the initiator buffer; for a host write, the target usually RDMA READs the data from the initiator buffer, unless the data arrived inside the command capsule itself.

Minimal C Demo - NVMe-oF Path
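
The RDMA data path described above can be written down as an ordered list of steps. This is a sketch of the sequence only: no verbs are issued, the enum and function names are hypothetical, and the `data_in_capsule` flag models the assumption that a small write may carry its payload inside the command capsule and so skip the RDMA READ.

```c
#include <assert.h>
#include <stddef.h>

enum nvmf_step {
    SEND_COMMAND_CAPSULE,   /* initiator to target, two-sided SEND/RECV */
    RDMA_READ_FROM_HOST,    /* target pulls write data from initiator memory */
    LOCAL_BDEV_IO,          /* target reads or writes its local bdev */
    RDMA_WRITE_TO_HOST,     /* target pushes read data into initiator memory */
    SEND_RESPONSE_CAPSULE   /* target to initiator, two-sided SEND/RECV */
};

/* Fill out[] with the steps for one I/O and return how many were written.
 * The caller must provide room for at least four steps. */
static size_t rdma_io_plan(int is_write, int data_in_capsule,
                           enum nvmf_step *out, size_t max)
{
    size_t n = 0;
    assert(out != NULL && max >= 4);
    out[n++] = SEND_COMMAND_CAPSULE;
    if (is_write) {
        if (!data_in_capsule)
            out[n++] = RDMA_READ_FROM_HOST;   /* fetch payload first */
        out[n++] = LOCAL_BDEV_IO;
    } else {
        out[n++] = LOCAL_BDEV_IO;             /* read media first */
        out[n++] = RDMA_WRITE_TO_HOST;
    }
    out[n++] = SEND_RESPONSE_CAPSULE;
    return n;
}
```

In both directions the target issues the one-sided operation, so the initiator's CPU only handles the command and response capsules.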

6. BlobStore and BlobFS

BlobStore is an object store on top of one BDEV. It allocates large clusters, records blob-to-cluster mappings in metadata pages, and commits metadata atomically with spdk_blob_sync_md(). BlobFS then adds a POSIX-like file namespace where each file is backed by a blob.
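
The split between in-memory mappings and committed metadata can be sketched with a toy single-blob store. Everything here is a hypothetical model: the 1 MiB cluster size is an assumed value, the "disk" copy is just a second struct, and `toy_blob_sync_md` only imitates the role the text assigns to spdk_blob_sync_md().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_CLUSTERS      8
#define MAX_BLOB_CLUSTERS 4
#define CLUSTER_SIZE      (1024u * 1024u)   /* assumed toy value */
#define NO_CLUSTER        UINT32_MAX

struct blob_md {
    uint32_t num_clusters;
    uint32_t cluster[MAX_BLOB_CLUSTERS];  /* logical index to cluster number */
};

struct toy_blobstore {
    uint8_t        used[NUM_CLUSTERS];    /* in-memory allocation map */
    struct blob_md live;                  /* in-memory, possibly dirty */
    struct blob_md on_disk;               /* last committed metadata */
};

/* Grow the blob by one cluster; only in-memory state changes. */
static int toy_blob_grow(struct toy_blobstore *bs)
{
    if (bs->live.num_clusters == MAX_BLOB_CLUSTERS)
        return -1;
    for (uint32_t c = 0; c < NUM_CLUSTERS; c++) {
        if (!bs->used[c]) {
            bs->used[c] = 1;
            bs->live.cluster[bs->live.num_clusters++] = c;
            return 0;
        }
    }
    return -1;                            /* no free cluster */
}

/* Commit: the persisted mapping is replaced in one step. */
static void toy_blob_sync_md(struct toy_blobstore *bs)
{
    bs->on_disk = bs->live;
}

/* Simulated crash and reload: rebuild all state from committed metadata. */
static void toy_reload(struct toy_blobstore *bs)
{
    bs->live = bs->on_disk;
    memset(bs->used, 0, sizeof(bs->used));
    for (uint32_t i = 0; i < bs->live.num_clusters; i++) {
        assert(bs->live.cluster[i] < NUM_CLUSTERS);
        bs->used[bs->live.cluster[i]] = 1;
    }
}

/* Translate a byte offset in the blob to its backing cluster number. */
static uint32_t toy_blob_lookup(const struct toy_blobstore *bs, uint64_t offset)
{
    uint64_t idx = offset / CLUSTER_SIZE;
    return idx < bs->live.num_clusters ? bs->live.cluster[idx] : NO_CLUSTER;
}
```

A cluster allocated after the last sync disappears on reload and becomes free again, which is the behavior the metadata commit is meant to make predictable.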

RocksDB-on-SPDK uses BlobFS to bypass VFS and page cache for SST files. The gain is not magic filesystem semantics; it is the same polling, DMA-buffer, callback-driven path used by the rest of SPDK.

7. vhost-user-blk and vhost-user-scsi

With vhost-user, QEMU negotiates virtio features over a Unix socket, then SPDK polls guest virtqueues directly in shared hugepage memory. The VM still sees a normal virtio-blk or virtio-scsi device, but QEMU is no longer on the hot data path.
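
Polling a guest virtqueue can be sketched with a reduced split-ring model. This is an illustration under heavy simplification: all names are hypothetical, each request is a single buffer rather than a descriptor chain, ring flags are omitted, guest memory is an ordinary pointer, and the memory barriers that real shared-memory code requires are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QSZ 4   /* ring size */

struct toy_desc      { uint8_t *buf; uint32_t len; };        /* guest buffer */
struct toy_avail     { uint16_t idx; uint16_t ring[QSZ]; };  /* guest to backend */
struct toy_used_elem { uint32_t id; uint32_t len; };
struct toy_used      { uint16_t idx; struct toy_used_elem ring[QSZ]; };

struct toy_vq {
    struct toy_desc  desc[QSZ];
    struct toy_avail avail;
    struct toy_used  used;           /* backend to guest */
    uint16_t last_avail_idx;         /* backend-private consume position */
};

/* Guest side: publish descriptor `head` in the avail ring. */
static void guest_submit(struct toy_vq *vq, uint16_t head)
{
    vq->avail.ring[vq->avail.idx % QSZ] = head;
    vq->avail.idx++;                 /* real code orders this with a barrier */
}

/* Backend poller: consume new avail entries, serve each request by
 * filling its buffer, and report completion through the used ring. */
static int backend_poll(struct toy_vq *vq)
{
    int done = 0;
    while (vq->last_avail_idx != vq->avail.idx) {
        uint16_t head = vq->avail.ring[vq->last_avail_idx % QSZ];
        assert(head < QSZ);
        struct toy_desc *d = &vq->desc[head];
        memset(d->buf, 0x5A, d->len);            /* pretend disk read */
        vq->used.ring[vq->used.idx % QSZ] =
            (struct toy_used_elem){ head, d->len };
        vq->used.idx++;
        vq->last_avail_idx++;
        done++;
    }
    return done;
}
```

The backend notices new work by comparing its private index with the guest-written `avail.idx`, so no exit to the hypervisor is needed to hand over a request.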

Mode | Guest view | Operational point
vhost-user-blk | Single virtio block device. | Simple, fast path for one disk-like namespace.
vhost-user-scsi | virtio-scsi controller with LUNs. | Better for hot-add/remove and multiple logical devices.
QEMU emulation | Same virtual device shape. | QEMU thread becomes the bottleneck at high IOPS.

8. DPDK Integration and Deployments

SPDK relies on DPDK EAL for hugepage-backed DMA memory, lcore management, mempools, and rings. That shared model lets a storage target place network polling and NVMe polling on adjacent lcores with buffers from the same hugepage pool.

In a typical deployment, compute nodes use NVMe-oF initiators while storage servers run SPDK targets over RoCE or TCP. BDEV modules add policy such as crypto, QoS, RAID, or logical volumes before requests reach the local NVMe drives.

Minimal C Demo - Lcore Allocation
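
Placing network and NVMe pollers on adjacent lcores can be sketched as a function that expands a core mask into a role plan. The alternating policy and the names (`plan_lcores`, `lcore_role`) are assumptions made for illustration; this is not how SPDK or DPDK actually assigns work.

```c
#include <assert.h>
#include <stdint.h>

enum lcore_role { ROLE_NET_POLL, ROLE_NVME_POLL };

struct lcore_plan {
    int             lcore;
    enum lcore_role role;
};

/* Expand a core mask (for example 0x3C) into a role plan. Set bits are
 * visited from lowest to highest and roles alternate, so each network
 * poller sits next to an NVMe poller. Returns the number of entries
 * written, which never exceeds max. */
static int plan_lcores(uint64_t mask, struct lcore_plan *out, int max)
{
    int n = 0;
    assert(max >= 0);
    for (int bit = 0; bit < 64 && n < max; bit++) {
        if (mask & (UINT64_C(1) << bit)) {
            out[n].lcore = bit;
            out[n].role = (n % 2 == 0) ? ROLE_NET_POLL : ROLE_NVME_POLL;
            n++;
        }
    }
    return n;
}
```

For the mask 0x3C this yields lcores 2 through 5, with 2 and 4 polling the network and 3 and 5 polling NVMe. A real deployment would also weigh NUMA locality of the NIC and the drives, which this sketch ignores.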

9. Source Pointers

Area | Paths and functions | Why it matters
Reactor | lib/event/reactor.c, spdk_poller_register | Main event loop, pollers, and lcore dispatch.
Thread messages | lib/thread/thread.c, spdk_thread_send_msg | Cross-reactor work transfer without shared-state locks.
NVMe driver | lib/nvme/, spdk_nvme_qpair_process_completions | SQ/CQ submission, doorbells, probe, and completions.
BDEV | lib/bdev/, spdk_bdev_get_io_channel | Module stacking and per-thread channels.
NVMe-oF | lib/nvmf/, nvmf_tgt | Subsystems, transports, capsules, and target pollers.
BlobStore | lib/blob/, spdk_blob_sync_md | Cluster allocation and crash-consistent metadata.

10. Interview Prep

Question | Concise answer
What costs does SPDK remove from the kernel path? | Syscalls on the hot path, bio/request allocation, block scheduling, interrupts, wakeups, and shared queue lock contention.
What happens if a reactor callback blocks? | Every device, socket, timer, and message owned by that reactor stops making progress until the callback returns.
How is an NVMe command submitted? | Fill an SQ entry, write the SQ tail doorbell, poll the CQ phase tag, consume the CQE, and update the CQ head doorbell.
Why does BDEV need I/O channels? | They provide per-thread backend state, such as qpairs, so stacked modules avoid central locks.
How does NVMe-oF RDMA serve a read? | The initiator sends a command capsule, the target reads local storage, RDMA WRITEs data into the initiator buffer, then sends a response capsule.
How does vhost-user improve VM storage? | SPDK polls guest virtqueues in shared memory, bypassing QEMU emulation on the data path.