32. SPDK Kernel-Bypass Storage
Reactor threads, userspace NVMe queues, BDEV composition, NVMe-oF targets, BlobStore, vhost-user, and the DPDK shared runtime.
1. Why SPDK Exists
Modern NVMe media can complete small I/O in single-digit or low-double-digit microseconds. At that scale, syscall crossings, per-I/O allocations, request-queue scheduling, interrupts, and wakeups become part of the latency budget. SPDK moves the NVMe driver into userspace, owns the PCI device through vfio-pci, maps BAR registers and DMA queues, and polls completions instead of sleeping.
Minimal C Demo - Latency Budget
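The budget argument above can be made concrete with a small calculator. This is only a sketch: the struct and function names are hypothetical, and any costs passed in are placeholders supplied by the caller, not measurements of a real kernel or device.

```c
#include <stddef.h>

/* One software cost on the per-I/O path, in nanoseconds. */
struct budget_item {
    const char *name;      /* e.g. "syscall", "interrupt", "wakeup" */
    unsigned    cost_ns;   /* caller-supplied placeholder, not a measurement */
};

/* Sum every software overhead item in the budget. */
unsigned budget_overhead_ns(const struct budget_item *items, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += items[i].cost_ns;
    return total;
}

/* Share of end-to-end latency spent in software rather than on the media,
 * as a whole percentage rounded down. */
unsigned budget_overhead_pct(unsigned media_ns, unsigned overhead_ns)
{
    unsigned total = media_ns + overhead_ns;
    return total == 0 ? 0 : (100u * overhead_ns) / total;
}
```

Feeding in a media time and a handful of assumed per-I/O costs shows how quickly software overhead becomes a large share of the total once the media itself completes in microseconds.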
2. Reactor Model
An SPDK reactor is an OS thread pinned to an lcore. It repeatedly polls NVMe completions, network transports, timers, BlobStore work, and its message queue. Because each hot resource is owned by one reactor, the common path avoids locks; cross-core coordination uses spdk_thread_send_msg().
Message passing is the escape hatch when reactor A needs reactor B to mutate B-owned state. The sender enqueues work and immediately returns; the receiver runs the function in its own loop, where the invariants are local again.
Minimal C Demo - Reactor Iteration
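A single reactor iteration can be sketched in plain C without linking SPDK. Everything here (`toy_reactor`, `toy_send_msg`, `toy_reactor_run_once`, and the two example callbacks) is a hypothetical stand-in; in particular the message ring below is single-threaded, whereas a real cross-core queue would need a lockless multi-producer ring.

```c
#include <stdbool.h>
#include <stddef.h>

#define MSG_RING_SIZE 8u   /* power of two, so index wrap is a mask */
#define MAX_POLLERS   4u

typedef void (*msg_fn)(void *ctx);
typedef int  (*poller_fn)(void *ctx);   /* returns number of events handled */

struct toy_msg { msg_fn fn; void *ctx; };

struct toy_reactor {
    struct toy_msg ring[MSG_RING_SIZE];
    unsigned       head, tail;          /* consumer and producer cursors */
    poller_fn      pollers[MAX_POLLERS];
    void          *poller_ctx[MAX_POLLERS];
    size_t         n_pollers;
};

bool toy_register_poller(struct toy_reactor *r, poller_fn fn, void *ctx)
{
    if (r->n_pollers == MAX_POLLERS)
        return false;
    r->pollers[r->n_pollers] = fn;
    r->poller_ctx[r->n_pollers] = ctx;
    r->n_pollers++;
    return true;
}

/* Enqueue work for the owning reactor and return immediately. */
bool toy_send_msg(struct toy_reactor *r, msg_fn fn, void *ctx)
{
    if (r->tail - r->head == MSG_RING_SIZE)
        return false;                   /* ring full */
    struct toy_msg m = { fn, ctx };
    r->ring[r->tail & (MSG_RING_SIZE - 1u)] = m;
    r->tail++;
    return true;
}

/* One loop iteration: drain queued messages, then run each poller once. */
int toy_reactor_run_once(struct toy_reactor *r)
{
    int work = 0;
    while (r->head != r->tail) {
        struct toy_msg m = r->ring[r->head & (MSG_RING_SIZE - 1u)];
        r->head++;
        m.fn(m.ctx);                    /* runs in the owner's context */
        work++;
    }
    for (size_t i = 0; i < r->n_pollers; i++)
        work += r->pollers[i](r->poller_ctx[i]);
    return work;
}

/* Example callbacks used to exercise the loop. */
void bump_counter(void *ctx) { (*(int *)ctx)++; }
int  idle_poller(void *ctx)  { (*(int *)ctx)++; return 0; }
```

The point of the sketch is the ownership rule: `toy_send_msg()` never touches the target state, it only queues a function, and the mutation happens later inside `toy_reactor_run_once()` on the owning loop.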
3. Userspace NVMe Driver
SPDK creates admin and I/O queue pairs directly in DMA memory. A command is just a ring entry plus an MMIO doorbell: fill the SQ slot, write the SQ tail doorbell, poll the CQ phase tag, consume the CQE, and write the CQ head doorbell.
The phase tag is what lets a busy-loop distinguish a newly written completion from a stale ring entry after wraparound. No I/O interrupt is needed on the hot path.
| Object | Role | Interview detail |
|---|---|---|
| Admin queue pair | Identify, feature setup, queue creation. | Used for controller management, not bulk data I/O. |
| I/O queue pair | Per-reactor SQ/CQ pair for reads and writes. | One qpair per thread avoids shared queue locks. |
| PRP/SGL | Physical region pages or scatter/gather list. | Describes DMA buffers without copying user data. |
| Doorbell | MMIO register in the NVMe BAR. | Host writes tails/heads to notify controller progress. |
Minimal C Demo - Queue Pair Walkthrough
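The submit/poll cycle and the role of the phase tag can be walked through with a simulated queue pair. This is a toy model, not the SPDK driver: the controller is an ordinary function, the doorbells are plain struct fields standing in for MMIO registers, and every name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define QDEPTH 4u

struct toy_cqe { uint16_t cid; uint16_t phase; };   /* phase is 0 or 1 */

struct toy_qpair {
    uint16_t       sq[QDEPTH];          /* SQ entries reduced to a command ID */
    struct toy_cqe cq[QDEPTH];
    uint16_t sq_tail, sq_tail_doorbell; /* host producer index and its doorbell */
    uint16_t cq_head, cq_head_doorbell; /* host consumer index and its doorbell */
    uint16_t host_phase;                /* phase the host expects next */
    uint16_t ctrl_sq_head, ctrl_cq_tail, ctrl_phase;  /* simulated controller */
};

void qpair_init(struct toy_qpair *q)
{
    memset(q, 0, sizeof(*q));           /* CQ starts with phase 0 everywhere */
    q->host_phase = 1;                  /* so the first pass is written as 1 */
    q->ctrl_phase = 1;
}

/* Fill an SQ slot and ring the SQ tail doorbell. The toy peeks at the
 * controller's SQ head directly; a real host learns it from completions. */
bool host_submit(struct toy_qpair *q, uint16_t cid)
{
    uint16_t next = (uint16_t)((q->sq_tail + 1u) % QDEPTH);
    if (next == q->ctrl_sq_head)
        return false;                   /* SQ full */
    q->sq[q->sq_tail] = cid;
    q->sq_tail = next;
    q->sq_tail_doorbell = next;
    return true;
}

/* Simulated controller: consume SQ entries and post CQEs with its phase. */
void ctrl_process(struct toy_qpair *q)
{
    while (q->ctrl_sq_head != q->sq_tail_doorbell) {
        uint16_t cid = q->sq[q->ctrl_sq_head];
        q->ctrl_sq_head = (uint16_t)((q->ctrl_sq_head + 1u) % QDEPTH);
        q->cq[q->ctrl_cq_tail].cid = cid;
        q->cq[q->ctrl_cq_tail].phase = q->ctrl_phase;
        q->ctrl_cq_tail = (uint16_t)((q->ctrl_cq_tail + 1u) % QDEPTH);
        if (q->ctrl_cq_tail == 0)
            q->ctrl_phase ^= 1u;        /* flip on every wrap */
    }
}

/* Poll one completion: a CQE is new only if its phase matches host_phase. */
bool host_poll(struct toy_qpair *q, uint16_t *cid)
{
    const struct toy_cqe *cqe = &q->cq[q->cq_head];
    if (cqe->phase != q->host_phase)
        return false;                   /* stale entry from an earlier pass */
    *cid = cqe->cid;
    q->cq_head = (uint16_t)((q->cq_head + 1u) % QDEPTH);
    if (q->cq_head == 0)
        q->host_phase ^= 1u;
    q->cq_head_doorbell = q->cq_head;   /* tell the controller the slot is free */
    return true;
}
```

After the ring wraps, the slot at the new head still holds a CQE from the previous pass; because its phase no longer matches `host_phase`, `host_poll()` reports nothing new instead of replaying it.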
4. BDEV Layer
BDEV is SPDK's block-device abstraction. Applications call spdk_bdev_read(), spdk_bdev_write(), spdk_bdev_flush(), or spdk_bdev_unmap(); modules translate those requests to NVMe, malloc, AIO, uring, RBD, RAID, crypto, compression, logical volumes, or QoS.
The important scaling primitive is the I/O channel: each thread gets a per-BDEV channel that maps to underlying per-thread state, such as an NVMe qpair. That is why stacked modules can stay asynchronous without central locks.
Minimal C Demo - BDEV Stack Builder
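Module stacking can be illustrated with a two-level toy stack: a RAM-backed device at the bottom and a virtual device that exposes a slice of it. The types and functions below (`toy_bdev`, `ram_read`, `slice_read`, `record_status`) are hypothetical and do not mirror the real SPDK structures; the toy also completes I/O inline, whereas real modules complete from a poller.

```c
#include <stdint.h>
#include <string.h>

typedef void (*toy_io_cb)(int status, void *cb_arg);

struct toy_bdev {
    const char      *name;
    struct toy_bdev *base;      /* next device down the stack, or NULL */
    uint8_t         *mem;       /* backing memory (bottom device only) */
    uint64_t         size;      /* bytes exposed by this device */
    uint64_t         base_off;  /* where this device starts inside base */
    int (*read)(struct toy_bdev *b, void *buf, uint64_t off, uint64_t len,
                toy_io_cb cb, void *cb_arg);
};

/* Bottom of the stack: a RAM-backed device. */
int ram_read(struct toy_bdev *b, void *buf, uint64_t off, uint64_t len,
             toy_io_cb cb, void *cb_arg)
{
    if (off > b->size || len > b->size - off)
        return -1;                      /* rejected at submission time */
    memcpy(buf, b->mem + off, len);
    cb(0, cb_arg);                      /* toy completes inline */
    return 0;
}

/* Virtual device: bounds-check, translate the offset, forward downward. */
int slice_read(struct toy_bdev *b, void *buf, uint64_t off, uint64_t len,
               toy_io_cb cb, void *cb_arg)
{
    if (off > b->size || len > b->size - off)
        return -1;
    return b->base->read(b->base, buf, b->base_off + off, len, cb, cb_arg);
}

/* Example completion callback: records the status for the caller. */
void record_status(int status, void *cb_arg) { *(int *)cb_arg = status; }
```

The caller only ever sees the top device's `read` pointer; each layer transforms the request and hands the same completion callback to the layer below, which is the shape that lets a stack stay asynchronous end to end.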
5. NVMe-oF Target
nvmf_tgt exposes BDEVs as NVMe namespaces over TCP, RDMA, or Fibre Channel. A subsystem has an NQN, host access rules, namespaces, and listeners; each listener binds a transport address such as TCP port 4420 or an RDMA address.
RDMA transport keeps command and response capsules on SEND/RECV and moves data with one-sided verbs. For a host read, the target RDMA WRITEs data into the initiator buffer; for a host write, the target RDMA READs the data from the initiator buffer unless the payload already arrived as in-capsule data.
Minimal C Demo - NVMe-oF Path
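The order of operations on the RDMA transport can be written down as a small planner that emits the target-side steps for one command. The enum and `nvmf_rdma_plan()` are hypothetical names for this sketch, and the `in_capsule_data` flag is an assumption used to model the case where a write payload travels with the command capsule.

```c
#include <stdbool.h>
#include <stddef.h>

enum nvmf_step {
    STEP_RECV_COMMAND_CAPSULE,    /* two-sided: initiator SEND, target RECV */
    STEP_RDMA_READ_HOST_DATA,     /* one-sided: target pulls write payload */
    STEP_BDEV_IO,                 /* local read or write through the BDEV */
    STEP_RDMA_WRITE_HOST_BUFFER,  /* one-sided: target pushes read payload */
    STEP_SEND_RESPONSE_CAPSULE    /* two-sided: target SEND, initiator RECV */
};

/* Fill `out` with the target-side steps for one command.
 * Returns the number of steps, or 0 if `cap` is too small. */
size_t nvmf_rdma_plan(bool host_write, bool in_capsule_data,
                      enum nvmf_step *out, size_t cap)
{
    enum nvmf_step plan[4];
    size_t n = 0;

    plan[n++] = STEP_RECV_COMMAND_CAPSULE;
    if (host_write && !in_capsule_data)
        plan[n++] = STEP_RDMA_READ_HOST_DATA;
    plan[n++] = STEP_BDEV_IO;
    if (!host_write)
        plan[n++] = STEP_RDMA_WRITE_HOST_BUFFER;
    plan[n++] = STEP_SEND_RESPONSE_CAPSULE;

    if (n > cap)
        return 0;
    for (size_t i = 0; i < n; i++)
        out[i] = plan[i];
    return n;
}
```

Reading the plan for a host read against the plan for a host write makes the asymmetry visible: the data always moves with a one-sided verb issued by the target, before the local I/O for writes and after it for reads.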
6. BlobStore and BlobFS
BlobStore is an object store on top of one BDEV. It allocates large clusters, records blob-to-cluster mappings in metadata pages, and commits metadata atomically with spdk_blob_sync_md(). BlobFS then adds a POSIX-like file namespace where each file is backed by a blob.
RocksDB-on-SPDK uses BlobFS to bypass VFS and page cache for SST files. The gain is not magic filesystem semantics; it is the same polling, DMA-buffer, callback-driven path used by the rest of SPDK.
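The blob-to-cluster mapping can be sketched as a toy allocator plus an offset translation. All names and sizes here are assumptions for illustration (8 clusters of 1 MiB, at most 4 clusters per blob), and `toy_blob_sync_md()` only models the idea that a mapping becomes durable at sync time; it does not reproduce the real on-disk metadata format.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_CLUSTERS     8u
#define TOY_CLUSTER_SIZE (1024u * 1024u)   /* assumed 1 MiB for the sketch */
#define TOY_MAX_PER_BLOB 4u

struct toy_blobstore { bool used[TOY_CLUSTERS]; };

struct toy_blob {
    uint32_t clusters[TOY_MAX_PER_BLOB];   /* blob cluster i -> device cluster */
    uint32_t n_clusters;                   /* in-memory mapping length */
    uint32_t n_synced;                     /* mapping length as of last sync */
};

/* Grow a blob by `count` clusters, taking the lowest free ones.
 * All-or-nothing: nothing changes if the request cannot be satisfied. */
bool toy_blob_grow(struct toy_blobstore *bs, struct toy_blob *b, uint32_t count)
{
    uint32_t free_count = 0;
    for (uint32_t i = 0; i < TOY_CLUSTERS; i++)
        if (!bs->used[i])
            free_count++;
    if (count > free_count || count > TOY_MAX_PER_BLOB - b->n_clusters)
        return false;
    for (uint32_t i = 0; i < TOY_CLUSTERS && count > 0; i++) {
        if (!bs->used[i]) {
            bs->used[i] = true;
            b->clusters[b->n_clusters++] = i;
            count--;
        }
    }
    return true;
}

/* Model of a metadata sync: the current mapping becomes the durable one. */
void toy_blob_sync_md(struct toy_blob *b) { b->n_synced = b->n_clusters; }

/* Translate a byte offset inside the blob to a byte offset on the device.
 * Returns -1 if the offset lies beyond the blob's allocated clusters. */
int64_t toy_blob_to_dev(const struct toy_blob *b, uint64_t off)
{
    uint64_t idx = off / TOY_CLUSTER_SIZE;
    if (idx >= b->n_clusters)
        return -1;
    return (int64_t)((uint64_t)b->clusters[idx] * TOY_CLUSTER_SIZE
                     + off % TOY_CLUSTER_SIZE);
}
```

When two blobs grow in turn, a blob's clusters end up non-contiguous on the device, which is why every I/O goes through the mapping rather than a simple base-plus-offset calculation.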
7. vhost-user-blk and vhost-user-scsi
With vhost-user, QEMU negotiates virtio features over a Unix socket, then SPDK polls guest virtqueues directly in shared hugepage memory. The VM still sees a normal virtio-blk or virtio-scsi device, but QEMU is no longer on the hot data path.
| Mode | Guest view | Operational point |
|---|---|---|
| vhost-user-blk | Single virtio block device. | Simple, fast path for one disk-like namespace. |
| vhost-user-scsi | virtio-scsi controller with LUNs. | Better for hot-add/remove and multiple logical devices. |
| QEMU emulation | Same virtual device shape. | QEMU thread becomes the bottleneck at high IOPS. |
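Polling a guest virtqueue comes down to comparing a private cursor with the index the guest driver publishes. The sketch below follows the general shape of a split virtqueue (an available ring written by the guest and a used ring written by the backend) but is heavily simplified: struct and function names are hypothetical, descriptors are reduced to a head index, and the memory barriers a real shared-memory implementation needs are omitted.

```c
#include <stdint.h>

#define VQ_SIZE 8u   /* ring size; a power of two */

struct toy_avail { uint16_t idx; uint16_t ring[VQ_SIZE]; };
struct toy_used_elem { uint32_t id; uint32_t len; };
struct toy_used { uint16_t idx; struct toy_used_elem ring[VQ_SIZE]; };

struct toy_vq {
    struct toy_avail avail;      /* written by the guest driver */
    struct toy_used  used;       /* written by the backend */
    uint16_t last_avail_idx;     /* backend-private cursor into avail */
};

/* Guest side: publish one descriptor chain head. */
void guest_post(struct toy_vq *vq, uint16_t desc_head)
{
    vq->avail.ring[vq->avail.idx % VQ_SIZE] = desc_head;
    vq->avail.idx++;             /* a real driver orders this after the store */
}

/* Backend side: consume everything new and report it as used.
 * Returns the number of requests processed in this poll. */
int backend_poll(struct toy_vq *vq, uint32_t len_done)
{
    int n = 0;
    while (vq->last_avail_idx != vq->avail.idx) {
        uint16_t head = vq->avail.ring[vq->last_avail_idx % VQ_SIZE];
        vq->last_avail_idx++;
        struct toy_used_elem e = { head, len_done };
        vq->used.ring[vq->used.idx % VQ_SIZE] = e;
        vq->used.idx++;
        n++;
    }
    return n;
}
```

No notification is required for the backend to find work: a poll that sees `last_avail_idx` equal to `avail.idx` simply returns zero and the reactor moves on.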
8. DPDK Integration and Deployments
SPDK relies on DPDK EAL for hugepage-backed DMA memory, lcore management, mempools, and rings. That shared model lets a storage target place network polling and NVMe polling on adjacent lcores with buffers from the same hugepage pool.
In a typical disaggregated deployment, compute nodes use NVMe-oF initiators while storage servers run SPDK targets over RoCE or TCP. BDEV modules add policy such as crypto, QoS, RAID, or logical volumes before requests reach the local NVMe drives.
Minimal C Demo - Lcore Allocation
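Lcore placement can be sketched as expanding a core mask into a list of lcores with roles. The alternating network/NVMe policy below is a hypothetical choice made only to show pollers landing on adjacent lcores; the type and function names are likewise assumptions, not DPDK or SPDK API.

```c
#include <stddef.h>
#include <stdint.h>

enum lcore_role { ROLE_NET, ROLE_NVME };

struct lcore_assign {
    unsigned        lcore;   /* CPU index taken from the mask */
    enum lcore_role role;    /* which kind of poller runs there */
};

/* Walk the set bits of `core_mask` from low to high and alternate roles,
 * so each network poller sits next to an NVMe poller.
 * Writes at most `cap` entries and returns how many were written. */
size_t assign_lcores(uint64_t core_mask, struct lcore_assign *out, size_t cap)
{
    size_t n = 0;
    for (unsigned core = 0; core < 64 && n < cap; core++) {
        if (core_mask & (UINT64_C(1) << core)) {
            out[n].lcore = core;
            out[n].role = (n % 2 == 0) ? ROLE_NET : ROLE_NVME;
            n++;
        }
    }
    return n;
}
```

With a mask of `0x3C`, for example, the function selects cores 2 through 5 and assigns them network, NVMe, network, NVMe in that order.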
9. Source Pointers
| Area | Paths and functions | Why it matters |
|---|---|---|
| Reactor | lib/event/reactor.c, spdk_poller_register | Main event loop, pollers, and lcore dispatch. |
| Thread messages | lib/thread/thread.c, spdk_thread_send_msg | Cross-reactor work transfer without shared-state locks. |
| NVMe driver | lib/nvme/, spdk_nvme_qpair_process_completions | SQ/CQ submission, doorbells, probe, and completions. |
| BDEV | lib/bdev/, spdk_bdev_get_io_channel | Module stacking and per-thread channels. |
| NVMe-oF | lib/nvmf/, nvmf_tgt | Subsystems, transports, capsules, and target pollers. |
| BlobStore | lib/blob/, spdk_blob_sync_md | Cluster allocation and crash-consistent metadata. |
10. Interview Prep
Questions and concise answers
| Question | Concise answer |
|---|---|
| What costs does SPDK remove from the kernel path? | Syscalls on the hot path, bio/request allocation, block scheduling, interrupts, wakeups, and shared queue lock contention. |
| What happens if a reactor callback blocks? | Every device, socket, timer, and message owned by that reactor stops making progress until the callback returns. |
| How is an NVMe command submitted? | Fill an SQ entry, write the SQ tail doorbell, poll the CQ phase tag, consume the CQE, and update the CQ head doorbell. |
| Why does BDEV need I/O channels? | They provide per-thread backend state, such as qpairs, so stacked modules avoid central locks. |
| How does NVMe-oF RDMA serve a read? | The initiator sends a command capsule, the target reads local storage, RDMA WRITEs data into the initiator buffer, then sends a response capsule. |
| How does vhost-user improve VM storage? | SPDK polls guest virtqueues in shared memory, bypassing QEMU emulation on the data path. |