Tech Notes

1. Why SPDK Exists

Modern NVMe media can complete small I/O in single-digit or low-double-digit microseconds. At that scale, syscall crossings, per-I/O allocations, request-queue scheduling, interrupts, and wakeups become part of the latency budget. SPDK moves the NVMe driver into userspace, owns the PCI device through vfio-pci, maps BAR registers and DMA queues, and polls completions instead of sleeping.

Minimal C Demo - Latency Budget
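
The budget arithmetic from this section can be sketched as a sum of per-stage costs compared against the media latency. The stage names follow the text; every nanosecond value is a placeholder assumption chosen only to make the arithmetic visible, not a measurement, and `struct stage`, `budget_total_ns`, `overhead_pct`, and `print_budget` are hypothetical names for this sketch.

```c
#include <stddef.h>
#include <stdio.h>

/* One software stage on the I/O path and its assumed cost. */
struct stage {
    const char *name;
    unsigned cost_ns;   /* placeholder, NOT a measurement */
};

/* Interrupt-driven kernel path: the stages named in the text. */
const struct stage kernel_path[] = {
    {"syscall crossing",          500},
    {"per-I/O allocation",        300},
    {"request-queue scheduling",  400},
    {"interrupt",                1000},
    {"wakeup",                   1500},
};
const size_t kernel_stages = sizeof(kernel_path) / sizeof(kernel_path[0]);

/* Polled userspace path: submit, then poll for the completion. */
const struct stage polled_path[] = {
    {"SQ entry + doorbell write", 100},
    {"CQ poll + CQE handling",    200},
};
const size_t polled_stages = sizeof(polled_path) / sizeof(polled_path[0]);

/* Sum the software cost of every stage on a path. */
unsigned budget_total_ns(const struct stage *s, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += s[i].cost_ns;
    return total;
}

/* Software overhead as an integer percentage of end-to-end latency. */
unsigned overhead_pct(unsigned sw_ns, unsigned media_ns)
{
    if (sw_ns + media_ns == 0)
        return 0;
    return (100u * sw_ns) / (sw_ns + media_ns);
}

/* Print one line summarising a path against a given media latency. */
void print_budget(const char *label, const struct stage *s, size_t n,
                  unsigned media_ns)
{
    unsigned sw = budget_total_ns(s, n);
    printf("%s: software %u ns + media %u ns, overhead %u%%\n",
           label, sw, media_ns, overhead_pct(sw, media_ns));
}
```

Calling `print_budget("kernel", kernel_path, kernel_stages, 10000)` and the same for `polled_path` prints the two totals side by side. The placeholders are meant to be replaced with values measured on the target hardware.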

2. Reactor Model

An SPDK reactor is an OS thread pinned to an lcore. It repeatedly polls NVMe completions, network transports, timers, BlobStore work, and its message queue. Because each hot resource is owned by one reactor, the common path avoids locks; cross-core coordination uses spdk_thread_send_msg().

Message passing is the escape hatch when reactor A needs reactor B to mutate B-owned state. The sender enqueues work and immediately returns; the receiver runs the function in its own loop, where the invariants are local again.

Minimal C Demo - Reactor Iteration
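
One reactor iteration can be sketched as "drain the message queue, then run every poller once". This is a single-threaded simulation, not SPDK code: `struct reactor`, `reactor_send_msg`, and `reactor_run_once` are hypothetical names, and a real cross-thread ring would need atomic operations that this sketch leaves out.

```c
#include <stddef.h>

#define MAX_POLLERS 8
#define MSG_RING    16   /* power of two so free-running indices wrap cleanly */

typedef int  (*poller_fn)(void *ctx);  /* returns number of events handled */
typedef void (*msg_fn)(void *ctx);

struct msg { msg_fn fn; void *ctx; };

struct reactor {
    struct { poller_fn fn; void *ctx; } pollers[MAX_POLLERS];
    size_t npollers;
    struct msg ring[MSG_RING];
    unsigned head, tail;   /* head: consumer index, tail: producer index */
};

int reactor_add_poller(struct reactor *r, poller_fn fn, void *ctx)
{
    if (r->npollers == MAX_POLLERS)
        return -1;
    r->pollers[r->npollers].fn = fn;
    r->pollers[r->npollers].ctx = ctx;
    r->npollers++;
    return 0;
}

/* Sender side: enqueue and return immediately; fn is NOT run here. */
int reactor_send_msg(struct reactor *r, msg_fn fn, void *ctx)
{
    if (r->tail - r->head == MSG_RING)
        return -1;                       /* ring full */
    r->ring[r->tail % MSG_RING] = (struct msg){fn, ctx};
    r->tail++;
    return 0;
}

/* One loop iteration: drain messages, then run every poller once. */
int reactor_run_once(struct reactor *r)
{
    int work = 0;
    while (r->head != r->tail) {
        struct msg m = r->ring[r->head % MSG_RING];
        r->head++;
        m.fn(m.ctx);                     /* runs on the owning reactor */
        work++;
    }
    for (size_t i = 0; i < r->npollers; i++)
        work += r->pollers[i].fn(r->pollers[i].ctx);
    return work;
}

/* Example reactor-owned state plus a message handler and a poller. */
struct counter { int value; int pending_completions; };

void bump(void *ctx) { ((struct counter *)ctx)->value++; }

int poll_completions(void *ctx)
{
    struct counter *c = ctx;
    int n = c->pending_completions;
    c->pending_completions = 0;
    return n;
}
```

The sender never touches `struct counter` itself. It only enqueues `bump`, and the mutation happens inside `reactor_run_once` on the owning loop, which is the property the message-passing paragraph describes.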

3. Userspace NVMe Driver

SPDK creates admin and I/O queue pairs directly in DMA memory. A command is just a ring entry plus an MMIO doorbell: fill the SQ slot, write the SQ tail doorbell, poll the CQ phase tag, consume the CQE, and write the CQ head doorbell.

The phase tag is what lets a busy-loop distinguish a newly written completion from a stale ring entry after wraparound. No I/O interrupt is needed on the hot path.

Object	Role	Interview detail
Admin queue pair	Identify, feature setup, queue creation.	Used for controller management, not bulk data I/O.
I/O queue pair	Per-reactor SQ/CQ pair for reads and writes.	One qpair per thread avoids shared queue locks.
PRP/SGL	Physical region pages or scatter/gather list.	Describes DMA buffers without copying user data.
Doorbell	MMIO register in the NVMe BAR.	Host writes tails/heads to notify controller progress.

Minimal C Demo - Queue Pair Walkthrough
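
The submit and poll sequence, including the phase-tag flip on wraparound, can be simulated in plain C. This is a toy model under stated assumptions: the doorbells are ordinary struct fields rather than MMIO registers, the controller is a function call, the entry layouts are reduced to a few fields, and all names (`struct qpair`, `host_submit`, `controller_run`, `host_poll`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define QDEPTH 4

struct sqe { uint16_t cid; uint8_t opcode; };
struct cqe { uint16_t sq_head; uint16_t cid; uint16_t status; }; /* bit 0 = phase */

struct qpair {
    struct sqe sq[QDEPTH];
    struct cqe cq[QDEPTH];
    uint16_t sq_tail, sq_head, cq_head;   /* host-side view */
    uint8_t  host_phase;                  /* phase value that marks a new CQE */
    uint16_t sq_tail_db, cq_head_db;      /* simulated doorbell registers */
    uint16_t ctrl_sq_head, ctrl_cq_tail;  /* simulated controller state */
    uint8_t  ctrl_phase;
};

void qpair_init(struct qpair *q)
{
    *q = (struct qpair){0};   /* CQ phase bits start at 0 */
    q->host_phase = 1;        /* so the first pass expects phase 1 */
    q->ctrl_phase = 1;
}

/* Host: fill the SQ slot, then write the SQ tail doorbell. */
bool host_submit(struct qpair *q, uint16_t cid, uint8_t opcode)
{
    uint16_t next = (uint16_t)((q->sq_tail + 1) % QDEPTH);
    if (next == q->sq_head)
        return false;                     /* SQ full */
    q->sq[q->sq_tail] = (struct sqe){cid, opcode};
    q->sq_tail = next;
    q->sq_tail_db = q->sq_tail;
    return true;
}

/* Simulated controller: consume SQ entries up to the doorbell, post CQEs. */
int controller_run(struct qpair *q)
{
    int done = 0;
    while (q->ctrl_sq_head != q->sq_tail_db) {
        uint16_t next_cq = (uint16_t)((q->ctrl_cq_tail + 1) % QDEPTH);
        if (next_cq == q->cq_head_db)
            break;                        /* CQ full, wait for the host */
        struct sqe cmd = q->sq[q->ctrl_sq_head];
        q->ctrl_sq_head = (uint16_t)((q->ctrl_sq_head + 1) % QDEPTH);
        q->cq[q->ctrl_cq_tail] =
            (struct cqe){q->ctrl_sq_head, cmd.cid, q->ctrl_phase};
        q->ctrl_cq_tail = next_cq;
        if (next_cq == 0)
            q->ctrl_phase ^= 1;           /* wrapped: flip the phase */
        done++;
    }
    return done;
}

/* Host: poll the phase tag, consume CQEs, write the CQ head doorbell. */
int host_poll(struct qpair *q, uint16_t *cids, int max)
{
    int n = 0;
    while (n < max && (q->cq[q->cq_head].status & 1) == q->host_phase) {
        cids[n++] = q->cq[q->cq_head].cid;
        q->sq_head = q->cq[q->cq_head].sq_head;
        q->cq_head = (uint16_t)((q->cq_head + 1) % QDEPTH);
        if (q->cq_head == 0)
            q->host_phase ^= 1;           /* expect the flipped phase next pass */
    }
    if (n > 0)
        q->cq_head_db = q->cq_head;
    return n;
}
```

After the ring wraps, slots left over from the previous pass still carry the old phase value, so `host_poll` stops at them without needing a separate valid flag or an interrupt.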

4. BDEV Layer

BDEV is SPDK's block-device abstraction. Applications call spdk_bdev_read(), spdk_bdev_write(), spdk_bdev_flush(), or spdk_bdev_unmap(); modules translate those requests to NVMe, malloc, AIO, uring, RBD, RAID, crypto, compression, logical volumes, or QoS.

The important scaling primitive is the I/O channel: each thread gets a per-BDEV channel that maps to underlying per-thread state, such as an NVMe qpair. That is why stacked modules can stay asynchronous without central locks.

Minimal C Demo - BDEV Stack Builder
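
Module stacking with callback-driven completion can be sketched with a function-pointer table per device. This is a toy model, not the SPDK bdev API: `struct bdev`, `ram_submit`, and `xor_submit` are hypothetical, the XOR step only stands in for a transform such as crypto or compression, per-thread I/O channels are omitted, and the lower layer completes inline so the stacked module can keep its temporary state on the stack.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16
#define NUM_BLOCKS 8

enum io_type { IO_READ, IO_WRITE };
typedef void (*io_done_fn)(void *cb_arg, int status);

struct bdev_io {
    enum io_type type;
    uint64_t lba;
    uint8_t *buf;        /* exactly one block */
    io_done_fn done;     /* completion callback */
    void *cb_arg;
};

struct bdev {
    const char *name;
    void (*submit)(struct bdev *self, struct bdev_io *io);
    struct bdev *base;   /* next layer down, NULL for a leaf */
    void *ctx;
};

/* Leaf module: RAM-backed device, similar in spirit to a malloc bdev. */
struct ram_ctx { uint8_t blocks[NUM_BLOCKS][BLOCK_SIZE]; };

void ram_submit(struct bdev *self, struct bdev_io *io)
{
    struct ram_ctx *ram = self->ctx;
    if (io->lba >= NUM_BLOCKS) {
        io->done(io->cb_arg, -1);
        return;
    }
    if (io->type == IO_WRITE)
        memcpy(ram->blocks[io->lba], io->buf, BLOCK_SIZE);
    else
        memcpy(io->buf, ram->blocks[io->lba], BLOCK_SIZE);
    io->done(io->cb_arg, 0);
}

/* Stacked module: XOR transform layered on top of a base device. */
struct xor_read { struct bdev_io *parent; uint8_t key; };

static void xor_block(uint8_t *buf, uint8_t key)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        buf[i] ^= key;
}

/* Runs when the base read completes: undo the transform, then notify. */
static void xor_read_done(void *cb_arg, int status)
{
    struct xor_read *r = cb_arg;
    if (status == 0)
        xor_block(r->parent->buf, r->key);
    r->parent->done(r->parent->cb_arg, status);
}

void xor_submit(struct bdev *self, struct bdev_io *io)
{
    uint8_t key = *(uint8_t *)self->ctx;
    struct bdev_io lower = *io;
    uint8_t tmp[BLOCK_SIZE];
    struct xor_read rd = { io, key };

    if (io->type == IO_WRITE) {
        memcpy(tmp, io->buf, BLOCK_SIZE);   /* leave the caller's buffer intact */
        xor_block(tmp, key);
        lower.buf = tmp;
    } else {
        lower.done = xor_read_done;         /* intercept the completion */
        lower.cb_arg = &rd;
    }
    self->base->submit(self->base, &lower); /* valid only for inline completion */
}

/* Example completion callback: store the status where the caller can see it. */
void record_status(void *cb_arg, int status) { *(int *)cb_arg = status; }
```

A stack is built by pointing `base` at the lower device, for example `struct bdev top = { "xor0", xor_submit, &base, &key };`. With truly asynchronous completion, the temporary buffer and the read context would have to live in per-I/O memory instead of on the stack.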

5. NVMe-oF Target

nvmf_tgt exposes BDEVs as NVMe namespaces over TCP, RDMA, or Fibre Channel. A subsystem has an NQN, host access rules, namespaces, and listeners; each listener binds a transport address such as TCP port 4420 or an RDMA address.

The RDMA transport carries command and response capsules with SEND/RECV and moves data with one-sided verbs. For a host read, the target RDMA WRITEs data into the initiator buffer; for a host write, the target RDMA READs the data from the initiator buffer unless the data arrived in-capsule with the command.

Minimal C Demo - NVMe-oF Path
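
The order of operations on the RDMA path can be written down as a small step generator. It only models the sequence described in this section; the `enum step` values and `rdma_io_steps` are hypothetical names, and the in-capsule case assumes the write data travelled with the command capsule.

```c
#include <stddef.h>

#define MAX_STEPS 4

enum step {
    STEP_CMD_CAPSULE,   /* initiator -> target, carried by SEND/RECV */
    STEP_RDMA_READ,     /* target pulls write data from the initiator */
    STEP_BDEV_READ,     /* target reads local storage */
    STEP_BDEV_WRITE,    /* target writes local storage */
    STEP_RDMA_WRITE,    /* target pushes read data into the initiator buffer */
    STEP_RSP_CAPSULE    /* target -> initiator, carried by SEND/RECV */
};

const char *step_name(enum step s)
{
    switch (s) {
    case STEP_CMD_CAPSULE: return "command capsule (SEND)";
    case STEP_RDMA_READ:   return "RDMA READ from initiator";
    case STEP_BDEV_READ:   return "local bdev read";
    case STEP_BDEV_WRITE:  return "local bdev write";
    case STEP_RDMA_WRITE:  return "RDMA WRITE to initiator";
    case STEP_RSP_CAPSULE: return "response capsule (SEND)";
    }
    return "unknown";
}

/*
 * Fill out[] with the steps for one I/O and return how many were written.
 * is_write:        0 for a host read, non-zero for a host write
 * in_capsule_data: non-zero if write data arrived inside the command capsule
 */
size_t rdma_io_steps(int is_write, int in_capsule_data, enum step out[MAX_STEPS])
{
    size_t n = 0;
    out[n++] = STEP_CMD_CAPSULE;
    if (is_write) {
        if (!in_capsule_data)
            out[n++] = STEP_RDMA_READ;   /* fetch the payload first */
        out[n++] = STEP_BDEV_WRITE;
    } else {
        out[n++] = STEP_BDEV_READ;
        out[n++] = STEP_RDMA_WRITE;      /* deliver data before the response */
    }
    out[n++] = STEP_RSP_CAPSULE;
    return n;
}
```

Looping over the returned steps and printing `step_name()` for each gives a readable trace of a read or a write.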

6. BlobStore and BlobFS

BlobStore is an object store on top of one BDEV. It allocates large clusters, records blob-to-cluster mappings in metadata pages, and commits metadata atomically with spdk_blob_sync_md(). BlobFS then adds a POSIX-like file namespace where each file is backed by a blob.
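
The blob-to-cluster mapping can be illustrated with a first-fit allocator and an offset translation. This is a sketch under stated assumptions, not the on-disk format: the cluster size and cluster count are arbitrary constants, the names `struct blobstore`, `blob_grow`, and `blob_offset_to_device` are hypothetical, and metadata persistence is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_SIZE   (1024u * 1024u)  /* assumed cluster size for the sketch */
#define TOTAL_CLUSTERS 16

struct blobstore { uint8_t used[TOTAL_CLUSTERS]; };

struct blob {
    uint32_t clusters[TOTAL_CLUSTERS];  /* clusters[i] = device cluster of blob cluster i */
    size_t   num_clusters;
};

/* Grow a blob by one cluster using first-fit; returns 0, or -1 if none is free. */
int blob_grow(struct blobstore *bs, struct blob *b)
{
    if (b->num_clusters == TOTAL_CLUSTERS)
        return -1;
    for (uint32_t c = 0; c < TOTAL_CLUSTERS; c++) {
        if (!bs->used[c]) {
            bs->used[c] = 1;
            b->clusters[b->num_clusters++] = c;
            return 0;
        }
    }
    return -1;
}

/* Translate a byte offset inside the blob to a byte offset on the device. */
int blob_offset_to_device(const struct blob *b, uint64_t blob_off,
                          uint64_t *dev_off)
{
    uint64_t idx = blob_off / CLUSTER_SIZE;
    if (idx >= b->num_clusters)
        return -1;                      /* past the end of the blob */
    *dev_off = (uint64_t)b->clusters[idx] * CLUSTER_SIZE
             + blob_off % CLUSTER_SIZE;
    return 0;
}
```

Because two blobs can grow in interleaved order, a blob's clusters need not be contiguous on the device, which is why the per-blob cluster list has to be recorded in metadata.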

RocksDB-on-SPDK uses BlobFS to bypass VFS and page cache for SST files. The gain is not magic filesystem semantics; it is the same polling, DMA-buffer, callback-driven path used by the rest of SPDK.

7. vhost-user-blk and vhost-user-scsi

With vhost-user, QEMU negotiates virtio features over a Unix socket, then SPDK polls guest virtqueues directly in shared hugepage memory. The VM still sees a normal virtio-blk or virtio-scsi device, but QEMU is no longer on the hot data path.
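
Polling a guest virtqueue amounts to comparing a backend-private index against the index the guest publishes. The sketch below is a simplified single-threaded model with hypothetical names (`struct virtq_sim`, `guest_post`, `backend_poll`): descriptor chains are reduced to a head index, used-ring entries omit the length field, and the memory barriers needed for real shared memory are left out.

```c
#include <stdint.h>

#define VQ_SIZE 8   /* power of two so 16-bit free-running indices wrap cleanly */

struct virtq_sim {
    uint16_t avail_idx;             /* written by the guest */
    uint16_t avail_ring[VQ_SIZE];   /* descriptor heads offered by the guest */
    uint16_t used_idx;              /* written by the backend */
    uint16_t used_ring[VQ_SIZE];    /* descriptor heads returned to the guest */
    uint16_t last_avail_idx;        /* backend-private progress marker */
};

/* Guest side: publish one request, then advance the available index. */
void guest_post(struct virtq_sim *vq, uint16_t desc_head)
{
    vq->avail_ring[vq->avail_idx % VQ_SIZE] = desc_head;
    vq->avail_idx++;
}

/* Backend side: consume everything the guest has published so far. */
int backend_poll(struct virtq_sim *vq)
{
    int handled = 0;
    while (vq->last_avail_idx != vq->avail_idx) {
        uint16_t head = vq->avail_ring[vq->last_avail_idx % VQ_SIZE];
        vq->last_avail_idx++;
        /* A real backend would walk the descriptor chain and submit I/O here. */
        vq->used_ring[vq->used_idx % VQ_SIZE] = head;
        vq->used_idx++;
        handled++;
    }
    return handled;
}
```

Nothing in `backend_poll` involves QEMU: once the rings are shared, the backend sees new requests just by reading `avail_idx`.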

Mode	Guest view	Operational point
vhost-user-blk	Single virtio block device.	Simple, fast path for one disk-like namespace.
vhost-user-scsi	virtio-scsi controller with LUNs.	Better for hot-add/remove and multiple logical devices.
QEMU emulation	Same virtual device shape.	QEMU thread becomes the bottleneck at high IOPS.

8. DPDK Integration and Deployments

SPDK relies on DPDK EAL for hugepage-backed DMA memory, lcore management, mempools, and rings. That shared model lets a storage target place network polling and NVMe polling on adjacent lcores with buffers from the same hugepage pool.

In a typical disaggregated deployment, compute nodes run NVMe-oF initiators while storage servers run SPDK targets over RoCE or TCP. BDEV modules add policy such as crypto, QoS, RAID, or logical volumes before requests reach the local NVMe drives.

Minimal C Demo - Lcore Allocation
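
A core mask can be expanded into a list of lcores and then split between network and NVMe polling. The alternating policy is only an illustrative assumption for placing the two poller types on adjacent lcores, and `mask_to_lcores` and `assign_roles` are hypothetical helpers rather than DPDK or SPDK functions.

```c
#include <stdint.h>

enum role { ROLE_NET_POLL, ROLE_NVME_POLL };

/* Expand a core mask such as 0x3C into lcore IDs; returns how many were found. */
int mask_to_lcores(uint64_t mask, int *lcores, int max)
{
    int n = 0;
    for (int bit = 0; bit < 64 && n < max; bit++) {
        if (mask & (UINT64_C(1) << bit))
            lcores[n++] = bit;
    }
    return n;
}

/* Alternate roles so each network poller sits next to an NVMe poller. */
void assign_roles(int n, enum role *roles)
{
    for (int i = 0; i < n; i++)
        roles[i] = (i % 2 == 0) ? ROLE_NET_POLL : ROLE_NVME_POLL;
}
```

For mask `0x3C` the helper yields lcores 2, 3, 4, and 5, and the alternating policy assigns network polling to 2 and 4 and NVMe polling to 3 and 5.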

9. Source Pointers

Area	Paths and functions	Why it matters
Reactor	`lib/event/reactor.c`, `spdk_poller_register`	Main event loop, pollers, and lcore dispatch.
Thread messages	`lib/thread/thread.c`, `spdk_thread_send_msg`	Cross-reactor work transfer without shared-state locks.
NVMe driver	`lib/nvme/`, `spdk_nvme_qpair_process_completions`	SQ/CQ submission, doorbells, probe, and completions.
BDEV	`lib/bdev/`, `spdk_bdev_get_io_channel`	Module stacking and per-thread channels.
NVMe-oF	`lib/nvmf/`, `nvmf_tgt`	Subsystems, transports, capsules, and target pollers.
BlobStore	`lib/blob/`, `spdk_blob_sync_md`	Cluster allocation and crash-consistent metadata.

10. Interview Prep

Questions and concise answers

What costs does SPDK remove from the kernel path?	Syscalls on the hot path, bio/request allocation, block scheduling, interrupts, wakeups, and shared queue lock contention.
What happens if a reactor callback blocks?	Every device, socket, timer, and message owned by that reactor stops making progress until the callback returns.
How is an NVMe command submitted?	Fill an SQ entry, write the SQ tail doorbell, poll the CQ phase tag, consume the CQE, and update the CQ head doorbell.
Why does BDEV need I/O channels?	They provide per-thread backend state, such as qpairs, so stacked modules avoid central locks.
How does NVMe-oF RDMA serve a read?	The initiator sends a command capsule, the target reads local storage, RDMA WRITEs data into the initiator buffer, then sends a response capsule.
How does vhost-user improve VM storage?	SPDK polls guest virtqueues in shared memory, bypassing QEMU emulation on the data path.

32. SPDK Kernel-Bypass Storage
