§ 9.1 – 9.19 DPDK: Kernel Bypass · EAL · PMD · RSS · Mbuf · NUMA · rte_ring · ACL/LPM · SR-IOV · Offload · SIMD · virtio · Pipeline · Hot Upgrade · Session · QoS · Bonding · OVS-DPDK
Why kernel networking fails at line rate (§9.1) · EAL startup 10-step flow (§9.2) · PMD RX/TX rings (§9.3) · multi-queue RSS (§9.4) · huge pages and mbufs (§9.5) · NUMA locality, lock-free rings, cache optimization, hardware classification, SR-IOV, NIC offloads, SIMD batching, virtio/vhost-user, R2C/pipeline models, hot upgrade, session fast/slow paths, QoS, bonding, and OVS-DPDK (§9.6–§9.19)
1. Overview
DPDK (Data Plane Development Kit) eliminates every per-packet overhead in the Linux kernel networking stack: interrupts, system calls, sk_buff allocation, copy_to_user(), and lock contention in netfilter and the routing table. Instead, a Poll Mode Driver (PMD) runs entirely in userspace, mapping NIC registers directly via UIO or VFIO, and spins in a tight loop checking descriptor DD (Descriptor Done) bits — no interrupts, no syscalls, no copies.
The result: throughput rises from roughly 1 Mpps through the kernel stack to the full 14.88 Mpps line rate for 64-byte frames on one 10 GbE port, using a single core.
2. § 9.1 — Why DPDK: Kernel Bypass Architecture
The Six Kernel Networking Bottlenecks
At 10 Gbps with 64-byte frames, the NIC delivers 14.88 million packets per second. Each packet gets only 67 nanoseconds of CPU time. The Linux kernel path burns most of that budget before the application even sees the data.
| Bottleneck | Cost | DPDK Elimination |
|---|---|---|
| IRQ per packet | ~2–5 µs total (handler + softirq schedule) | Poll mode — DD bit check, zero interrupts |
| sk_buff allocation | slab allocator per packet, ~200 ns, cache miss | Pre-allocated mbuf pool in huge pages, O(1) from per-lcore cache |
| copy_to_user() | memcpy kernel→user, pollutes cache | Packet stays in huge-page mbuf, app reads in-place — zero copy |
| recv() syscall | user/kernel mode switch, ~100–200 ns per call | No syscall: the PMD loop is pure userspace |
| Lock contention | netfilter, routing table, socket hash under high PPS | No kernel stack at all — app owns the data path |
| Cache pollution | kernel stack traversal touches many cold lines | Huge pages + DDIO → packets land in LLC before CPU reads them |
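The 14.88 Mpps and 67 ns figures quoted above follow from Ethernet framing arithmetic. The sketch below reproduces them, assuming the standard 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, inter-frame gap); the function names are illustrative and not part of DPDK.

```c
/* Bytes each frame occupies on the wire beyond the frame itself:
 * 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap. */
#define WIRE_OVERHEAD_BYTES 20u

/* Packets per second at full line rate for a given frame size. */
static double line_rate_pps(double link_bps, unsigned frame_bytes)
{
    double bits_on_wire = (double)(frame_bytes + WIRE_OVERHEAD_BYTES) * 8.0;
    return link_bps / bits_on_wire;
}

/* CPU time budget per packet, in nanoseconds, if one core handles the port. */
static double ns_per_packet(double link_bps, unsigned frame_bytes)
{
    return 1e9 / line_rate_pps(link_bps, frame_bytes);
}
```

For a 10 Gbps link and 64-byte frames, each frame occupies 84 bytes (672 bits) on the wire, so `line_rate_pps(10e9, 64)` is about 14.88 million and `ns_per_packet(10e9, 64)` is about 67.2.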
UIO vs VFIO vs AF_XDP — Bypass Mechanisms Compared
For UIO and VFIO, the NIC is detached from its kernel driver and handed to userspace; AF_XDP and the af_packet PMD leave the kernel driver in place. The four options below differ in their security and performance trade-offs.
| Mechanism | Module | IOMMU | Root needed | Container-safe | Best for |
|---|---|---|---|---|---|
| UIO (igb_uio) | igb_uio.ko | No — DMA unrestricted | Yes | No | Dev/test, trusted bare-metal |
| VFIO (vfio-pci) | vfio-pci.ko | Yes — IOMMU group isolation | Yes (or CAP_SYS_ADMIN) | Yes | Production, SR-IOV VFs, containers |
| AF_XDP (kernel) | built-in | Kernel handles DMA | No (CAP_NET_RAW) | Yes | Keep kernel features, selectively accelerate |
| Kernel PMD (af_packet) | built-in | N/A | No | Yes | Debug, low PPS, compatibility |
How VFIO Works
VFIO groups devices by IOMMU group (devices that share an IOMMU context must be in the same group). The workflow: unbind the NIC from its kernel driver → echo vfio-pci > /sys/bus/pci/.../driver_override → DPDK EAL opens /dev/vfio/<group> → calls the VFIO_IOMMU_MAP_DMA ioctl to register hugepage memory with the IOMMU → NIC can only DMA into registered regions. UIO skips the IOMMU entirely: faster to set up, but a buggy (or malicious) userspace process can DMA to any physical address.
3. § 9.2 — DPDK Startup Flow: EAL Initialization
rte_eal_init() — 10-Step Sequence
Every DPDK application calls rte_eal_init(argc, argv) as its first act. This one call bootstraps the entire DPDK runtime: CPU topology, memory, devices, and worker threads. If it returns a negative number, the application must exit — the environment is not usable.
| CLI Flag | Purpose |
|---|---|
| --lcores 0-3 | Use logical cores 0, 1, 2, 3; EAL creates one pthread per lcore |
| --socket-mem 4096,4096 | Allocate 4096 MB (4 GB) of huge pages on each of NUMA sockets 0 and 1 |
| --file-prefix myapp | Namespace for hugepage files; allows multiple DPDK instances on the same host |
| --proc-type primary | This process owns hugepages and devices; secondary processes attach later |
| --proc-type secondary | Attach to an existing primary's shared memory (hot upgrade pattern) |
| --allow 01:00.0 | Whitelist (probe) only this PCI device; all others are ignored |
| --vdev net_ring0 | Create a virtual device (ring PMD); useful for testing without a real NIC |
Multi-Process Mode — Shared Hugepage Memory
DPDK supports running multiple cooperating processes on one host. The primary process allocates hugepages and initializes devices; one or more secondary processes attach by mmap()-ing the same hugepage files. Data structures stored in named rte_memzone regions (mempool, rings, flow tables) are accessible from both — at the same virtual addresses, because DPDK maps the files at a fixed base address. This is the foundation of DPDK hot upgrade.
Minimal C Demo — EAL Startup Simulation
Real rte_eal_init() requires DPDK libraries and hardware. This simulation traces the same 10 steps in plain C so you can follow the sequence mentally.
4. § 9.3 — Poll Mode Driver (PMD) — Deep Dive
PMD Initialization Sequence
After EAL init, the application configures each NIC port in three steps: set queue counts and offload flags, allocate descriptor rings, then start the device. Each step maps directly to a NIC register write via the mapped BAR.
| API Call | What it does to the NIC |
|---|---|
| rte_eth_dev_configure(port, nb_rxq, nb_txq, &conf) | Writes NIC control registers: queue count, RSS enable, offload flags |
| rte_eth_rx_queue_setup(port, q, nb_desc, socket, &rxconf, mp) | Allocates desc ring (DMA-coherent), fills each desc with a mempool mbuf's iova |
| rte_eth_tx_queue_setup(port, q, nb_desc, socket, &txconf) | Allocates TX desc ring; sets tx_free_thresh (batch-free completed mbufs) |
| rte_eth_dev_start(port) | Enables NIC, configures MAC filter, enables RX/TX, links up |
| rte_eth_rx_burst(port, q, mbufs, 32) | Hot path: scans DD bits, harvests up to 32 mbufs, refills ring, rings doorbell |
| rte_eth_tx_burst(port, q, mbufs, n) | Hot path: fills TX descs with mbuf iovas, rings TDT doorbell, checks tx_free_thresh |
RX Descriptor Ring — NIC Fills, PMD Drains
The RX ring is a circular array of fixed-size descriptors in DMA-accessible memory (inside huge pages). The PMD pre-fills every slot with the physical address (buf_iova) of an empty mbuf from the mempool. When a packet arrives, the NIC DMA-writes the packet bytes into the pointed-to buffer and sets DD=1. The PMD polls, harvests completed descriptors, and immediately refills each slot with a fresh mbuf before ringing the doorbell (writing the new tail index to the RDT BAR register).
TX Descriptor Ring — PMD Fills, NIC Drains
TX is symmetric. The PMD fills each descriptor with the outgoing mbuf's buf_iova, the packet length, and command flags (EOP = end of packet, RS = report status, IFCS = insert FCS, the Ethernet CRC). It then writes the new tail to the TDT BAR register (the TX doorbell). The NIC DMA-reads the packet bytes over PCIe and transmits. The PMD reclaims completed descriptors (DD=1) in batches of tx_free_thresh (commonly 32; the default is PMD-specific) to amortize the mempool free cost.
PMD RX Burst — Step-by-Step Code Path
Burst Design — Why 32 Packets?
Applications conventionally call rte_eth_rx_burst() with a burst size of 32, so each call returns up to 32 packets. The number is not arbitrary:
- Amortizes the doorbell write: one PCIe transaction to update RDT costs ~100 ns; doing it once per 32 packets costs about 3 ns per packet.
- Fits in a cache line prefetch window: with a prefetch-ahead distance of 3–4, 32 descriptors keep the CPU pipeline full without exceeding L1 capacity.
- Aligns with SIMD width: 32 × 64-bit status words = 256 bytes, which is eight 256-bit AVX2 registers (or four 512-bit AVX-512 registers) for batch DD-bit checking.
Minimal C Demo — PMD RX Descriptor Ring
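The following is a minimal simulation of the RX descriptor ring described above, not real PMD code. The descriptor layout, the `nic_receive()` stand-in for hardware, and the 2048-byte buffer stride are all assumptions made for illustration; a real PMD reads NIC-specific descriptors and writes the tail to a mapped BAR register.

```c
#include <stdint.h>

#define RX_RING_SIZE 8u                    /* power of two so indexes wrap with a mask */
#define RX_RING_MASK (RX_RING_SIZE - 1u)

struct rx_desc {
    uint64_t buf_iova;                     /* where the NIC should DMA the packet */
    uint16_t length;                       /* written by the NIC on completion */
    uint8_t  dd;                           /* Descriptor Done: 1 = NIC finished */
};

struct rx_ring {
    struct rx_desc desc[RX_RING_SIZE];
    uint32_t next_to_clean;                /* next slot the PMD will inspect */
    uint32_t tail;                         /* last value written to the simulated RDT */
    uint64_t next_free_iova;               /* hypothetical bump allocator for refills */
};

/* Stand-in for the NIC: completes n descriptors starting at slot `start`. */
static void nic_receive(struct rx_ring *r, uint32_t start, unsigned n, uint16_t len)
{
    for (unsigned i = 0; i < n; i++) {
        struct rx_desc *d = &r->desc[(start + i) & RX_RING_MASK];
        d->length = len;
        d->dd = 1;
    }
}

/* Harvest up to `max` completed descriptors, refill each slot, then
 * update the tail once for the whole burst. Returns packets harvested. */
static unsigned rx_burst(struct rx_ring *r, uint16_t *lens, unsigned max)
{
    unsigned n = 0;

    while (n < max) {
        struct rx_desc *d = &r->desc[r->next_to_clean & RX_RING_MASK];
        if (!d->dd)
            break;                         /* NIC has not completed this slot yet */
        lens[n++] = d->length;
        d->buf_iova = r->next_free_iova;   /* refill with a fresh buffer */
        r->next_free_iova += 2048u;
        d->dd = 0;
        r->next_to_clean++;
    }
    if (n > 0)                             /* one doorbell write per burst */
        r->tail = (r->next_to_clean - 1u) & RX_RING_MASK;
    return n;
}
```

The loop stops at the first descriptor whose DD bit is clear, which is why an idle poll costs only a single memory read.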
Minimal C Demo — PMD TX Path
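A comparable simulation of the TX side is sketched below, assuming a threshold of 32 for batched reclaim. The command-bit values, the `nic_transmit_all()` helper, and the `doorbells`/`freed` counters are illustrative only; real descriptor formats vary by NIC.

```c
#include <stdint.h>

#define TX_RING_SIZE   64u
#define TX_RING_MASK   (TX_RING_SIZE - 1u)
#define TX_FREE_THRESH 32u

/* Illustrative command bits; actual encodings are NIC-specific. */
#define TX_CMD_EOP  0x01u
#define TX_CMD_IFCS 0x02u
#define TX_CMD_RS   0x08u

struct tx_desc { uint64_t buf_iova; uint16_t length; uint8_t cmd; uint8_t dd; };

struct tx_ring {
    struct tx_desc desc[TX_RING_SIZE];
    uint32_t next_to_use;     /* next slot the PMD fills (free-running) */
    uint32_t next_to_clean;   /* oldest slot not yet reclaimed (free-running) */
    uint32_t tail;            /* simulated TDT doorbell value */
    unsigned doorbells;       /* number of doorbell writes */
    unsigned freed;           /* buffers returned to the pool */
};

static unsigned tx_in_flight(const struct tx_ring *r)
{
    return r->next_to_use - r->next_to_clean;
}

/* Reclaim completed descriptors only once a full batch is outstanding. */
static void tx_reclaim(struct tx_ring *r)
{
    if (tx_in_flight(r) < TX_FREE_THRESH)
        return;
    while (r->next_to_clean != r->next_to_use) {
        struct tx_desc *d = &r->desc[r->next_to_clean & TX_RING_MASK];
        if (!d->dd)
            break;            /* NIC has not sent this one yet */
        d->dd = 0;
        r->freed++;
        r->next_to_clean++;
    }
}

/* Fill descriptors for up to n packets and ring the doorbell once. */
static unsigned tx_burst(struct tx_ring *r, const uint64_t *iova,
                         const uint16_t *len, unsigned n)
{
    unsigned sent = 0;

    tx_reclaim(r);
    while (sent < n && tx_in_flight(r) < TX_RING_SIZE) {
        struct tx_desc *d = &r->desc[r->next_to_use & TX_RING_MASK];
        d->buf_iova = iova[sent];
        d->length = len[sent];
        d->cmd = TX_CMD_EOP | TX_CMD_IFCS | TX_CMD_RS;
        d->dd = 0;
        r->next_to_use++;
        sent++;
    }
    if (sent > 0) {
        r->tail = r->next_to_use & TX_RING_MASK;
        r->doorbells++;
    }
    return sent;
}

/* Stand-in for the NIC: marks every outstanding descriptor as transmitted. */
static void nic_transmit_all(struct tx_ring *r)
{
    for (uint32_t i = r->next_to_clean; i != r->next_to_use; i++)
        r->desc[i & TX_RING_MASK].dd = 1;
}
```

Sending 32 packets costs one doorbell write, and the 32 buffers are reclaimed together at the start of a later `tx_burst()` call rather than one at a time.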
5. § 9.4 — NIC Multi-Queue, RSS & Flow Classification
Overview — One Queue per Core, No Locks
Modern NICs expose dozens or hundreds of RX/TX queues. Each queue is an independent descriptor ring, so the standard DPDK design is queue-to-lcore ownership: lcore 0 polls queue 0, lcore 1 polls queue 1, and so on. No two cores mutate the same RX ring, the application avoids locks on the hot path, and packet ordering is preserved inside a flow because RSS sends the same 5-tuple to the same queue.
Key Data Structures — RSS RETA and Flow Rules
| Structure / API | Fields that matter | Purpose |
|---|---|---|
| RSS key | 40 bytes on many NICs | Seed used by Toeplitz hash; changing it changes distribution |
| RSS hash input | src/dst IP, src/dst port, protocol | Stable flow identity; same 5-tuple maps to same queue |
| RETA | 128 or 512 entries, each entry = queue id | hash index → RX queue; lets software rebalance queues without changing hash |
| rte_flow pattern | eth, ipv4, tcp, udp, vxlan, masks | Exact or masked match over packet headers |
| rte_flow action | queue, rss, drop, mark, count, encap, decap | Hardware action taken before packet DMA reaches memory |
Core Mechanism — RSS Queue Selection
Background: A 100 Gbps NIC cannot push all packets through one RX ring. It must spread flows across cores without reordering packets inside one TCP connection.
Plan: 1) hash the packet 5-tuple in hardware, 2) mask the hash into the RSS indirection table, 3) read the selected queue id, 4) DMA the packet into that queue's descriptor ring, 5) let the owning lcore poll it.
Example: A TCP flow 10.0.0.1:12345 → 10.0.0.2:443 hashes to 0x91ab0025. If the RETA has 128 entries, index 0x25 is read; if that entry contains queue 3, every packet in that flow is DMA-written into RX queue 3 and processed by lcore 3.
rte_flow — Precise Hardware Classification
RSS is probabilistic load distribution. rte_flow is explicit steering: match a packet pattern, then run an action such as queue, drop, mark, count, RSS, encap, or decap. Smart NIC projects use this to steer control-plane traffic, tenant ports, or tunnel flows before the packet ever touches CPU caches.
Minimal C Demo — RSS Queue Selection
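A minimal sketch of the RETA lookup step follows. It treats the 32-bit hash as an input (the Toeplitz computation itself is omitted) and assumes a 128-entry table; `reta_init()` and `rss_select_queue()` are hypothetical helper names, not DPDK APIs.

```c
#include <stdint.h>

#define RETA_SIZE 128u   /* must be a power of two for the mask below */

/* Spread table entries round-robin across nb_queues RX queues. */
static void reta_init(uint16_t reta[RETA_SIZE], uint16_t nb_queues)
{
    for (uint32_t i = 0; i < RETA_SIZE; i++)
        reta[i] = (uint16_t)(i % nb_queues);
}

/* The low bits of the hash index the table; the entry is the RX queue id. */
static uint16_t rss_select_queue(const uint16_t reta[RETA_SIZE], uint32_t hash)
{
    return reta[hash & (RETA_SIZE - 1u)];
}
```

With the hash from the example above, `0x91ab0025 & 0x7f` is `0x25`, so the flow lands on whatever queue entry `0x25` holds. Rewriting that one entry moves the flow to a different queue without touching the hash key, which is how software rebalances load.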
6. § 9.5 — Memory Management: Huge Pages, Mempool & Mbuf
Overview — Pre-Allocate Everything the NIC Will Touch
DPDK avoids runtime allocation in the packet path. EAL maps pinned huge pages, builds NUMA-local memzones, then creates rte_mempool objects full of fixed-size rte_mbuf packet buffers. RX descriptors point directly at mbuf data rooms by IOVA, so the NIC can DMA packets into memory that userspace already owns.
Key Data Structures — Huge Page, Mempool, Mbuf
| Object | Fields / Parameters | Purpose |
|---|---|---|
| Huge page | 2 MB or 1 GB, mmap()+mlock(), registered with VFIO | Large pinned DMA memory; reduces TLB misses and prevents swap |
| rte_mempool | name, n, cache_size, priv_size, data_room_size, socket_id | Fixed-size object allocator, usually for mbufs |
| Per-lcore cache | array of object pointers, usually 256-512 entries | Fast alloc/free without CAS or shared cache-line bouncing |
| Central ring | lock-free rte_ring backend | Bulk refill/drain path when local cache is empty or full |
| rte_mbuf | buf_addr, buf_iova, data_off, data_len, pkt_len, next, ol_flags | Packet metadata plus data room used by NIC and application |
rte_mbuf Layout
An mbuf begins with hot metadata, then packet headroom, packet bytes, and tailroom. The packet pointer is not always buf_addr; DPDK computes it as buf_addr + data_off, which is what rte_pktmbuf_mtod() returns.
Core Mechanism — Mempool Fast Path
Background: At 20 Mpps, a normal malloc/free per packet would destroy throughput through locks, metadata writes, and cache misses.
Plan: 1) allocate all mbufs at startup from huge pages, 2) let each lcore allocate from its local cache, 3) refill or drain in bulk from the central ring, 4) return mbufs to the same NUMA socket whenever possible.
Example: lcore 2 starts with 4 cached mbufs. It receives a burst of 32 packets, consumes its 4 local objects, then bulk-pulls 32 more from the central ring. The shared ring CAS cost is paid once for the batch, not once per packet.
Multi-Segment Mbuf Chain
Jumbo frames, TSO, and scatter-gather I/O use chained mbufs. The first mbuf stores pkt_len for the whole packet and nb_segs for the chain length; each segment stores its own data_len. The NIC can transmit the chain by DMA reading each segment address, avoiding a linear copy.
Minimal C Demo — Mempool + Mbuf Fast Path
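The sketch below simulates the per-lcore cache in front of a shared pool. It is a simplification under stated assumptions: a plain array stands in for the central rte_ring, the cache holds 64 objects, refills move 32 at a time, and `shared_ops` is a hypothetical counter that records how often the shared structure is touched.

```c
#include <stdint.h>
#include <stddef.h>

#define POOL_SIZE  256u
#define CACHE_SIZE 64u
#define BULK       32u     /* objects moved per trip to the shared pool */
#define DATA_ROOM  2048u
#define HEADROOM   128u

struct mbuf { uint8_t *buf_addr; uint16_t data_off; uint16_t data_len; };

struct pool {
    struct mbuf *free_objs[POOL_SIZE];   /* stand-in for the central ring */
    unsigned free_count;
    unsigned shared_ops;                 /* trips to the shared structure */
};

struct lcore_cache { struct mbuf *objs[CACHE_SIZE]; unsigned len; };

static struct mbuf mbuf_store[POOL_SIZE];
static uint8_t data_store[POOL_SIZE][DATA_ROOM];

/* Pre-allocate every mbuf once, at startup. */
static void pool_init(struct pool *p)
{
    for (unsigned i = 0; i < POOL_SIZE; i++) {
        mbuf_store[i].buf_addr = data_store[i];
        mbuf_store[i].data_off = HEADROOM;
        mbuf_store[i].data_len = 0;
        p->free_objs[i] = &mbuf_store[i];
    }
    p->free_count = POOL_SIZE;
    p->shared_ops = 0;
}

/* Allocate from the local cache; refill in bulk only when it is empty. */
static struct mbuf *mbuf_alloc(struct pool *p, struct lcore_cache *c)
{
    if (c->len == 0) {
        unsigned n = p->free_count < BULK ? p->free_count : BULK;
        if (n == 0)
            return NULL;                 /* pool exhausted */
        for (unsigned i = 0; i < n; i++)
            c->objs[c->len++] = p->free_objs[--p->free_count];
        p->shared_ops++;                 /* one shared access for the batch */
    }
    return c->objs[--c->len];
}

/* Free to the local cache; drain in bulk only when it is full. */
static void mbuf_free(struct pool *p, struct lcore_cache *c, struct mbuf *m)
{
    if (c->len == CACHE_SIZE) {
        for (unsigned i = 0; i < BULK; i++)
            p->free_objs[p->free_count++] = c->objs[--c->len];
        p->shared_ops++;
    }
    c->objs[c->len++] = m;
}

/* Packet data starts at buf_addr + data_off, not at buf_addr. */
static void *mbuf_mtod(const struct mbuf *m)
{
    return m->buf_addr + m->data_off;
}
```

Starting from a cache holding 4 objects, allocating 32 mbufs touches the shared pool exactly once: four allocations drain the cache, the fifth triggers a bulk refill, and the rest are served locally.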
7. § 9.6 — NUMA-Aware Programming in DPDK
Overview — Keep NIC, Queue, Mempool, and Lcore on One Socket
NUMA is not a small tuning detail in DPDK. A packet path touches the NIC DMA engine, the RX descriptor ring, the mbuf metadata, the packet bytes, and the polling lcore. If any of those live on the wrong socket, every packet pays an inter-socket hop. At high PPS that becomes a throughput cliff, commonly a 30–50% drop.
Wrong Placement — Remote Memory on Every Packet
The classic mistake is binding a NIC on socket 0, allocating its mempool on socket 0, but polling it from an lcore on socket 1. The NIC DMA is local, but the CPU reads packet bytes and updates mbuf metadata through the interconnect. The fix is mechanical: use rte_eth_dev_socket_id(port) for the NIC, allocate with that socket_id, and assign only lcores whose rte_socket_id() matches.
Key Data Structures — NUMA Placement Inputs
| API / File | Returns | How to use it |
|---|---|---|
| rte_eth_dev_socket_id(port) | NUMA socket of the PCI device | Choose mempool socket and lcore set for this port |
| rte_socket_id() | NUMA socket of the current lcore | Validate that the polling lcore is local to the port |
| /sys/bus/pci/devices/<BDF>/numa_node | Kernel view of the NIC's NUMA node | Debug bad topology or BIOS/ACPI reporting issues |
| rte_malloc_socket(size, align, socket) | Memory from a specific socket | Allocate per-lcore flow tables, stats, rings, and queues locally |
| rte_ring_create(name, count, socket, flags) | Ring metadata and slots on a socket | For cross-socket rings, bias toward the consumer socket |
Core Mechanism — Locality Walkthrough
Background: A dual-socket host has port 0 attached to socket 0. You need to decide where to allocate the mempool and which lcore should poll RX queue 0.
Plan: 1) read the NIC socket, 2) allocate the mempool and descriptor rings on that socket, 3) pin the polling lcore to the same socket, 4) keep per-lcore tables and stats on the same socket, 5) use rings only when a packet must cross sockets.
Example: port 0 reports socket 0. Mempool mp0 is created with socket 0. lcore 2 also reports socket 0, so it polls queue 0. Packet bytes DMA into socket 0 memory, DDIO places them in the local LLC, and lcore 2 reads them without a remote hop.
Minimal C Demo — NUMA Placement Check
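The placement rule can be sketched without DPDK by treating socket ids as plain integers. In a real application the inputs would come from rte_eth_dev_socket_id() and the per-lcore socket query; the helper names and the `struct placement` type below are hypothetical.

```c
#include <stddef.h>

/* Socket ids of the three things that must agree for a local packet path. */
struct placement {
    int port_socket;
    int mempool_socket;
    int lcore_socket;
};

/* Number of components that sit on a different socket than the NIC.
 * Zero means every packet access stays local. */
static int remote_components(const struct placement *p)
{
    return (p->mempool_socket != p->port_socket) +
           (p->lcore_socket != p->port_socket);
}

/* Pick the first lcore whose socket matches the port; -1 if none does. */
static int pick_local_lcore(const int *lcore_socket, size_t nb_lcores,
                            int port_socket)
{
    for (size_t i = 0; i < nb_lcores; i++)
        if (lcore_socket[i] == port_socket)
            return (int)i;
    return -1;
}
```

For a hypothetical host where lcores 0 and 1 sit on socket 1 and lcores 2 and 3 on socket 0, a port on socket 0 is assigned lcore 2. The wrong-placement case described earlier (mempool local, polling lcore remote) reports one remote component.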
8. § 9.7 — Lock-Free Ring Buffer: rte_ring Deep Dive
Overview — Bounded FIFO Without Locks
rte_ring is DPDK's shared queue primitive. It is a power-of-two circular array of object pointers with separate producer and consumer cursors. Multi-producer mode uses CAS to reserve a range of slots, then publishes the range by advancing prod.tail. Single-producer and single-consumer modes remove the CAS and become plain loads/stores plus barriers.
Key Data Structures — Head/Tail Split
| Field | Writer | Purpose |
|---|---|---|
| prod.head | producer CAS or store | Reservation cursor; producers claim slots by moving this first |
| prod.tail | producer store after copy | Publication cursor; consumers cannot see objects until this advances |
| cons.head | consumer CAS or store | Reservation cursor for dequeue |
| cons.tail | consumer store after read | Publication cursor; producers use it to calculate free space |
| mask | constant | Wrap index with index & mask instead of slow modulo |
| ring[] | producers write, consumers read | Contiguous array of void* object pointers |
Core Mechanism — Multi-Producer Enqueue
Background: Two worker lcores need to enqueue packets to one TX lcore without a mutex. Both may enter the enqueue path at the same time.
Plan: 1) reserve slots by CAS-ing prod.head, 2) copy objects into the reserved slots, 3) issue a write barrier, 4) wait until any earlier producer has published, 5) advance prod.tail.
Example: producer A reserves slots 8–11 and producer B reserves slots 12–15. B may finish copying first, but it cannot publish tail 16 until A publishes tail 12. That spin-wait preserves FIFO order for the consumer.
Minimal C Demo — MPMC Enqueue
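The five-step enqueue can be sketched with C11 atomics. This is a simplified model rather than the rte_ring source: it uses free-running 32-bit cursors, all-or-nothing enqueue, a 16-slot ring, and pairs the multi-producer enqueue with a single-consumer dequeue to keep the example short. All type and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 16u                 /* power of two */
#define RING_MASK (RING_SIZE - 1u)

struct ring {
    _Atomic(uint32_t) prod_head, prod_tail;
    _Atomic(uint32_t) cons_head, cons_tail;
    void *slots[RING_SIZE];
};

static void ring_init(struct ring *r)
{
    atomic_init(&r->prod_head, 0);
    atomic_init(&r->prod_tail, 0);
    atomic_init(&r->cons_head, 0);
    atomic_init(&r->cons_tail, 0);
}

/* Multi-producer enqueue of exactly n objects; returns n or 0 if no room. */
static unsigned ring_mp_enqueue(struct ring *r, void *const *objs, unsigned n)
{
    uint32_t head = atomic_load_explicit(&r->prod_head, memory_order_relaxed);
    uint32_t next;

    /* 1) Reserve slots by CAS-ing prod_head; on failure `head` is reloaded. */
    do {
        uint32_t cons_tail =
            atomic_load_explicit(&r->cons_tail, memory_order_acquire);
        if (n > RING_SIZE - (head - cons_tail))
            return 0;
        next = head + n;
    } while (!atomic_compare_exchange_weak_explicit(
                 &r->prod_head, &head, next,
                 memory_order_relaxed, memory_order_relaxed));

    /* 2) Copy objects into the reserved slots. */
    for (unsigned i = 0; i < n; i++)
        r->slots[(head + i) & RING_MASK] = objs[i];

    /* 4) Wait until every earlier producer has published its range. */
    while (atomic_load_explicit(&r->prod_tail, memory_order_relaxed) != head)
        ;

    /* 3) + 5) The release store is the write barrier and the publication. */
    atomic_store_explicit(&r->prod_tail, next, memory_order_release);
    return n;
}

/* Single-consumer dequeue of up to n objects; returns the number taken. */
static unsigned ring_sc_dequeue(struct ring *r, void **objs, unsigned n)
{
    uint32_t head = atomic_load_explicit(&r->cons_head, memory_order_relaxed);
    uint32_t avail =
        atomic_load_explicit(&r->prod_tail, memory_order_acquire) - head;

    if (n > avail)
        n = avail;
    for (unsigned i = 0; i < n; i++)
        objs[i] = r->slots[(head + i) & RING_MASK];
    atomic_store_explicit(&r->cons_head, head + n, memory_order_relaxed);
    atomic_store_explicit(&r->cons_tail, head + n, memory_order_release);
    return n;
}
```

The spin in step 4 is what forces producer B in the example above to wait for producer A: B's `head` is 12, and `prod_tail` only reaches 12 once A has published.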
9. § 9.8 — Cache Optimization in DPDK
Overview — Performance Is Cache-Line Ownership
DPDK's hot path is designed around cache lines: align frequently written fields, avoid false sharing, prefetch packet metadata before use, and keep data owned by the lcore that mutates it. At 20 Mpps, a single remote cache-line transfer can cost more than the useful packet work.
Key Techniques — Alignment, Prefetch, MESI, DDIO
| Technique | What it prevents | DPDK pattern |
|---|---|---|
| __rte_cache_aligned | False sharing between unrelated hot fields | Place prod and cons ring cursors on separate 64-byte cache lines |
| rte_prefetch0() | Waiting on DRAM or LLC when packet is first touched | Prefetch mbuf i+3 while processing mbuf i |
| Per-lcore data | MESI S->I invalidation traffic across cores | Stats, flow caches, and scratch buffers owned by one lcore |
| DDIO | Packet DMA landing only in DRAM | NIC writes packet bytes into LLC, so PMD reads at cache latency |
| Write combining | Many small PCIe doorbell writes | Batch tail updates and map BAR regions WC where supported |
Core Mechanism — Prefetch Pipeline
Background: The PMD receives a burst of 32 mbufs. Each mbuf points to packet bytes that may be in LLC because of DDIO, but the metadata and payload still need to be pulled into L1 before parsing.
Plan: 1) process the current mbuf, 2) prefetch a future mbuf 3–4 packets ahead, 3) keep the CPU doing useful parsing while the cache hierarchy fetches the future packet, 4) tune the distance so it hides latency without evicting useful data.
Example: while parsing mbuf[0], the loop prefetches mbuf[3]. By the time the loop reaches mbuf[3], its first cache lines are already in L1.
Minimal C Demo — Prefetch-Ahead Loop
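A minimal version of the prefetch-ahead loop is sketched below. It uses the GCC/Clang `__builtin_prefetch` intrinsic as a stand-in for rte_prefetch0(), a prefetch distance of 3, and a trivial byte sum in place of real header parsing; the `struct pkt` type is hypothetical.

```c
#include <stdint.h>

#define PREFETCH_AHEAD 3u

struct pkt { uint8_t data[64]; };   /* one cache line of packet bytes */

/* Hint the cache hierarchy to load the line at p; a no-op on other compilers. */
static inline void prefetch0(const void *p)
{
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, 0, 3);
#else
    (void)p;
#endif
}

/* Process a burst while prefetching PREFETCH_AHEAD packets in front. */
static uint32_t process_burst(struct pkt *const *pkts, unsigned n)
{
    uint32_t sum = 0;

    /* Warm-up: start loads for the first few packets before the loop. */
    for (unsigned i = 0; i < PREFETCH_AHEAD && i < n; i++)
        prefetch0(pkts[i]);

    for (unsigned i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            prefetch0(pkts[i + PREFETCH_AHEAD]);
        sum += pkts[i]->data[0];     /* stand-in for header parsing */
    }
    return sum;
}
```

The prefetch is only a hint, so the loop computes the same result with or without it; the distance is a tuning knob to measure, not a fixed constant.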
10. § 9.9 — ACL & LPM Classification Libraries
Overview — Classify Packets Before the Slow Path
Fast packet processing is mostly classification: decide which route, tenant, ACL rule, or session a packet belongs to without touching a long chain of branches. DPDK provides three core libraries for that job: rte_lpm for IPv4 longest-prefix route lookup, rte_acl for multi-field packet rules, and rte_hash for exact-match flow state.
Key Data Structures — LPM, ACL, Hash
| Library | Important fields / shape | Purpose |
|---|---|---|
| rte_lpm | tbl24[2^24] plus optional tbl8 groups; entry has valid, ext_entry, depth, next_hop | IPv4 route lookup in one or two memory accesses |
| rte_lpm6 | multi-stride trie for 128-bit IPv6 prefixes | IPv6 route lookup without an infeasibly large direct-indexed table |
| rte_acl | compiled DFA/trie from rules over src/dst IP, ports, protocol, priority | Batch ACL lookup using SIMD state transitions |
| rte_hash | cuckoo buckets, signatures, key store, optional data pointer | Exact 5-tuple/session lookup with two candidate buckets |
Core Mechanism — DIR-24-8 Longest Prefix Match
Background: A router must map every destination IP to the most specific route. A trie works but costs several dependent memory loads per packet.
Plan: 1) use the top 24 bits as a direct array index, 2) return immediately for /0 through /24 routes, 3) follow one extra 256-entry table only for /25 through /32 routes, 4) store the selected next hop in the matching entry.
Example: route 203.0.113.0/24 has next hop 3. A more specific 203.0.113.200/32 has next hop 7. The first 24 bits index tbl24; the ext_entry bit tells lookup to index tbl8 with the low 8 bits of the address (200) and return 7 for that one host. Every other host in the /24 still resolves to next hop 3.
Cuckoo Hashing — Exact Flow Lookup
rte_hash uses cuckoo hashing: every key has two candidate buckets, each bucket stores compact signatures for SIMD-friendly comparison, and insertions relocate existing keys only when both buckets are full. This keeps lookup predictable: compute two hashes, compare bucket signatures, then verify the full key on a signature hit.
Minimal C Demo — DIR-24-8 Lookup
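The two-level lookup can be sketched as follows. This is a deliberately reduced model, not rte_lpm: entries are 16-bit words with hypothetical flag bits, there is no per-entry depth tracking, and routes are assumed to be added from shortest to longest prefix so that later, more specific routes simply overwrite earlier ones.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TBL24_ENTRIES   (1u << 24)
#define TBL8_GROUP_SIZE 256u
#define MAX_TBL8_GROUPS 16u
#define ENTRY_VALID     0x8000u
#define ENTRY_EXT       0x4000u   /* low bits hold a tbl8 group, not a next hop */
#define ENTRY_VALUE     0x3fffu

#define IPV4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

struct lpm {
    uint16_t *tbl24;
    uint16_t tbl8[MAX_TBL8_GROUPS * TBL8_GROUP_SIZE];
    unsigned groups_used;
};

static int lpm_init(struct lpm *l)
{
    l->tbl24 = calloc(TBL24_ENTRIES, sizeof(uint16_t));   /* 32 MB, all invalid */
    memset(l->tbl8, 0, sizeof(l->tbl8));
    l->groups_used = 0;
    return l->tbl24 ? 0 : -1;
}

/* Assumes routes are added shortest prefix first (no depth bookkeeping). */
static int lpm_add(struct lpm *l, uint32_t ip, unsigned depth, uint16_t next_hop)
{
    uint16_t leaf = (uint16_t)(ENTRY_VALID | (next_hop & ENTRY_VALUE));

    if (depth > 32)
        return -1;
    if (depth <= 24) {            /* expand the prefix across tbl24 */
        uint32_t count = 1u << (24 - depth);
        uint32_t first = (ip >> 8) & ~(count - 1u);
        for (uint32_t i = 0; i < count; i++)
            l->tbl24[first + i] = leaf;
        return 0;
    }

    uint32_t idx24 = ip >> 8;
    uint16_t e = l->tbl24[idx24];
    unsigned group;

    if (e & ENTRY_EXT) {
        group = e & ENTRY_VALUE;
    } else {                      /* allocate a group, inherit the covering route */
        if (l->groups_used == MAX_TBL8_GROUPS)
            return -1;
        group = l->groups_used++;
        for (unsigned j = 0; j < TBL8_GROUP_SIZE; j++)
            l->tbl8[group * TBL8_GROUP_SIZE + j] = e;
        l->tbl24[idx24] = (uint16_t)(ENTRY_VALID | ENTRY_EXT | group);
    }

    uint32_t count = 1u << (32 - depth);
    uint32_t first = (ip & 0xffu) & ~(count - 1u);
    for (uint32_t i = 0; i < count; i++)
        l->tbl8[group * TBL8_GROUP_SIZE + first + i] = leaf;
    return 0;
}

/* One memory access for /0../24 routes, two when the entry is extended. */
static int lpm_lookup(const struct lpm *l, uint32_t ip, uint16_t *next_hop)
{
    uint16_t e = l->tbl24[ip >> 8];

    if (e & ENTRY_EXT)
        e = l->tbl8[(e & ENTRY_VALUE) * TBL8_GROUP_SIZE + (ip & 0xffu)];
    if (!(e & ENTRY_VALID))
        return -1;
    *next_hop = e & ENTRY_VALUE;
    return 0;
}
```

Adding 203.0.113.0/24 with next hop 3 and then 203.0.113.200/32 with next hop 7 reproduces the example: host .200 resolves to 7 through tbl8, while every other host in the /24 resolves to 3.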
11. § 9.10 — SR-IOV in DPDK
Overview — Split One NIC into Hardware Tenants
SR-IOV exposes one physical NIC as a Physical Function (PF) plus many Virtual Functions (VFs). The PF owns full device control: VF creation, MAC/VLAN filters, link settings, and policy. Each VF appears as its own PCIe function with private RX/TX queues and can be bound to vfio-pci for a DPDK app, container, or VM passthrough datapath.
Key Data Structures — PF, VF, Queue, IOMMU Domain
| Object | Fields / ownership | Purpose |
|---|---|---|
| PF | full PCIe function, admin queues, VF control registers | Creates VFs and programs per-VF policy |
| VF | separate BDF, queue pairs, MAC/VLAN filters, limited registers | Tenant-facing datapath with near-native DMA |
| IOMMU group | VFIO container, group fd, DMA mappings | Prevents one VF from DMA-writing another tenant's memory |
| PF flow policy | MAC, VLAN, ethertype, queue, VF id | Steers hardware-classified traffic to the right VF |
Core Mechanism — VF Passthrough Walkthrough
Background: A VM needs low-latency networking, but assigning the whole PF would give it control over every tenant on the NIC.
Plan: 1) enable VFs from the PF, 2) bind a VF to vfio-pci, 3) map guest or DPDK hugepage memory through the IOMMU, 4) configure VF MAC/VLAN rules from the PF, 5) let the VF poll its private queues directly.
Example: VF 3 is assigned to tenant A with MAC 52:54:00:aa:00:03 and VLAN 120. The PF programs filters so only those packets enter VF 3 queues; VF 3 DMA is restricted to tenant A memory by the IOMMU.
12. § 9.11 — Hardware Offload
Overview — Move Mechanical Packet Work into the NIC
Hardware offload removes repetitive per-packet work from the CPU: checksum generation, checksum verification, TCP segmentation, tunnel checksum handling, timestamping, and flow steering. In DPDK, the application still owns the packet; it marks the mbuf with ol_flags and length fields so the PMD can describe the operation in the TX descriptor.
Key Data Structures — Offload Contract
| Field / flag | Direction | Meaning |
|---|---|---|
| RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | TX config | Port capability bit: NIC can compute IPv4 header checksum |
| RTE_MBUF_F_TX_TCP_CKSUM | mbuf TX | This packet needs TCP checksum computed by NIC |
| RTE_MBUF_F_TX_TCP_SEG | mbuf TX | This packet is a large TCP segment and needs TSO |
| l2_len / l3_len / l4_len | mbuf TX | Header boundaries; NIC needs them to find checksum fields |
| tso_segsz | mbuf TX | MSS used by NIC when splitting a large segment |
| RTE_MBUF_F_RX_IP_CKSUM_GOOD | mbuf RX | NIC verified checksum and reported success |
| RTE_ETH_RX_OFFLOAD_TIMESTAMP | RX config | NIC writes hardware timestamp metadata for latency/PTP |
Core Mechanism — TSO and Checksum Offload
Background: A TCP sender wants to transmit 64 KB of payload. If the CPU splits it into MTU-sized packets and computes every checksum, it burns cycles on work the NIC can do while transmitting.
Plan: 1) build one large mbuf chain, 2) set header lengths and tso_segsz, 3) set TX offload flags, 4) hand the descriptor to the NIC, 5) let hardware segment and compute per-frame checksums.
Example: an mbuf with 64 KB TCP payload, MSS 1460, IPv4 checksum offload and TCP TSO flags becomes roughly 45 Ethernet frames on the wire. The CPU rings one TX doorbell; the NIC emits correctly checksummed frames.
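The frame count in this example is a ceiling division of the payload by the MSS. A small sketch, with hypothetical helper names, makes the arithmetic explicit:

```c
#include <stdint.h>

/* Number of wire segments the NIC emits for one TSO payload. */
static uint32_t tso_segment_count(uint32_t payload_len, uint16_t mss)
{
    if (mss == 0)
        return 0;
    return (payload_len + mss - 1u) / mss;   /* ceiling division */
}

/* Payload bytes carried by the final, possibly short, segment. */
static uint32_t tso_last_segment_len(uint32_t payload_len, uint16_t mss)
{
    if (mss == 0 || payload_len == 0)
        return 0;
    uint32_t rem = payload_len % mss;
    return rem ? rem : mss;
}
```

For 65536 bytes and an MSS of 1460, the NIC sends 44 full segments (64240 bytes) plus one final segment of 1296 bytes, which gives the 45 frames quoted above.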
13. § 9.12 — SIMD in DPDK
Overview — One Instruction, Many Packets
DPDK uses SIMD where the packet path repeats the same operation across a burst: descriptor status checks, ACL state transitions, hash bucket signature comparison, checksum arithmetic, LPM batch lookup, and optimized memory copies. The scalar mental model is still a loop over packets, but the CPU executes several lanes in parallel with SSE4.2, AVX2, or AVX-512.
Key Data Structures — SIMD-Friendly Batches
| Feature | Width | DPDK use case |
|---|---|---|
| SSE4.2 | 128-bit, 4 x 32-bit lanes | CRC32 hash acceleration, smaller ACL/classify batches |
| AVX2 | 256-bit, 8 x 32-bit lanes | Batch compares, rte_memcpy(), hash signatures, descriptor status checks |
| AVX-512 | 512-bit, 16 x 32-bit lanes | Wide ACL and lookup kernels on servers that can absorb the frequency trade-off |
| rte_cpu_get_flag_enabled() | runtime feature probe | Select the fastest safe implementation for this CPU |
| rte_mov16()/rte_mov32() | fixed-size vector copies | Move packet headers and descriptors with predictable codegen |
Core Mechanism — Batch Classification
Background: A firewall receives 32 packets from rte_eth_rx_burst(). Checking every packet through scalar branches wastes instruction bandwidth and mispredicts on mixed traffic.
Plan: 1) gather the same field from multiple mbufs, 2) load them into vector lanes, 3) compare all lanes against the rule value, 4) convert the result to a bit mask, 5) process only the matching packets.
Example: eight destination IPs are compared against 10.0.1.0/24 in one AVX2-style operation. A result mask of 0x12 means lanes 1 and 4 matched.
Minimal C Demo — Offload Flags + SIMD-Style Mask
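The mask step of batch classification can be sketched in portable C. The loop below is scalar and branch-free, written so that each iteration corresponds to one vector lane; a compiler may vectorize it, but nothing here uses real AVX2 intrinsics. The function names are hypothetical.

```c
#include <stdint.h>

#define IPV4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Compare eight destination IPs against prefix/depth (depth <= 32).
 * Bit i of the result is set when lane i matched. */
static uint32_t match_prefix_mask8(const uint32_t dst_ip[8],
                                   uint32_t prefix, unsigned depth)
{
    uint32_t netmask = depth == 0 ? 0u : 0xffffffffu << (32 - depth);
    uint32_t mask = 0;

    for (unsigned lane = 0; lane < 8; lane++)
        mask |= (uint32_t)((dst_ip[lane] & netmask) == (prefix & netmask)) << lane;
    return mask;
}

/* Visit only the matching lanes by clearing the lowest set bit each time. */
static unsigned count_matches(uint32_t mask)
{
    unsigned n = 0;

    while (mask) {
        mask &= mask - 1u;
        n++;
    }
    return n;
}
```

With lanes 1 and 4 holding addresses inside 10.0.1.0/24 and the rest outside it, the function returns 0x12, the mask from the example above.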
14. § 9.13 — Virtio, vhost-user & vDPA
Overview — Paravirtual Networking for VMs
Virtio gives a VM a simple paravirtual NIC: the guest driver writes packet buffers into virtqueues, and a host backend consumes those descriptors. With vhost-user, the backend is a DPDK userspace process such as OVS-DPDK, connected to QEMU over a Unix domain socket for control and shared hugepage memory for data. With vDPA, the NIC itself implements the virtio datapath.
Key Data Structures — Split Virtqueue
| Structure | Fields | Purpose |
|---|---|---|
| Descriptor table | addr, len, flags, next | Guest-owned packet buffers; chains describe scatter-gather packets |
| Available ring | flags, idx, ring[] | Driver-to-device queue of descriptor head indexes ready for backend work |
| Used ring | flags, idx, used_elem{id,len} | Device-to-driver completion queue after packets are consumed or produced |
| Packed virtqueue | single descriptor ring plus wrap counters | Virtio 1.1 format with fewer cache misses than split rings |
| Feature bits | CSUM, MRG_RXBUF, MQ, packed-ring | Negotiated contract between virtio driver and backend |
vhost-user — Control Socket, Shared-Memory Datapath
QEMU remains the frontend, but packet movement goes through a DPDK backend. The Unix socket carries setup messages such as memory tables, vring addresses, feature bits, and eventfd file descriptors. Packet bytes stay in shared hugepage memory, so the backend can walk guest virtqueue descriptors directly and forward bursts without a kernel tap device.
vDPA — Hardware Virtio Datapath
vDPA moves the virtio data path into the NIC. Software still handles control-plane details such as feature negotiation, queue setup, and migration state, but packet DMA goes directly between the NIC and the guest virtio rings. The VM keeps the standard virtio-net driver while reaching near passthrough performance.
Core Mechanism — vhost-user Packet Receive
Background: A VM sends or receives packets through virtio-net, but the host wants DPDK throughput instead of kernel tap networking.
Plan: 1) QEMU sends guest memory and vring addresses to the DPDK backend, 2) the guest publishes descriptors in the available ring, 3) the backend maps descriptor guest physical addresses to host virtual addresses, 4) the backend copies or attaches packet data into mbufs, 5) it updates the used ring and optionally signals the guest.
Example: the guest publishes desc 5 for a 2048-byte RX buffer. OVS-DPDK reads desc 5 through shared memory, writes a packet into that buffer, places desc 5 in the used ring with the received length, then notifies the guest through its call eventfd unless the guest has suppressed notifications.
Minimal C Demo — Virtqueue Descriptor Handoff
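The split-ring handoff can be simulated inside one process. The struct fields follow the descriptor, available-ring, and used-ring layout in the table above; the queue size of 8, the helper functions, and the use of a host pointer in `addr` are simplifications (a real backend must translate guest physical addresses first).

```c
#include <stdint.h>
#include <string.h>

#define QUEUE_SIZE         8u
#define VRING_DESC_F_WRITE 2u    /* buffer is written by the device */

struct vring_desc      { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_avail     { uint16_t flags; uint16_t idx; uint16_t ring[QUEUE_SIZE]; };
struct vring_used_elem { uint32_t id; uint32_t len; };
struct vring_used      { uint16_t flags; uint16_t idx; struct vring_used_elem ring[QUEUE_SIZE]; };

struct virtqueue {
    struct vring_desc  desc[QUEUE_SIZE];
    struct vring_avail avail;
    struct vring_used  used;
    uint16_t last_avail_idx;     /* backend's private read cursor */
};

/* Guest side: describe an empty RX buffer and publish it in the available ring. */
static void guest_post_rx_buffer(struct virtqueue *vq, uint16_t id,
                                 void *buf, uint32_t len)
{
    vq->desc[id].addr = (uint64_t)(uintptr_t)buf;
    vq->desc[id].len = len;
    vq->desc[id].flags = VRING_DESC_F_WRITE;
    vq->desc[id].next = 0;
    vq->avail.ring[vq->avail.idx % QUEUE_SIZE] = id;
    vq->avail.idx++;
}

/* Backend side: copy one packet into the next available guest buffer and
 * report completion in the used ring. Returns the descriptor id, or -1. */
static int backend_deliver(struct virtqueue *vq, const void *pkt, uint32_t pkt_len)
{
    if (vq->last_avail_idx == vq->avail.idx)
        return -1;                               /* guest posted no buffers */

    uint16_t head = vq->avail.ring[vq->last_avail_idx % QUEUE_SIZE];
    struct vring_desc *d = &vq->desc[head];

    if (pkt_len > d->len)
        return -1;                               /* buffer too small */
    memcpy((void *)(uintptr_t)d->addr, pkt, pkt_len);

    vq->used.ring[vq->used.idx % QUEUE_SIZE].id = head;
    vq->used.ring[vq->used.idx % QUEUE_SIZE].len = pkt_len;
    vq->used.idx++;
    vq->last_avail_idx++;
    return head;
}
```

Posting descriptor 5 and delivering one packet leaves `used.ring[0]` holding id 5 and the received length, which is what the guest driver reads to find its completed buffer.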
15. § 9.14 — Pipeline & Run-to-Completion Models
Overview — Two Ways to Assign Work to Cores
A DPDK datapath either keeps the full packet lifecycle on one lcore (run-to-completion) or splits work into stages connected by rte_ring queues (pipeline). The right choice is a cache-locality decision: simple L2/L3 forwarding usually prefers R2C; expensive or uneven work such as crypto, DPI, or control-plane punts often needs a pipeline.
Pipeline Model — Specialized Stages
Pipeline mode dedicates lcores to stages: RX, worker, TX, or a richer chain such as parse → ACL → crypto → route → transmit. Each handoff is a pointer enqueue/dequeue through an rte_ring, which costs cache-line traffic but lets bottleneck stages scale independently.
Key Design Trade-Offs
| Model | Strength | Cost | Best fit |
|---|---|---|---|
| Run-to-completion | No inter-core handoff; packet metadata stays hot in one cache | One lcore must perform every operation; hard to isolate slow work | L2/L3 forwarding, NAT, firewall fast path |
| Pipeline | Specialized stages; add workers only where the bottleneck is | Ring handoff, cache misses, more latency and backpressure logic | Crypto, DPI, complex service chains |
| Hybrid | R2C for common flows, pipeline for slow or exceptional flows | Two code paths and careful state sharing | Virtual switches and gateways with fast/slow path split |
Core Mechanism — Choosing R2C or Pipeline
Background: A gateway receives 32-packet bursts. Most packets only need session lookup and forwarding, but a small fraction need slow ACL and crypto work.
Plan: 1) use R2C for session hits so packet cache lines stay on the polling lcore, 2) enqueue session misses or crypto packets to a worker ring, 3) return processed packets to a TX ring, 4) measure the worker stage and add lcores only there.
Example: packets 0–29 hit the per-lcore session cache and transmit immediately on lcore 2. Packets 30–31 miss and are enqueued to a worker. The worker creates sessions, then sends those two packets to the TX lcore. The hot path avoids ring handoff for 30 out of 32 packets.
Minimal C Demo — R2C vs Pipeline Flow
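The hybrid dispatch decision can be sketched as a single pass over a burst. The session lookup is reduced to a precomputed hit flag per packet, a fixed array stands in for the worker rte_ring, and all names are hypothetical.

```c
#include <stdint.h>

#define WORKER_RING_SIZE 64u

/* Stand-in for the rte_ring that feeds the slow-path worker. */
struct worker_ring { uint16_t pkt[WORKER_RING_SIZE]; unsigned len; };

struct burst_stats { unsigned fast_path; unsigned slow_path; unsigned dropped; };

/* Session hits finish on the polling lcore (run-to-completion);
 * misses are handed to the worker stage (pipeline). */
static struct burst_stats dispatch_burst(const uint8_t *session_hit, unsigned n,
                                         struct worker_ring *wr)
{
    struct burst_stats s = {0, 0, 0};

    for (unsigned i = 0; i < n; i++) {
        if (session_hit[i]) {
            s.fast_path++;                       /* forward inline, no handoff */
        } else if (wr->len < WORKER_RING_SIZE) {
            wr->pkt[wr->len++] = (uint16_t)i;    /* enqueue packet index */
            s.slow_path++;
        } else {
            s.dropped++;                         /* backpressure: ring is full */
        }
    }
    return s;
}
```

For a 32-packet burst in which packets 0 through 29 hit the session cache, the function reports 30 fast-path packets and leaves indexes 30 and 31 in the worker ring.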
16. § 9.15 — DPDK Hot Upgrade — Deep Dive
Overview — Upgrade Without Losing Datapath State
Hot upgrade is hard because a DPDK process owns hugepage memory, NIC queues, timers, session tables, and in-flight packet bursts. The practical design is to separate durable datapath state from process-local execution state: keep sessions and flow tables in named shared memzones, attach a new process to the same hugepages, validate compatibility, then swap traffic ownership with a short drain window.
Shared Memory Persistence
Hugepage files in hugetlbfs can outlive one binary execution when the application controls cleanup. A new binary using the same --file-prefix can locate named memzones, read a schema version and generation number, and import or migrate existing state. This preserves session continuity even when NIC queues must be reconfigured.
Key State Boundaries
| State | Upgrade handling | Risk |
|---|---|---|
| Session table | Store in rte_memzone with schema version and generation | Old and new layout mismatch can corrupt forwarding decisions |
| Flow table / NAT bindings | Persist keys, actions, timeout, counters; rebuild hardware rte_flow rules | Hardware rules may briefly lag software state |
| Descriptor rings | Usually process/NIC-local; drain or reinitialize rather than share blindly | Ownership transfer bugs can duplicate or lose packets |
| Timers | Serialize next expiry or rebuild from session timestamps | Expired sessions may survive too long after restart |
| In-flight bursts | Stop admission, drain rings, then swap active process | Small transition window can still drop packets |
Core Mechanism — Multi-Process Hot Upgrade
Background: A virtual switch holds millions of sessions and must upgrade the binary without rebuilding those sessions from scratch.
Plan: 1) old primary exports sessions into versioned memzones, 2) new secondary attaches to the same hugepages, 3) new process validates schema and warms flow caches, 4) old process stops accepting new sessions and drains rings, 5) traffic ownership switches, 6) old process exits after final counters are merged.
Example: version N stores each NAT session as key, translated tuple, timeout, and counters. Version N+1 reads generation 42, sees schema 1 is supported, rebuilds its per-lcore cache from the shared table, installs the required hardware rte_flow rules, and then becomes active.
Minimal C Demo — Shared State Handoff
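A minimal sketch of the handoff described above, written in plain C so it compiles without DPDK. It assumes a static buffer standing in for the named memzone (in a real application the old process would obtain the region with rte_memzone_reserve() and the new one with rte_memzone_lookup()). The struct layouts, the magic value, and the helper names sess_export() and sess_import() are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SESS_MAGIC      0x53455353u  /* "SESS" */
#define SESS_SCHEMA_MAX 1u           /* newest layout this binary can read */
#define SESS_CAPACITY   4u

/* Hypothetical NAT session record: key, translated tuple, timeout, counters. */
struct nat_session {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint32_t nat_ip;
    uint16_t nat_port;
    uint16_t timeout_s;
    uint64_t packets;
};

/* Header plus payload placed at the start of the shared region. */
struct sess_zone {
    uint32_t magic;
    uint32_t schema;      /* layout version written by the exporter */
    uint64_t generation;  /* bumped on every completed export */
    uint32_t count;
    struct nat_session entries[SESS_CAPACITY];
};

/* Stand-in for a named memzone in hugepage memory (assumption). */
static struct sess_zone g_zone;

/* Old process: copy sessions first, write the magic last.
 * Single-threaded sketch; a real cross-process publish also needs
 * release/acquire ordering around the final store. */
static void sess_export(struct sess_zone *z, const struct nat_session *s,
                        uint32_t n, uint32_t schema, uint64_t generation)
{
    assert(n <= SESS_CAPACITY);
    memcpy(z->entries, s, n * sizeof(*s));
    z->count = n;
    z->schema = schema;
    z->generation = generation;
    z->magic = SESS_MAGIC;
}

/* New process: validate the header before trusting any entry.
 * Returns the number of imported sessions, or -1 if incompatible. */
static int sess_import(const struct sess_zone *z, struct nat_session *out,
                       uint32_t out_cap, uint64_t *generation)
{
    if (z->magic != SESS_MAGIC)
        return -1;
    if (z->schema == 0 || z->schema > SESS_SCHEMA_MAX)
        return -1;
    if (z->count > SESS_CAPACITY || z->count > out_cap)
        return -1;
    memcpy(out, z->entries, z->count * sizeof(*out));
    *generation = z->generation;
    return (int)z->count;
}
```

The importer rejects an unknown schema instead of guessing at the layout, which matches the risk listed in the state table: a layout mismatch is safer as a failed upgrade than as corrupted forwarding decisions.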
17. § 9.16 — Session Management: Fast Path & Slow Path
Overview — First Packet Pays, Later Packets Hit Cache
Session management turns expensive per-packet decisions into a cached forwarding action. The first packet of a flow misses the table and goes through ACL, routing, NAT, and policy logic. The created session stores the 5-tuple, translated tuple, output port, timeout, and QoS class so later packets run through a per-lcore hash lookup and transmit immediately.
Slow Path — Build the Cached Decision
On a session miss, the packet leaves the hot path. The slow path may run on the same lcore for simple gateways or be enqueued to a control-plane worker through an rte_ring. The important rule is that session publication is controlled: per-lcore tables avoid locks, while global tables need RCU or writer-side locks so readers never see half-initialized state.
Key Data Structures — Session Table
| Field / Design | Purpose | Hot-path note |
|---|---|---|
| 5-tuple key | src/dst IP, src/dst port, protocol | Exact key for rte_hash or cuckoo table lookup |
| cached action | output port, NAT tuple, VLAN/VXLAN rewrite, QoS class | Avoids rerunning ACL and route lookup per packet |
| timeout / last_seen | idle aging and garbage collection | Update lazily or per-burst to avoid cache-line writes every packet |
| per-lcore table | one table per PMD lcore | No locks; flow stickiness depends on RSS or software steering |
| global RCU table | shared state across lcores | Readers are lock-free; writers build entry then publish pointer |
| timer wheel | bucket sessions by expiry tick | Aging cost is spread across ticks instead of full-table scans |
Core Mechanism — Fast/Slow Path Walkthrough
Background: A NAT gateway receives a new TCP flow. Running ACL, route lookup, and NAT allocation for every packet would waste the line-rate budget.
Plan: 1) hash the packet 5-tuple, 2) on hit apply the cached action, 3) on miss send the first packet through ACL, LPM, and NAT allocation, 4) publish the session entry, 5) send later packets through the fast path.
Example: packet 1 for 10.0.0.1:12345 → 203.0.113.8:443 misses, gets NATed to 10.0.0.9:40000, and creates a session. Packet 2 hashes to the same entry and only applies the cached rewrite and output port.
Minimal C Demo — Session Miss Then Fast-Path Hit
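A minimal sketch of the miss-then-hit walkthrough, assuming a small per-lcore table with linear probing and an FNV-1a hash in place of rte_hash. The slow path is reduced to a NAT port allocation; ACL and LPM are omitted. All names (flow_key, sess_table, sess_process, and so on) are hypothetical, and the table never deletes entries, which is what keeps the simple probe rule valid.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SESS_SLOTS 64u  /* power of two so the hash can be masked */

struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
    uint8_t  pad[3];  /* explicit padding: callers must zero it */
};

struct session {
    struct flow_key key;
    uint32_t nat_ip;
    uint16_t nat_port;
    uint16_t out_port;
    uint64_t hits;
    uint8_t  in_use;
};

/* One table per lcore, so no locks are needed (assumption of this sketch). */
struct sess_table {
    struct session slot[SESS_SLOTS];
    uint16_t next_nat_port;
    uint64_t slow_path_runs;
};

/* FNV-1a over the 16 key bytes; a stand-in for the rte_hash function. */
static uint32_t key_hash(const struct flow_key *k)
{
    const uint8_t *p = (const uint8_t *)k;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*k); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Fast path: exact-match probe. An empty slot ends the probe chain. */
static struct session *sess_lookup(struct sess_table *t, const struct flow_key *k)
{
    uint32_t i = key_hash(k) & (SESS_SLOTS - 1);
    for (uint32_t n = 0; n < SESS_SLOTS; n++, i = (i + 1) & (SESS_SLOTS - 1)) {
        struct session *s = &t->slot[i];
        if (!s->in_use)
            return NULL;
        if (memcmp(&s->key, k, sizeof(*k)) == 0)
            return s;
    }
    return NULL;
}

/* Slow path: allocate a NAT port and publish the entry (in_use set last). */
static struct session *sess_slow_path(struct sess_table *t, const struct flow_key *k)
{
    uint32_t i = key_hash(k) & (SESS_SLOTS - 1);
    for (uint32_t n = 0; n < SESS_SLOTS; n++, i = (i + 1) & (SESS_SLOTS - 1)) {
        struct session *s = &t->slot[i];
        if (s->in_use)
            continue;
        s->key = *k;
        s->nat_ip = 0x0a000009u;           /* 10.0.0.9, as in the example */
        s->nat_port = t->next_nat_port++;
        s->out_port = 1;
        s->hits = 0;
        s->in_use = 1;
        t->slow_path_runs++;
        return s;
    }
    return NULL;  /* table full */
}

/* Per-packet entry point: hit uses the cached action, miss builds it. */
static struct session *sess_process(struct sess_table *t, const struct flow_key *k)
{
    struct session *s = sess_lookup(t, k);
    if (s == NULL)
        s = sess_slow_path(t, k);
    if (s != NULL)
        s->hits++;
    return s;
}
```

With next_nat_port initialized to 40000, the first packet of 10.0.0.1:12345 → 203.0.113.8:443 runs the slow path once and is assigned port 40000; the second packet of the same flow returns the same entry without touching the slow path.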
18. § 9.17 — QoS in DPDK
Overview — Shape, Police, and Schedule Bursts
QoS in a DPDK datapath usually has two layers. rte_meter colors packets with token bucket policers such as srTCM or trTCM. rte_sched then schedules traffic hierarchically: port, subport, pipe, traffic class, and queue. This lets one process enforce tenant, VM, or service-class limits before packets reach the TX ring.
Token Bucket Policing
A token bucket adds tokens at a configured rate and caps them at the burst size. A packet consumes tokens equal to its length. srTCM and trTCM turn that check into colors: green packets conform, yellow packets exceed committed rate but may pass at lower priority, and red packets violate policy and are dropped or remarked.
Key Data Structures — Meter and Scheduler
| Object | Fields / API | Purpose |
|---|---|---|
| rte_meter_srtcm | CIR, CBS, EBS, color-aware mode | Single-rate three-color marker |
| rte_meter_trtcm | CIR/PIR plus committed/peak burst sizes | Two-rate policing for committed and peak traffic |
| rte_sched_port | subports, pipes, traffic classes, queues | Hierarchical scheduler state |
| pipe profile | rate, TC period, queue sizes, WFQ weights | Per-tenant or per-VM shaping policy |
| rte_sched_port_enqueue/dequeue | enqueue mbufs, dequeue eligible packets | Backpressure and scheduling before tx_burst |
Core Mechanism — Per-VM Rate Limit
Background: A host has many VMs behind OVS-DPDK. One VM must not consume the entire 25G port during a burst.
Plan: 1) classify packet to VM or tenant, 2) run its token bucket, 3) drop or mark red packets, 4) enqueue green/yellow packets into the scheduler pipe, 5) let rte_sched choose eligible packets for TX.
Example: VM A has a 1 Gbps committed rate and 64 KB burst. A 9 KB packet passes while tokens are available. Once the bucket drains, later packets turn yellow or red until tokens refill.
Minimal C Demo — Token Bucket Colors
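A minimal sketch of a color-blind single-rate three-color marker in plain C, following the token bucket description above: the committed bucket refills first and overflow spills into the excess bucket. It is not the rte_meter implementation; the struct srtcm type, the function names, and the choice of an excess burst equal to the committed burst are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_S 1000000000ull

enum color { GREEN, YELLOW, RED };

/* Hypothetical meter state; rates in bytes per second, bursts in bytes. */
struct srtcm {
    uint64_t cir_Bps;   /* committed information rate */
    uint64_t cbs, ebs;  /* committed / excess burst sizes */
    uint64_t tc, te;    /* tokens currently in each bucket */
    uint64_t last_ns;   /* time of the last refill */
};

static void srtcm_init(struct srtcm *m, uint64_t cir_Bps,
                       uint64_t cbs, uint64_t ebs, uint64_t now_ns)
{
    m->cir_Bps = cir_Bps;
    m->cbs = cbs;
    m->ebs = ebs;
    m->tc = cbs;  /* both buckets start full */
    m->te = ebs;
    m->last_ns = now_ns;
}

/* Add tokens for the elapsed time: fill C first, spill the rest into E.
 * Sub-byte remainders are dropped, which is acceptable for a sketch. */
static void srtcm_refill(struct srtcm *m, uint64_t now_ns)
{
    uint64_t dt = now_ns - m->last_ns;
    uint64_t add = (dt / NS_PER_S) * m->cir_Bps
                 + (dt % NS_PER_S) * m->cir_Bps / NS_PER_S;
    m->last_ns = now_ns;

    uint64_t room_c = m->cbs - m->tc;
    if (add <= room_c) {
        m->tc += add;
        return;
    }
    m->tc = m->cbs;
    add -= room_c;

    uint64_t room_e = m->ebs - m->te;
    m->te += (add < room_e) ? add : room_e;
}

/* Color one packet; red packets consume no tokens. */
static enum color srtcm_color_blind(struct srtcm *m, uint64_t now_ns, uint32_t len)
{
    srtcm_refill(m, now_ns);
    if (m->tc >= len) {
        m->tc -= len;
        return GREEN;
    }
    if (m->te >= len) {
        m->te -= len;
        return YELLOW;
    }
    return RED;
}
```

For the VM A example, 1 Gbps is 125,000,000 bytes per second. With a 64 KB committed burst and back-to-back 9000-byte packets, seven packets fit in the committed bucket before the meter falls through to the excess bucket, and 72 µs of idle time earns back exactly one more 9000-byte packet.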
19. § 9.18 — DPDK Bonding & Link Aggregation
Overview — One Logical Port Over Multiple NICs
rte_eth_bond exposes multiple physical ports as one logical ethdev. Applications call normal ethdev APIs on the bond port, while the bonding PMD chooses a slave port according to the configured mode. Active-backup is common for redundancy; 802.3ad/LACP is common when the upstream switch participates in link aggregation.
Key Modes
| Mode | Behavior | Use case |
|---|---|---|
| round-robin | Transmit packets across slaves in order | Lab aggregation; can reorder flows |
| active-backup | One active slave; fail over when link goes down | Redundancy without switch hashing |
| balance XOR | Hash packet fields to select slave | Flow-stable load distribution |
| 802.3ad / LACP | Negotiated aggregation with upstream switch | Production bandwidth plus redundancy |
| broadcast | Send every packet on every slave | Special redundancy cases, expensive |
Core Mechanism — Active-Backup Failover
Background: A virtual switch needs to survive one NIC or cable failure without changing the application datapath.
Plan: 1) create a bond ethdev, 2) add two slave ports, 3) set a primary slave, 4) poll link state, 5) switch active slave on failure, 6) keep the application using the same logical port id.
Example: port 0 is active and port 1 is standby. If port 0 link drops, the bond PMD marks port 1 active. The app still transmits to bond port 7; only the bond PMD changes which physical TX queue receives descriptors.
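The failover decision can be sketched as a small state machine in plain C. This is a model of the selection logic only, not the bonding PMD: the struct bond type and the bond_* helpers are hypothetical, and the policy of returning to the primary as soon as its link recovers is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define BOND_MAX_MEMBERS 4
#define BOND_NO_PORT     UINT16_MAX

/* Hypothetical model of an active-backup bond. */
struct bond {
    uint16_t bond_port_id;               /* logical port id seen by the app */
    uint16_t member[BOND_MAX_MEMBERS];   /* physical port ids */
    uint8_t  link_up[BOND_MAX_MEMBERS];
    int      n_members;
    int      primary;                    /* index into member[] */
    int      active;                     /* index into member[], or -1 */
};

static void bond_init(struct bond *b, uint16_t bond_port_id)
{
    b->bond_port_id = bond_port_id;
    b->n_members = 0;
    b->primary = 0;
    b->active = -1;
}

/* Prefer the primary, else keep a healthy active member, else first link-up. */
static void bond_reselect(struct bond *b)
{
    if (b->n_members == 0) {
        b->active = -1;
        return;
    }
    if (b->link_up[b->primary]) {
        b->active = b->primary;
        return;
    }
    if (b->active >= 0 && b->link_up[b->active])
        return;
    b->active = -1;
    for (int i = 0; i < b->n_members; i++) {
        if (b->link_up[i]) {
            b->active = i;
            break;
        }
    }
}

static int bond_add_member(struct bond *b, uint16_t port_id, int up)
{
    if (b->n_members >= BOND_MAX_MEMBERS)
        return -1;
    b->member[b->n_members] = port_id;
    b->link_up[b->n_members] = (uint8_t)(up != 0);
    b->n_members++;
    bond_reselect(b);
    return 0;
}

/* Called from link-state polling when a member changes state. */
static void bond_link_event(struct bond *b, uint16_t port_id, int up)
{
    for (int i = 0; i < b->n_members; i++) {
        if (b->member[i] == port_id)
            b->link_up[i] = (uint8_t)(up != 0);
    }
    bond_reselect(b);
}

/* Physical port that would receive TX descriptors right now. */
static uint16_t bond_tx_port(const struct bond *b)
{
    return (b->active < 0) ? BOND_NO_PORT : b->member[b->active];
}
```

The application keeps addressing bond_port_id throughout; only the value returned by bond_tx_port() changes when a link event arrives.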
20. § 9.19 — OVS-DPDK: Open vSwitch with DPDK Datapath
Overview — OpenFlow Control Plane, DPDK Packet Path
OVS-DPDK keeps the OVS control plane but replaces the kernel datapath with userspace PMD threads. ovs-vswitchd owns bridges and flow programming, while netdev-dpdk ports poll physical NICs and vhost-user sockets with DPDK. The fast path is a cascade of caches: EMC, datapath classifier, and finally the full OpenFlow pipeline.
Port Types — Physical, VM, Patch, Internal, Tunnel
OVS-DPDK bridges combine several port types: dpdk for physical NICs, dpdkvhostuser for VM virtio backends, patch ports between bridges, internal ports for host networking, and tunnel ports for VXLAN or Geneve encap/decap in userspace.
Key Data Structures — OVS Datapath Caches
| Layer | Role | Performance note |
|---|---|---|
| EMC | Exact Match Cache per PMD thread | Fastest path for repeated 5-tuples; avoids tuple-space search |
| dpcls | Datapath classifier / tuple-space search | Matches megaflows with masks and cached actions |
| ofproto | Full OpenFlow pipeline | Slow path for first packet or cache miss; installs megaflow |
| megaflow | Wildcarded cached flow | Compresses many exact flows under one masked rule |
| PMD thread | Polls RX queues and vhost-user ports | Pin with pmd-cpu-mask and align queues to cores |
Core Mechanism — OVS-DPDK Flow Lookup
Background: A VM packet enters a vhost-user port. OVS must decide whether to forward to another VM, a physical NIC, a tunnel, or the OpenFlow slow path.
Plan: 1) PMD receives a burst, 2) check EMC for exact-flow hit, 3) on miss check dpcls megaflows, 4) on miss run ofproto, 5) install or update a megaflow, 6) execute actions such as output, drop, recirculate, or VXLAN encap.
Example: the first packet from VM A to VM B misses EMC and dpcls, so ofproto runs the OpenFlow table and installs a megaflow. The next packet with matching masked fields hits dpcls; repeated packets from the same exact tuple hit EMC.
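The three-level lookup can be illustrated with a toy model in plain C. This is not OVS code: the struct names, the direct-mapped EMC, the linear megaflow scan, and the toy "OpenFlow table" that forwards on destination IP only are all assumptions. The sketch also inserts into the EMC on every miss, whereas OVS inserts probabilistically.

```c
#include <assert.h>
#include <stdint.h>

#define EMC_SLOTS 16u  /* power of two */
#define MF_MAX    8u

struct pkt_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

enum hit_layer { HIT_EMC, HIT_DPCLS, HIT_OFPROTO };

struct emc_entry { struct pkt_key key; int action; int valid; };
struct megaflow  { struct pkt_key mask, match; int action; };

/* Hypothetical per-PMD datapath state. */
struct datapath {
    struct emc_entry emc[EMC_SLOTS];
    struct megaflow  mf[MF_MAX];
    unsigned n_mf;
};

static int key_eq(const struct pkt_key *a, const struct pkt_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port;
}

static struct pkt_key key_and(const struct pkt_key *k, const struct pkt_key *m)
{
    struct pkt_key r = {
        .src_ip = k->src_ip & m->src_ip,
        .dst_ip = k->dst_ip & m->dst_ip,
        .src_port = (uint16_t)(k->src_port & m->src_port),
        .dst_port = (uint16_t)(k->dst_port & m->dst_port),
    };
    return r;
}

static unsigned emc_slot(const struct pkt_key *k)
{
    uint32_t h = k->src_ip ^ k->dst_ip ^ k->src_port ^ ((uint32_t)k->dst_port << 4);
    return h & (EMC_SLOTS - 1);
}

/* Toy slow path: output port is the low byte of dst_ip; the installed
 * megaflow wildcards everything except dst_ip. */
static int ofproto_upcall(struct datapath *dp, const struct pkt_key *k)
{
    int action = (int)(k->dst_ip & 0xffu);
    if (dp->n_mf < MF_MAX) {
        struct megaflow *m = &dp->mf[dp->n_mf++];
        m->mask = (struct pkt_key){ .dst_ip = 0xffffffffu };
        m->match = key_and(k, &m->mask);
        m->action = action;
    }
    return action;
}

/* EMC, then megaflows, then the slow path. Reports which layer answered. */
static enum hit_layer dp_lookup(struct datapath *dp, const struct pkt_key *k, int *action)
{
    struct emc_entry *e = &dp->emc[emc_slot(k)];
    if (e->valid && key_eq(&e->key, k)) {
        *action = e->action;
        return HIT_EMC;
    }

    enum hit_layer layer = HIT_OFPROTO;
    int act = -1;
    for (unsigned i = 0; i < dp->n_mf; i++) {
        struct pkt_key masked = key_and(k, &dp->mf[i].mask);
        if (key_eq(&masked, &dp->mf[i].match)) {
            act = dp->mf[i].action;
            layer = HIT_DPCLS;
            break;
        }
    }
    if (layer == HIT_OFPROTO)
        act = ofproto_upcall(dp, k);

    /* Cache the exact tuple so the next identical packet hits the EMC. */
    e->key = *k;
    e->action = act;
    e->valid = 1;
    *action = act;
    return layer;
}
```

In this model a first packet to 10.0.0.2 goes to the slow path and installs one megaflow; a packet to the same destination from a different source port is answered by the megaflow; a repeat of that exact tuple is answered by the EMC.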
21. Source Pointers (DPDK and OVS)
| File / Symbol | What it contains |
|---|---|
| lib/eal/linux/eal.c — rte_eal_init() | The 10-step EAL initialization sequence; calls sub-functions below |
| lib/eal/linux/eal_memory.c — rte_eal_hugepage_init() | mmap() hugetlbfs pages, build memseg list per NUMA socket |
| lib/eal/common/eal_common_lcore.c | Lcore thread creation, CPU affinity setup (pthread_setaffinity_np) |
| drivers/bus/pci/linux/pci.c — rte_pci_scan() | Enumerate /sys/bus/pci/devices/, match PCI IDs to PMD table |
| drivers/net/ixgbe/ixgbe_rxtx.c — ixgbe_recv_pkts() | ixgbe PMD rx_burst: DD bit scan, mbuf harvest, ring refill, doorbell |
| drivers/net/ixgbe/ixgbe_rxtx.c — ixgbe_xmit_pkts() | ixgbe PMD tx_burst: fill TX descs, TDT doorbell, tx_free_thresh cleanup |
| drivers/net/mlx5/mlx5_rx.c — mlx5_rx_burst() | Mellanox ConnectX PMD rx_burst using Completion Queue (CQ) model |
| lib/ethdev/rte_ethdev.h — rte_eth_rx_burst() (inline) | Dispatch: calls port->rx_pkt_burst function pointer (per-PMD hot path) |
| lib/ethdev/rte_ethdev.c — rte_eth_dev_rss_hash_update() | RSS hash configuration and RETA interaction through ethdev API |
| lib/ethdev/rte_flow.c — rte_flow_create() | Generic flow rule validation, creation, destruction API |
| lib/mempool/rte_mempool.c — rte_mempool_get_bulk() | Mempool cache refill/drain and central pool operations |
| lib/mbuf/rte_mbuf.c — rte_pktmbuf_pool_create() | Packet mbuf pool creation and mbuf object initialization |
| lib/eal/common/rte_malloc.c — rte_malloc_socket() | NUMA-aware memory allocation helpers for per-socket data |
| lib/eal/common/eal_common_thread.c — rte_socket_id() | Current lcore socket lookup used for locality validation |
| lib/ethdev/rte_ethdev.c — rte_eth_dev_socket_id() | NIC port NUMA socket lookup from ethdev metadata |
| lib/ring/rte_ring.c — rte_ring_create() | Ring allocation, size/mask setup, producer/consumer sync mode |
| lib/ring/rte_ring_elem_pvt.h — __rte_ring_do_enqueue_elem() | MP enqueue logic: reserve head, copy objects, publish tail |
| lib/eal/x86/include/rte_prefetch.h | Architecture-specific prefetch helpers mapping to CPU instructions |
| lib/lpm/rte_lpm.c — rte_lpm_add(), rte_lpm_lookup() | DIR-24-8 IPv4 longest prefix match table management and lookup |
| lib/lpm/rte_lpm6.c — rte_lpm6_lookup() | IPv6 multi-stride longest prefix match implementation |
| lib/acl/rte_acl.c — rte_acl_build(), rte_acl_classify() | ACL rule compilation and batch packet classification dispatch |
| lib/hash/rte_cuckoo_hash.c — rte_hash_lookup_data() | Cuckoo hash exact-match lookup for flow/session keys |
| lib/ethdev/rte_flow.c — rte_flow_create() | Generic hardware flow rule validation and creation |
| drivers/net/ixgbe/rte_pmd_ixgbe.c | Intel ixgbe PF helper APIs for VF MAC/VLAN controls |
| lib/ethdev/rte_ethdev.c — rte_eth_dev_info_get() | Advertised RX/TX offload capability discovery |
| lib/mbuf/rte_mbuf_core.h | mbuf offload flags, header length fields, packet metadata |
| lib/eal/x86/include/rte_vect.h | x86 SIMD vector typedefs and runtime CPU feature integration |
| lib/eal/x86/include/rte_memcpy.h | SIMD-optimized memory copy routines used by DPDK fast paths |
| drivers/net/virtio/virtqueue.h | Virtio ring data structures and queue helpers used by the virtio PMD |
| drivers/net/virtio/virtio_rxtx.c — virtio_recv_pkts() | Virtio PMD RX path walking virtqueue descriptors |
| lib/vhost/vhost_user.c | vhost-user protocol messages, memory table setup, vring configuration |
| lib/vhost/virtio_net.c — rte_vhost_dequeue_burst(), rte_vhost_enqueue_burst() | DPDK vhost datapath burst APIs for VM packet I/O |
| drivers/vdpa/* | vDPA device drivers that offload virtio datapath to hardware |
| lib/pipeline/rte_pipeline.c — rte_pipeline_run() | DPDK pipeline framework input ports, tables, and output ports |
| examples/l2fwd/main.c | Canonical run-to-completion forwarding loop |
| examples/ip_pipeline/ | DPDK pipeline-style datapath example |
| lib/eal/common/eal_common_proc.c | Multi-process communication primitives used by primary and secondary processes |
| lib/eal/common/eal_common_memzone.c — rte_memzone_lookup() | Named shared memory lookup for persistent state across processes |
| lib/hash/rte_cuckoo_hash.c | Exact-match session table mechanics used by fast-path flow caches |
| lib/rcu/rte_rcu_qsbr.c | Quiescent-state based RCU used for lock-free reader / controlled writer patterns |
| lib/timer/rte_timer.c | Timer management useful for session timeout and aging wheels |
| lib/meter/rte_meter.c | srTCM and trTCM token bucket color marking |
| lib/sched/rte_sched.c | Hierarchical QoS scheduler: port, subport, pipe, traffic class, queue |
| drivers/net/bonding/rte_eth_bond_pmd.c | Bonding PMD implementation and active-backup / balance mode datapath |
| lib/netdev-dpdk.c (OVS source) | OVS-DPDK netdev provider, PMD RX/TX, vhost-user integration |
| lib/dpif-netdev.c (OVS source) | OVS userspace datapath, EMC, dpcls lookup, PMD thread loop |
| lib/dpif-netdev-private*.h (OVS source) | OVS datapath cache and classifier data structures |
| drivers/bus/pci/linux/pci_uio.c | igb_uio BAR mmap: maps NIC register space into userspace process |
| drivers/bus/pci/linux/pci_vfio.c | VFIO device open, IOMMU group handling, DMA mapping via ioctl |
22. Interview Prep
| Question | Concise Answer |
|---|---|
| Why does kernel networking fail above ~1 Mpps? Name 5 bottlenecks. | 1) IRQ per packet (~2–5 µs handler + softirq). 2) sk_buff slab alloc per packet (~200 ns, cache miss). 3) copy_to_user() memcpy polluting cache. 4) recv() syscall context switch (~100–200 ns). 5) Lock contention in netfilter / routing table / socket hash under high PPS. At 10 GbE 64B frames, each packet gets only 67 ns — the kernel stack exceeds that budget. |
| Walk through rte_eal_init() — all 10 steps. | 1) Parse CLI (--lcores, --socket-mem). 2) Load PMD plugin .so files. 3) Read CPU topology from /proc/cpuinfo, build core+NUMA map. 4) mmap()+mlock() huge pages, build memseg list per NUMA socket. 5) rte_memzone_init() — named regions in huge pages. 6) Create one pthread per lcore, pin via pthread_setaffinity_np(). 7) Enumerate /sys/bus/pci/devices/, match PCI IDs to PMD table. 8) rte_pci_probe() → PMD eth_dev_init() for each matched NIC. 9) Start service cores (timer, interrupt). 10) rte_eal_mp_remote_launch() — worker functions start polling. |
| What is the DD bit and what is the doorbell in DPDK PMD? | DD (Descriptor Done): a status bit in each RX or TX descriptor that the NIC sets to 1 when it has finished with that slot (RX: packet DMA-written; TX: packet transmitted). The PMD polls DD instead of waiting for an interrupt. Doorbell: a write to a NIC BAR register (RDT for RX, TDT for TX) telling the NIC the new head/tail pointer — i.e., how many new buffers the PMD has made available. In DPDK, the doorbell is batched once per rx_burst/tx_burst call to amortize the PCIe transaction cost. |
| What is the difference between UIO and VFIO? When should you use VFIO? | UIO (igb_uio.ko): exposes /dev/uioN; DPDK mmap()s the BARs directly. There is no IOMMU protection: the NIC is programmed with physical addresses, so a bug lets it DMA anywhere in memory. Requires root. VFIO (vfio-pci.ko): groups the device by IOMMU group; DPDK registers hugepage memory with the IOMMU via the VFIO_IOMMU_MAP_DMA ioctl, so the NIC can only DMA into those regions. Preferred for production (SR-IOV VFs, containers, secure multi-tenant environments). Use VFIO whenever the system has an IOMMU and you care about isolation. |
| Why does rte_eth_rx_burst() process packets in batches of 32? | Three reasons: 1) Amortize the doorbell write: one PCIe RDT update per 32 packets costs ~3 ns/pkt vs ~100 ns/pkt for per-packet updates. 2) Prefetch pipeline: prefetching 3–4 mbufs ahead while processing the current one hides DRAM latency; 32 fits without L1 overflow. 3) SIMD alignment: 32 × 64-bit descriptors = 256 bytes, which is eight 256-bit AVX2 loads (or four 512-bit AVX-512 loads) to check the DD bits in one pass. |
| How does RSS map a packet to an RX queue? | The NIC hashes selected header fields, usually the 5-tuple, with a Toeplitz RSS key. The low bits of that hash index the RETA, and the RETA entry contains the RX queue id. Software can rebalance by changing RETA entries. The same flow maps to the same queue, preserving packet order inside the flow. |
| When do you use rte_flow instead of RSS? | Use RSS for broad load distribution across queues. Use rte_flow when you need exact steering or actions: send tenant X to queue 7, drop a port, mark packets, count matches, or offload tunnel encap/decap. rte_flow rules are validated against NIC capabilities; unsupported patterns may fail or fall back to software depending on PMD. |
| Why does DPDK require huge pages and mempools? | Huge pages reduce TLB pressure and provide pinned DMA memory registered with VFIO/IOMMU. Mempools avoid malloc/free on the hot path by pre-allocating fixed-size mbufs. Per-lcore caches serve most allocations without atomics; the central ring is touched only on bulk refill/drain. |
| Explain rte_mbuf fields that matter in RX/TX. | buf_addr is the virtual base, buf_iova is the DMA address used in descriptors, data_off points to packet start after headroom, data_len is bytes in this segment, pkt_len is total packet length across chained segments, next links multi-segment packets, ol_flags carries checksum/TSO/VLAN offload state, and packet_type carries parser results from the NIC or PMD. |
| What is NUMA-aware programming in DPDK? What happens if you violate it? | Place the NIC, mempool, descriptor rings, polling lcore, and hot per-lcore data on the same socket. Use rte_eth_dev_socket_id(port), rte_socket_id(), and socket_id arguments to enforce it. If a socket-0 NIC is polled by a socket-1 lcore, packet bytes and mbuf metadata cross UPI/QPI every packet, often causing a 30–50% throughput drop. |
| Explain the rte_ring multi-producer enqueue algorithm. | A producer loads prod.head and cons.tail, checks free space, computes new_head, then CAS-es prod.head from old_head to new_head to reserve slots. It copies objects into ring[index & mask], issues a write barrier, spins until prod.tail equals old_head so predecessors publish first, then stores prod.tail = new_head. That split between head reservation and tail publication preserves FIFO order without a mutex. |
| Why are prod and cons fields in rte_ring cache-line aligned? | Producers frequently write prod.head/prod.tail and consumers frequently write cons.head/cons.tail. If those fields share one cache line, every enqueue/dequeue bounces the line between cores. __rte_cache_aligned separates them so producer writes do not invalidate consumer-owned cache lines. |
| How does DPDK use prefetch and DDIO in the receive path? | DDIO lets the NIC DMA packet bytes into LLC instead of only DRAM. The PMD then uses rte_prefetch0() a few mbufs ahead, commonly 3–4, so metadata and packet data are pulled into L1 before parsing. The goal is to overlap cache miss latency with useful work on current packets. |
| What is DIR-24-8 in rte_lpm, and why is it fast? | DIR-24-8 uses the top 24 IPv4 bits as a direct table index. Prefixes /0 through /24 return in one memory access. More specific /25 through /32 prefixes use a second 256-entry tbl8 group indexed by the low 8 bits. That makes common route lookup one or two dependent loads instead of walking a long trie. |
| Compare rte_acl, rte_lpm, and rte_hash. | rte_lpm is longest-prefix route lookup, usually destination IP to next hop. rte_acl is ordered multi-field rule classification over fields like IPs, ports, protocol, and priority. rte_hash is exact-match lookup for full keys such as a 5-tuple session. In a datapath, a miss may go ACL -> LPM -> NAT -> create exact session in rte_hash. |
| How does SR-IOV differ from multi-queue on one PF? | Multi-queue gives one PCI function many RX/TX queues, usually owned by one DPDK process. SR-IOV creates separate PCI Virtual Functions, each with its own BDF, queues, and VFIO/IOMMU isolation. The PF configures policy such as VF MAC/VLAN filters, while each VF can be passed to a VM or container for near-native DMA. |
| How do checksum and TSO offloads work in DPDK? | The application enables port offload capabilities, then marks each mbuf with ol_flags and sets l2_len/l3_len/l4_len. For TSO it also sets tso_segsz. The PMD writes TX descriptors containing those fields, and the NIC segments large TCP payloads into MTU-sized frames and computes IPv4/TCP checksums per frame. |
| Where does SIMD help in DPDK? | SIMD helps wherever the same operation is repeated across a burst: ACL DFA state transitions, hash bucket signature comparison, LPM batch lookup, checksum arithmetic, descriptor DD-bit checks, and rte_memcpy. DPDK probes CPU flags at runtime and selects SSE4.2, AVX2, AVX-512, or scalar implementations. |
| Explain virtio split virtqueue and vhost-user. | A split virtqueue has a descriptor table, an available ring written by the guest driver, and a used ring written by the backend. vhost-user moves the backend into userspace: QEMU sends vring addresses, memory tables, and eventfds over a Unix socket, while DPDK reads and writes guest buffers through shared hugepage memory using rte_vhost_dequeue_burst() and rte_vhost_enqueue_burst(). |
| How is vDPA different from vhost-user? | vhost-user uses a software DPDK backend to walk virtio descriptors. vDPA keeps the guest-facing virtio interface but offloads the data path to hardware: the NIC DMA-reads and writes guest virtio rings directly, while software handles control plane, feature negotiation, and migration hooks. |
| Compare run-to-completion and pipeline models. | R2C keeps rx_burst, parsing, classification, modification, and tx_burst on one lcore. It is simple and cache-local, best for forwarding/NAT/firewall fast paths. Pipeline splits stages across lcores connected by rte_ring; it adds handoff latency and cache misses but lets expensive stages like crypto or DPI scale independently. |
| How would you implement a DPDK hot upgrade? | Store durable state such as sessions and flow tables in named shared memzones with schema versions and generation numbers. Start the new binary as a secondary attached to the same --file-prefix hugepages, validate and import state, warm caches and hardware flow rules, stop admission on the old process, drain rings, swap traffic ownership, merge counters, then exit the old process. |
| Explain DPDK session fast path and slow path. | The fast path hashes the packet 5-tuple into a per-lcore or RCU-protected session table. On hit it applies cached actions such as NAT rewrite, output port, and QoS class, then transmits. On miss the first packet goes through ACL, route lookup, NAT allocation, and policy; the result is published as a session so later packets avoid the expensive work. |
| How would you age sessions without hurting packet throughput? | Avoid writing shared timeout fields on every packet. Use lazy refresh, per-lcore counters, or timestamp updates once per burst/window. Put sessions into a timer wheel or expiry buckets so cleanup scans only the current bucket. With global tables, delete through RCU or a quiescent-state scheme so readers never see freed entries. |
| How do rte_meter and rte_sched differ? | rte_meter is a policer/marker: it uses token bucket logic such as srTCM or trTCM to color packets green, yellow, or red. rte_sched is a hierarchical scheduler: packets are enqueued into port/subport/pipe/traffic-class/queue levels, then dequeued according to rate, priority, and WFQ rules before tx_burst. |
| When would you use DPDK bonding active-backup versus 802.3ad? | Use active-backup when you need simple NIC redundancy without depending on switch-side aggregation; only one slave carries traffic and failover switches to the standby. Use 802.3ad/LACP when the upstream switch participates and you want both redundancy and aggregate bandwidth with flow-stable hashing. |
| Describe the OVS-DPDK lookup pipeline. | A PMD thread receives a burst from a dpdk or dpdkvhostuser port. It first checks EMC, the per-PMD exact match cache. On miss it checks dpcls tuple-space megaflows. On dpcls miss it runs the ofproto/OpenFlow slow path, computes actions, and installs or updates a megaflow. Later exact repeated tuples hit EMC; related wildcarded traffic hits dpcls. |
| What OVS-DPDK tuning knobs matter in production? | Pin PMD threads with other_config:pmd-cpu-mask, align RX queues with options:n_rxq and NUMA locality, use huge pages and isolated CPUs, tune EMC insertion probability for flow churn, keep vhost-user ports on local NUMA nodes, and verify queue-to-PMD assignment with ovs-appctl dpif-netdev/pmd-rxq-show. |