Part IX — DPDK

§ 9.1 – 9.19 DPDK: Kernel Bypass · EAL · PMD · RSS · Mbuf · NUMA · rte_ring · ACL/LPM · SR-IOV · Offload · SIMD · virtio · Pipeline · Hot Upgrade · Session · QoS · Bonding · OVS-DPDK

Why kernel networking fails at line rate (§9.1) · EAL startup 10-step flow (§9.2) · PMD RX/TX rings (§9.3) · multi-queue RSS (§9.4) · huge pages and mbufs (§9.5) · NUMA locality, lock-free rings, cache optimization, hardware classification, SR-IOV, NIC offloads, SIMD batching, virtio/vhost-user, R2C/pipeline models, hot upgrade, session fast/slow paths, QoS, bonding, and OVS-DPDK (§9.6–§9.19)

1. Overview

DPDK (Data Plane Development Kit) removes the main per-packet overheads of the Linux kernel networking stack: interrupts, system calls, sk_buff allocation, copy_to_user(), and lock contention in netfilter and the routing table. Instead, a Poll Mode Driver (PMD) runs entirely in userspace, maps NIC registers directly via UIO or VFIO, and spins in a tight loop checking the DD (Descriptor Done) bit of each descriptor: no interrupts, no syscalls, no copies.

The result: from roughly 1 Mpps per core through the kernel stack to 14.88 Mpps, which is line rate for 64-byte frames on a single 10 GbE port, using a single core.

2. § 9.1 — Why DPDK: Kernel Bypass Architecture

The Six Kernel Networking Bottlenecks

At 10 Gbps with 64-byte frames, the NIC delivers 14.88 million packets per second. Each packet gets only 67 nanoseconds of CPU time. The Linux kernel path burns most of that budget before the application even sees the data.

| Bottleneck | Cost | DPDK elimination |
| --- | --- | --- |
| IRQ per packet | ~2–5 µs total (handler + softirq schedule) | Poll mode: DD bit check, zero interrupts |
| sk_buff allocation | slab allocator per packet, ~200 ns, cache miss | Pre-allocated mbuf pool in huge pages, O(1) from per-lcore cache |
| copy_to_user() | memcpy kernel→user, pollutes cache | Packet stays in huge-page mbuf, app reads in place: zero copy |
| recv() syscall | context switch ~100–200 ns | No syscall: PMD loop is pure userspace |
| Lock contention | netfilter, routing table, socket hash under high PPS | No kernel stack at all: app owns the data path |
| Cache pollution | kernel stack traversal touches many cold lines | Huge pages + DDIO → packets land in LLC before CPU reads them |
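The 14.88 Mpps and 67 ns figures quoted above follow from Ethernet framing arithmetic. The sketch below reproduces them, assuming each frame also occupies 20 bytes on the wire for preamble, start-of-frame delimiter, and inter-frame gap. The function names are illustrative and not part of DPDK.

```c
#include <stdint.h>

/* Per-frame wire overhead: 7 B preamble + 1 B SFD + 12 B inter-frame gap. */
#define WIRE_OVERHEAD_BYTES 20u

/* Packets per second at a given link speed; frame_bytes includes the 4 B FCS. */
static double line_rate_pps(double link_bps, uint32_t frame_bytes)
{
    double bits_on_wire = (double)(frame_bytes + WIRE_OVERHEAD_BYTES) * 8.0;
    return link_bps / bits_on_wire;
}

/* Time budget per packet in nanoseconds when one core serves the whole port. */
static double per_packet_budget_ns(double link_bps, uint32_t frame_bytes)
{
    return 1e9 / line_rate_pps(link_bps, frame_bytes);
}
```

For a 10 Gbps link and 64-byte frames, line_rate_pps(10e9, 64) evaluates to about 14.88 million and per_packet_budget_ns(10e9, 64) to about 67.2.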

UIO vs VFIO vs AF_XDP — Bypass Mechanisms Compared

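The NIC must either be detached from its kernel driver and handed to userspace (UIO, VFIO) or shared with the kernel through a fast socket path (AF_XDP). Each mechanism has different security and performance trade-offs; the table also lists the af_packet PMD as a non-bypass baseline.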

| Mechanism | Module | IOMMU | Root needed | Container-safe | Best for |
| --- | --- | --- | --- | --- | --- |
| UIO (igb_uio) | igb_uio.ko | No: DMA unrestricted | Yes | No | Dev/test, trusted bare-metal |
| VFIO (vfio-pci) | vfio-pci.ko | Yes: IOMMU group isolation | Yes (or CAP_SYS_ADMIN) | Yes | Production, SR-IOV VFs, containers |
| AF_XDP (kernel) | built-in | Kernel handles DMA | No (CAP_NET_RAW) | Yes | Keep kernel features, selectively accelerate |
| Kernel PMD (af_packet) | built-in | N/A | No | Yes | Debug, low PPS, compatibility |

How VFIO Works

VFIO groups devices by IOMMU group: the smallest set of devices the IOMMU can isolate from the rest of the system, so all devices in a group are assigned together. The workflow: unbind the NIC from its kernel driver → echo vfio-pci > /sys/bus/pci/.../driver_override → DPDK EAL opens /dev/vfio/<group> → calls the VFIO_IOMMU_MAP_DMA ioctl to register hugepage memory with the IOMMU → NIC can only DMA into registered regions. UIO skips the IOMMU entirely: it is simpler to set up, but a buggy (or malicious) userspace process can program the NIC to DMA to any physical address.

3. § 9.2 — DPDK Startup Flow: EAL Initialization

rte_eal_init() — 10-Step Sequence

Every DPDK application calls rte_eal_init(argc, argv) as its first act. This one call bootstraps the entire DPDK runtime: CPU topology, memory, devices, and worker threads. If it returns a negative number, the application must exit — the environment is not usable.

| CLI flag | Purpose |
| --- | --- |
| --lcores 0-3 | Use logical cores 0, 1, 2, 3; EAL creates one pthread per lcore |
| --socket-mem 4096,4096 | Allocate 4 GB of huge pages on each of NUMA sockets 0 and 1 (values are in MB) |
| --file-prefix myapp | Namespace for hugepage files; allows multiple DPDK instances on the same host |
| --proc-type primary | This process owns hugepages and devices; secondary processes attach later |
| --proc-type secondary | Attach to an existing primary's shared memory (hot upgrade pattern) |
| --allow 01:00.0 | Allow-list (probe) only this PCI device; all others are ignored |
| --vdev net_ring0 | Create a virtual device (ring PMD); useful for testing without a real NIC |

Multi-Process Mode — Shared Hugepage Memory

DPDK supports running multiple cooperating processes on one host. The primary process allocates hugepages and initializes devices; one or more secondary processes attach by mmap()-ing the same hugepage files. Data structures stored in named rte_memzone regions (mempools, rings, flow tables) are accessible from both at the same virtual addresses, because a secondary process maps the hugepage files at the addresses the primary used (the --base-virtaddr EAL option can help when those addresses are not free). This is the foundation of DPDK hot upgrade.

Minimal C Demo — EAL Startup Simulation

Real rte_eal_init() requires DPDK libraries and hardware. This simulation traces the same 10 steps in plain C so you can follow the sequence mentally.

EAL Initialization — 10-step startup simulation — C Demo

4. § 9.3 — Poll Mode Driver (PMD) — Deep Dive

PMD Initialization Sequence

After EAL init, the application configures each NIC port in three steps: set queue counts and offload flags, allocate descriptor rings, then start the device. Each step maps directly to a NIC register write via the mapped BAR.

| API call | What it does to the NIC |
| --- | --- |
| rte_eth_dev_configure(port, nb_rxq, nb_txq, &conf) | Writes NIC control registers: queue count, RSS enable, offload flags |
| rte_eth_rx_queue_setup(port, q, nb_desc, socket, &rxconf, mp) | Allocates desc ring (DMA-coherent), fills each desc with a mempool mbuf's iova |
| rte_eth_tx_queue_setup(port, q, nb_desc, socket, &txconf) | Allocates TX desc ring; sets tx_free_thresh (batch-free completed mbufs) |
| rte_eth_dev_start(port) | Enables NIC, configures MAC filter, enables RX/TX, links up |
| rte_eth_rx_burst(port, q, mbufs, 32) | Hot path: scans DD bits, harvests up to 32 mbufs, refills ring, rings doorbell |
| rte_eth_tx_burst(port, q, mbufs, n) | Hot path: fills TX descs with mbuf iovas, rings TDT doorbell, checks tx_free_thresh |

RX Descriptor Ring — NIC Fills, PMD Drains

The RX ring is a circular array of fixed-size descriptors in DMA-accessible memory (inside huge pages). The PMD pre-fills every slot with the physical address (buf_iova) of an empty mbuf from the mempool. When a packet arrives, the NIC DMA-writes the packet bytes into the pointed-to buffer and sets DD=1. The PMD polls, harvests completed descriptors, and immediately refills each slot with a fresh mbuf before ringing the doorbell (writing the new tail index to the RDT BAR register).
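The pre-fill, receive, harvest, and refill cycle described above can be sketched in plain C. This is a simulation under stated assumptions, not driver code: the descriptor layout, the nic_receive() stand-in for NIC DMA, and the next_iova counter that plays the role of the mempool are all hypothetical.

```c
#include <stdint.h>

#define RX_RING_SIZE 8u                     /* power of two, so wrap is a mask */
#define RX_RING_MASK (RX_RING_SIZE - 1u)

struct rx_desc {
    uint64_t buf_iova;  /* DMA address of the empty buffer */
    uint16_t len;       /* written by the (simulated) NIC on receive */
    uint8_t  dd;        /* Descriptor Done: 1 = NIC finished this slot */
};

struct rx_ring {
    struct rx_desc desc[RX_RING_SIZE];
    uint32_t nic_head;       /* simulated NIC write position */
    uint32_t next_to_clean;  /* next slot the PMD will poll */
    uint32_t tail;           /* last value "written" to the RDT doorbell */
    uint64_t next_iova;      /* stand-in for the mempool: next fresh buffer */
    uint32_t buf_size;
};

/* PMD setup: give every slot an empty buffer before enabling RX. */
static void rx_ring_prefill(struct rx_ring *r, uint64_t base_iova, uint32_t buf_size)
{
    r->nic_head = 0;
    r->next_to_clean = 0;
    r->next_iova = base_iova;
    r->buf_size = buf_size;
    for (uint32_t i = 0; i < RX_RING_SIZE; i++) {
        r->desc[i].buf_iova = r->next_iova;
        r->desc[i].len = 0;
        r->desc[i].dd = 0;
        r->next_iova += buf_size;
    }
    r->tail = RX_RING_MASK;
}

/* Simulated NIC: "DMA" one packet into the next slot. Returns 0 if the ring is full. */
static int nic_receive(struct rx_ring *r, uint16_t len)
{
    struct rx_desc *d = &r->desc[r->nic_head & RX_RING_MASK];
    if (d->dd)
        return 0;               /* PMD has not harvested this slot yet */
    d->len = len;
    d->dd = 1;
    r->nic_head++;
    return 1;
}

/* PMD hot path: harvest completed slots, refill each one, ring the doorbell once. */
static uint16_t rx_burst(struct rx_ring *r, uint16_t *lens, uint16_t max_pkts)
{
    uint16_t n = 0;
    while (n < max_pkts) {
        struct rx_desc *d = &r->desc[r->next_to_clean & RX_RING_MASK];
        if (!d->dd)
            break;                      /* NIC has not completed this slot */
        lens[n++] = d->len;
        d->buf_iova = r->next_iova;     /* refill with a fresh buffer */
        r->next_iova += r->buf_size;
        d->dd = 0;
        r->next_to_clean++;
    }
    if (n > 0)
        r->tail = (r->next_to_clean - 1u) & RX_RING_MASK;   /* one doorbell per burst */
    return n;
}
```

The single tail update at the end of rx_burst() is the point of the design: the doorbell cost is paid once per burst rather than once per packet.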

TX Descriptor Ring — PMD Fills, NIC Drains

TX is symmetric. The PMD fills each descriptor with the outgoing mbuf's buf_iova, the packet length, and command flags (EOP = end of packet, RS = report status, IFCS = insert CRC). It then writes the new tail to the TDT BAR register (the TX doorbell). The NIC DMA-reads the packet bytes over PCIe and transmits. The PMD reclaims completed descriptors (DD=1) in batches of tx_free_thresh (driver-dependent; 32 is a common default) to amortize the mempool free cost.

PMD RX Burst — Step-by-Step Code Path

Burst Design — Why 32 Packets?

rte_eth_rx_burst() returns at most nb_pkts packets per call, and DPDK applications conventionally pass 32. This is not arbitrary:

  • Amortizes the doorbell write: one PCIe transaction to update RDT costs ~100 ns; doing it once per 32 packets costs about 3 ns per packet amortized.
  • Fits in a cache line prefetch window: with a prefetch-ahead distance of 3–4, 32 descriptors keep the CPU pipeline full without exceeding L1 capacity.
  • Aligns with SIMD width: 32 × 64-bit descriptor status words = 256 bytes, which fills eight 256-bit AVX2 registers (or four 512-bit AVX-512 registers) for batch DD-bit checking.

Minimal C Demo — PMD RX Descriptor Ring

PMD RX ring — pre-fill, NIC receive, burst harvest — C Demo

Minimal C Demo — PMD TX Path

PMD TX path — fill descriptors, doorbell, batch-free — C Demo

5. § 9.4 — NIC Multi-Queue, RSS & Flow Classification

Overview — One Queue per Core, No Locks

Modern NICs expose dozens or hundreds of RX/TX queues. Each queue is an independent descriptor ring, so the standard DPDK design is queue-to-lcore ownership: lcore 0 polls queue 0, lcore 1 polls queue 1, and so on. No two cores mutate the same RX ring, the application avoids locks on the hot path, and packet ordering is preserved inside a flow because RSS sends the same 5-tuple to the same queue.

Key Data Structures — RSS RETA and Flow Rules

| Structure / API | Fields that matter | Purpose |
| --- | --- | --- |
| RSS key | 40 bytes on many NICs | Seed used by Toeplitz hash; changing it changes distribution |
| RSS hash input | src/dst IP, src/dst port, protocol | Stable flow identity; same 5-tuple maps to same queue |
| RETA | 128 or 512 entries, each entry = queue id | hash index → RX queue; lets software rebalance queues without changing hash |
| rte_flow pattern | eth, ipv4, tcp, udp, vxlan, masks | Exact or masked match over packet headers |
| rte_flow action | queue, rss, drop, mark, count, encap, decap | Hardware action taken before packet DMA reaches memory |

Core Mechanism — RSS Queue Selection

Background: A 100 Gbps NIC cannot push all packets through one RX ring. It must spread flows across cores without reordering packets inside one TCP connection.

Plan: 1) hash the packet 5-tuple in hardware, 2) mask the hash into the RSS indirection table, 3) read the selected queue id, 4) DMA the packet into that queue's descriptor ring, 5) let the owning lcore poll it.

Example: A TCP flow 10.0.0.1:12345 → 10.0.0.2:443 hashes to 0x91ab0025. If the RETA has 128 entries, index 0x25 is read; if that entry contains queue 3, every packet in that flow is DMA-written into RX queue 3 and processed by lcore 3.
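The hash-then-index step can be sketched in software. The block below is a minimal illustration, assuming a bitwise Toeplitz hash over the flow tuple bytes and a power-of-two RETA; real NICs do this in hardware, and the function names here are hypothetical rather than DPDK APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise Toeplitz hash: for every set input bit (MSB first), XOR in the
 * 32-bit window of the key that starts at that bit position. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t data_len)
{
    uint32_t hash = 0;
    uint32_t window;

    if (key_len < 4)
        return 0;
    window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
             ((uint32_t)key[2] << 8) | (uint32_t)key[3];

    for (size_t i = 0; i < data_len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            /* Slide the window one bit left, pulling in the next key bit. */
            window <<= 1;
            if (i + 4 < key_len && (key[i + 4] & (1u << bit)))
                window |= 1u;
        }
    }
    return hash;
}

/* RETA index: low bits of the hash. reta_size must be a power of two. */
static uint32_t reta_index(uint32_t hash, uint32_t reta_size)
{
    return hash & (reta_size - 1u);
}

/* Queue selection: the RETA entry at that index holds the RX queue id. */
static uint16_t reta_select_queue(const uint16_t *reta, uint32_t reta_size,
                                  uint32_t hash)
{
    return reta[reta_index(hash, reta_size)];
}
```

With the example hash 0x91ab0025 and a 128-entry table, reta_index() returns 0x25; rewriting reta[0x25] moves that bucket of flows to another queue without touching the hash key.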

rte_flow — Precise Hardware Classification

RSS is statistical load distribution: the hash is deterministic per flow, but the application does not choose which queue a given flow lands on. rte_flow is explicit steering: match a packet pattern, then run an action such as queue, drop, mark, count, RSS, encap, or decap. Smart NIC projects use this to steer control-plane traffic, tenant ports, or tunnel flows before the packet ever touches CPU caches.

Minimal C Demo — RSS Queue Selection

RSS — 5-tuple hash to RETA queue — C Demo

6. § 9.5 — Memory Management: Huge Pages, Mempool & Mbuf

Overview — Pre-Allocate Everything the NIC Will Touch

DPDK avoids runtime allocation in the packet path. EAL maps pinned huge pages, builds NUMA-local memzones, then creates rte_mempool objects full of fixed-size rte_mbuf packet buffers. RX descriptors point directly at mbuf data rooms by IOVA, so the NIC can DMA packets into memory that userspace already owns.

Key Data Structures — Huge Page, Mempool, Mbuf

| Object | Fields / Parameters | Purpose |
| --- | --- | --- |
| Huge page | 2 MB or 1 GB, mmap()+mlock(), registered with VFIO | Large pinned DMA memory; reduces TLB misses and prevents swap |
| rte_mempool | name, n, cache_size, priv_size, data_room_size, socket_id | Fixed-size object allocator, usually for mbufs |
| Per-lcore cache | array of object pointers, usually 256-512 entries | Fast alloc/free without CAS or shared cache-line bouncing |
| Central ring | lock-free rte_ring backend | Bulk refill/drain path when local cache is empty or full |
| rte_mbuf | buf_addr, buf_iova, data_off, data_len, pkt_len, next, ol_flags | Packet metadata plus data room used by NIC and application |

rte_mbuf Layout

An mbuf begins with hot metadata, then packet headroom, packet bytes, and tailroom. The packet pointer is not always buf_addr; DPDK computes it as buf_addr + data_off, which is what rte_pktmbuf_mtod() returns.
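The relationship between buf_addr, data_off, headroom, and tailroom can be sketched with a cut-down structure. This is an illustration only: sim_mbuf and its helpers are hypothetical stand-ins that mirror what rte_pktmbuf_mtod(), rte_pktmbuf_prepend(), and rte_pktmbuf_append() do conceptually, and the 128-byte headroom is an assumed default.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_HEADROOM 128u   /* assumed headroom, similar to a common DPDK default */

struct sim_mbuf {
    uint8_t *buf_addr;   /* start of the data room */
    uint16_t buf_len;    /* total size of the data room */
    uint16_t data_off;   /* offset of the first packet byte */
    uint16_t data_len;   /* packet bytes in this segment */
};

static void sim_mbuf_reset(struct sim_mbuf *m, uint8_t *buf, uint16_t buf_len)
{
    m->buf_addr = buf;
    m->buf_len = buf_len;
    m->data_off = SIM_HEADROOM;
    m->data_len = 0;
}

/* Packet pointer: buf_addr + data_off, not buf_addr itself. */
static uint8_t *sim_mtod(const struct sim_mbuf *m)
{
    return m->buf_addr + m->data_off;
}

/* Grow the packet at the front (e.g. push a tunnel header) using headroom. */
static uint8_t *sim_prepend(struct sim_mbuf *m, uint16_t len)
{
    if (len > m->data_off)
        return NULL;                 /* not enough headroom */
    m->data_off = (uint16_t)(m->data_off - len);
    m->data_len = (uint16_t)(m->data_len + len);
    return sim_mtod(m);
}

/* Grow the packet at the end using tailroom; returns the start of the new bytes. */
static uint8_t *sim_append(struct sim_mbuf *m, uint16_t len)
{
    uint16_t tailroom = (uint16_t)(m->buf_len - m->data_off - m->data_len);
    uint8_t *start;
    if (len > tailroom)
        return NULL;                 /* not enough tailroom */
    start = sim_mtod(m) + m->data_len;
    m->data_len = (uint16_t)(m->data_len + len);
    return start;
}
```

Headroom is what lets encapsulation add an outer header by moving data_off backwards instead of copying the payload.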

Core Mechanism — Mempool Fast Path

Background: At 20 Mpps, a normal malloc/free per packet would destroy throughput through locks, metadata writes, and cache misses.

Plan: 1) allocate all mbufs at startup from huge pages, 2) let each lcore allocate from its local cache, 3) refill or drain in bulk from the central ring, 4) return mbufs to the same NUMA socket whenever possible.

Example: lcore 2 starts with 4 cached mbufs. It receives a burst of 32 packets, consumes its 4 local objects, then bulk-pulls 32 more from the central ring. The shared ring CAS cost is paid once for the batch, not once per packet.
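The cache-then-central-ring behaviour can be sketched as follows. This is a single-threaded simulation with assumed sizes (256 objects, a 32-entry cache) and hypothetical names; the real rte_mempool uses a lock-free ring for the shared pool and its own flush thresholds.

```c
#include <stdint.h>

#define POOL_SIZE  256u
#define CACHE_SIZE 32u

struct sim_pool {
    uint32_t free_objs[POOL_SIZE];  /* central free list (stand-in for the ring) */
    uint32_t free_count;
    uint32_t central_ops;           /* how many times the shared pool was touched */
};

struct sim_cache {
    uint32_t objs[CACHE_SIZE];      /* per-lcore cache, private to one core */
    uint32_t len;
};

static void pool_init(struct sim_pool *p)
{
    for (uint32_t i = 0; i < POOL_SIZE; i++)
        p->free_objs[i] = i;
    p->free_count = POOL_SIZE;
    p->central_ops = 0;
}

/* Allocate one object; refill the local cache in bulk only when it is empty. */
static int cache_alloc(struct sim_pool *p, struct sim_cache *c, uint32_t *obj)
{
    if (c->len == 0) {
        uint32_t n = p->free_count < CACHE_SIZE ? p->free_count : CACHE_SIZE;
        if (n == 0)
            return -1;                       /* pool exhausted */
        for (uint32_t i = 0; i < n; i++)
            c->objs[c->len++] = p->free_objs[--p->free_count];
        p->central_ops++;                    /* one shared access for n objects */
    }
    *obj = c->objs[--c->len];
    return 0;
}

/* Free one object; drain half the cache in bulk only when it is full. */
static void cache_free(struct sim_pool *p, struct sim_cache *c, uint32_t obj)
{
    if (c->len == CACHE_SIZE) {
        for (uint32_t i = 0; i < CACHE_SIZE / 2; i++)
            p->free_objs[p->free_count++] = c->objs[--c->len];
        p->central_ops++;
    }
    c->objs[c->len++] = obj;
}
```

In this sketch, 36 allocations touch the shared pool only twice, which is the amortization the example describes.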

Multi-Segment Mbuf Chain

Jumbo frames, TSO, and scatter-gather I/O use chained mbufs. The first mbuf stores pkt_len for the whole packet and nb_segs for the chain length; each segment stores its own data_len. The NIC can transmit the chain by DMA-reading each segment address, avoiding a linear copy.
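The bookkeeping rule for chains (pkt_len and nb_segs live in the first segment, data_len in every segment) can be sketched with a reduced structure. The sim_seg type and helpers are hypothetical; they follow the same idea as rte_pktmbuf_chain() without being its implementation.

```c
#include <stddef.h>
#include <stdint.h>

struct sim_seg {
    struct sim_seg *next;
    uint32_t pkt_len;   /* whole-packet length; meaningful in the first segment */
    uint16_t data_len;  /* bytes held by this segment */
    uint16_t nb_segs;   /* chain length; meaningful in the first segment */
};

static void seg_init(struct sim_seg *s, uint16_t data_len)
{
    s->next = NULL;
    s->pkt_len = data_len;
    s->data_len = data_len;
    s->nb_segs = 1;
}

/* Link `tail` (itself a chain head) after the last segment of `head`. */
static void chain_append(struct sim_seg *head, struct sim_seg *tail)
{
    struct sim_seg *last = head;
    while (last->next != NULL)
        last = last->next;
    last->next = tail;
    head->pkt_len += tail->pkt_len;
    head->nb_segs = (uint16_t)(head->nb_segs + tail->nb_segs);
}

/* Invariant: the head's totals equal the sum over all segments. */
static int chain_is_consistent(const struct sim_seg *head)
{
    uint32_t bytes = 0;
    uint16_t segs = 0;
    for (const struct sim_seg *s = head; s != NULL; s = s->next) {
        bytes += s->data_len;
        segs++;
    }
    return bytes == head->pkt_len && segs == head->nb_segs;
}
```

A 5000-byte jumbo frame split as 2048 + 2048 + 904 ends up with pkt_len 5000 and nb_segs 3 in the first segment.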

Minimal C Demo — Mempool + Mbuf Fast Path

Mempool and mbuf — cache refill and data pointer — C Demo

7. § 9.6 — NUMA-Aware Programming in DPDK

Overview — Keep NIC, Queue, Mempool, and Lcore on One Socket

NUMA is not a small tuning detail in DPDK. A packet path touches the NIC DMA engine, the RX descriptor ring, the mbuf metadata, the packet bytes, and the polling lcore. If any of those live on the wrong socket, every packet pays an inter-socket hop. At high PPS that becomes a throughput cliff; a drop on the order of 30–50% is commonly reported.

Wrong Placement — Remote Memory on Every Packet

The classic mistake is binding a NIC on socket 0, allocating its mempool on socket 0, but polling it from an lcore on socket 1. The NIC DMA is local, but the CPU reads packet bytes and updates mbuf metadata through the interconnect. The fix is mechanical: use rte_eth_dev_socket_id(port) for the NIC, allocate with that socket_id, and assign only lcores whose rte_socket_id() matches.

Key Data Structures — NUMA Placement Inputs

| API / File | Returns | How to use it |
| --- | --- | --- |
| rte_eth_dev_socket_id(port) | NUMA socket of the PCI device | Choose mempool socket and lcore set for this port |
| rte_socket_id() | NUMA socket of the current lcore | Validate that the polling lcore is local to the port |
| /sys/bus/pci/devices/<BDF>/numa_node | Kernel view of the NIC's NUMA node | Debug bad topology or BIOS/ACPI reporting issues |
| rte_malloc_socket(size, align, socket) | Memory from a specific socket | Allocate per-lcore flow tables, stats, rings, and queues locally |
| rte_ring_create(name, count, socket, flags) | Ring metadata and slots on a socket | For cross-socket rings, bias toward the consumer socket |

Core Mechanism — Locality Walkthrough

Background: A dual-socket host has port 0 attached to socket 0. You need to decide where to allocate the mempool and which lcore should poll RX queue 0.

Plan: 1) read the NIC socket, 2) allocate the mempool and descriptor rings on that socket, 3) pin the polling lcore to the same socket, 4) keep per-lcore tables and stats on the same socket, 5) use rings only when a packet must cross sockets.

Example: port 0 reports socket 0. Mempool mp0 is created with socket 0. lcore 2 also reports socket 0, so it polls queue 0. Packet bytes DMA into socket 0 memory, DDIO places them in the local LLC, and lcore 2 reads them without a remote hop.
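The placement rule reduces to comparing three socket ids. The sketch below assumes the caller has already collected them (in a real application from rte_eth_dev_socket_id(), the socket_id passed at mempool creation, and rte_socket_id() on the polling lcore); the struct and function names are hypothetical.

```c
/* Bit flags describing what is on the wrong socket. */
enum {
    PLACEMENT_OK             = 0,
    PLACEMENT_REMOTE_MEMPOOL = 1 << 0,  /* mbufs live on a different socket than the NIC */
    PLACEMENT_REMOTE_LCORE   = 1 << 1   /* polling core is on a different socket than the NIC */
};

struct placement {
    int port_socket;     /* NUMA node the NIC is attached to */
    int mempool_socket;  /* NUMA node the mempool was allocated on */
    int lcore_socket;    /* NUMA node of the polling lcore */
};

/* Returns PLACEMENT_OK or a bitmask of the locality problems found. */
static int placement_check(const struct placement *p)
{
    int problems = PLACEMENT_OK;
    if (p->mempool_socket != p->port_socket)
        problems |= PLACEMENT_REMOTE_MEMPOOL;
    if (p->lcore_socket != p->port_socket)
        problems |= PLACEMENT_REMOTE_LCORE;
    return problems;
}
```

Running such a check once at startup and refusing to start on a mismatch is cheaper than diagnosing a throughput cliff later.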

Minimal C Demo — NUMA Placement Check

NUMA placement — NIC, mempool, lcore locality — C Demo

8. § 9.7 — Lock-Free Ring Buffer: rte_ring Deep Dive

Overview — Bounded FIFO Without Locks

rte_ring is DPDK's shared queue primitive. It is a power-of-two circular array of object pointers with separate producer and consumer cursors. Multi-producer mode uses CAS to reserve a range of slots, then publishes the range by advancing prod.tail. Single-producer and single-consumer modes remove the CAS and become plain loads/stores plus barriers.

Key Data Structures — Head/Tail Split

| Field | Writer | Purpose |
| --- | --- | --- |
| prod.head | producer CAS or store | Reservation cursor; producers claim slots by moving this first |
| prod.tail | producer store after copy | Publication cursor; consumers cannot see objects until this advances |
| cons.head | consumer CAS or store | Reservation cursor for dequeue |
| cons.tail | consumer store after read | Publication cursor; producers use it to calculate free space |
| mask | constant | Wrap index with index & mask instead of slow modulo |
| ring[] | producers write, consumers read | Contiguous array of void* object pointers |

Core Mechanism — Multi-Producer Enqueue

Background: Two worker lcores need to enqueue packets to one TX lcore without a mutex. Both may enter the enqueue path at the same time.

Plan: 1) reserve slots by CAS-ing prod.head, 2) copy objects into the reserved slots, 3) issue a write barrier, 4) wait until any earlier producer has published, 5) advance prod.tail.

Example: producer A reserves slots 8–11 and producer B reserves slots 12–15. B may finish copying first, but it cannot publish tail 16 until A publishes tail 12. That spin-wait preserves FIFO order for the consumer.
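The reserve, copy, wait, publish sequence can be sketched with C11 atomics. This is a simplified model with hypothetical names, not the rte_ring source: it uses free-running 32-bit cursors, all-or-nothing bulk semantics, and a single-consumer dequeue to keep the block short.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 16u                  /* power of two */
#define RING_MASK  (RING_SLOTS - 1u)

struct sim_ring {
    _Atomic uint32_t prod_head, prod_tail;
    _Atomic uint32_t cons_head, cons_tail;
    void *slots[RING_SLOTS];
};

static void ring_init(struct sim_ring *r)
{
    atomic_init(&r->prod_head, 0);
    atomic_init(&r->prod_tail, 0);
    atomic_init(&r->cons_head, 0);
    atomic_init(&r->cons_tail, 0);
}

/* Multi-producer enqueue of exactly n objects; returns n, or 0 if no room. */
static unsigned ring_mp_enqueue_bulk(struct sim_ring *r, void *const *objs, unsigned n)
{
    uint32_t head, next;

    do {    /* 1) reserve [head, head + n) by CAS on prod_head */
        head = atomic_load_explicit(&r->prod_head, memory_order_relaxed);
        uint32_t cons_tail = atomic_load_explicit(&r->cons_tail, memory_order_acquire);
        if (n > RING_SLOTS - (head - cons_tail))
            return 0;
        next = head + n;
    } while (!atomic_compare_exchange_weak_explicit(&r->prod_head, &head, next,
                 memory_order_relaxed, memory_order_relaxed));

    for (unsigned i = 0; i < n; i++)    /* 2) copy into the reserved slots */
        r->slots[(head + i) & RING_MASK] = objs[i];

    /* 3) wait until every earlier producer has published its range */
    while (atomic_load_explicit(&r->prod_tail, memory_order_relaxed) != head)
        ;   /* spin */

    /* 4) publish: the release store makes the slot writes visible first */
    atomic_store_explicit(&r->prod_tail, next, memory_order_release);
    return n;
}

/* Single-consumer dequeue of exactly n objects; returns n, or 0 if too few. */
static unsigned ring_sc_dequeue_bulk(struct sim_ring *r, void **objs, unsigned n)
{
    uint32_t head = atomic_load_explicit(&r->cons_head, memory_order_relaxed);
    uint32_t prod_tail = atomic_load_explicit(&r->prod_tail, memory_order_acquire);

    if (n > prod_tail - head)
        return 0;
    for (unsigned i = 0; i < n; i++)
        objs[i] = r->slots[(head + i) & RING_MASK];
    atomic_store_explicit(&r->cons_head, head + n, memory_order_relaxed);
    atomic_store_explicit(&r->cons_tail, head + n, memory_order_release);
    return n;
}
```

The spin in step 3 is what forces producer B to publish after producer A, which keeps the consumer's view in FIFO order.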

Minimal C Demo — MPMC Enqueue

rte_ring MPMC enqueue — CAS reserve, ordered publish — C Demo

9. § 9.8 — Cache Optimization in DPDK

Overview — Performance Is Cache-Line Ownership

DPDK's hot path is designed around cache lines: align frequently written fields, avoid false sharing, prefetch packet metadata before use, and keep data owned by the lcore that mutates it. At 20 Mpps, a single remote cache-line transfer can cost more than the useful packet work.

Key Techniques — Alignment, Prefetch, MESI, DDIO

| Technique | What it prevents | DPDK pattern |
| --- | --- | --- |
| __rte_cache_aligned | False sharing between unrelated hot fields | Place prod and cons ring cursors on separate 64-byte cache lines |
| rte_prefetch0() | Waiting on DRAM or LLC when packet is first touched | Prefetch mbuf i+3 while processing mbuf i |
| Per-lcore data | MESI S->I invalidation traffic across cores | Stats, flow caches, and scratch buffers owned by one lcore |
| DDIO | Packet DMA landing only in DRAM | NIC writes packet bytes into LLC, so PMD reads at cache latency |
| Write combining | Many small PCIe doorbell writes | Batch tail updates and map BAR regions WC where supported |

Core Mechanism — Prefetch Pipeline

Background: The PMD receives a burst of 32 mbufs. Each mbuf points to packet bytes that may be in LLC because of DDIO, but the metadata and payload still need to be pulled into L1 before parsing.

Plan: 1) process the current mbuf, 2) prefetch a future mbuf 3–4 packets ahead, 3) keep the CPU doing useful parsing while the cache hierarchy fetches the future packet, 4) tune the distance so it hides latency without evicting useful data.

Example: while parsing mbuf[0], the loop prefetches mbuf[3]. By the time the loop reaches mbuf[3], its first cache lines are already in L1.
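A prefetch-ahead loop of this shape can be sketched with the GCC/Clang builtin __builtin_prefetch, used here as a stand-in for rte_prefetch0(). The sim_pkt type, the distance of 3, and the trivial "parse" step are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DISTANCE 3u    /* tune per platform; 3-4 is the range discussed above */

struct sim_pkt {
    uint8_t hdr[64];            /* first cache line of the packet */
};

/* Process a burst while pulling a future packet toward L1. */
static uint32_t process_burst(struct sim_pkt *const *pkts, size_t n)
{
    uint32_t acc = 0;

    for (size_t i = 0; i < n; i++) {
        /* Ask for packet i+3 now so it is (likely) cached when the loop gets there. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(pkts[i + PREFETCH_DISTANCE]->hdr, 0, 3);

        /* Stand-in for header parsing: touch the current packet's first byte. */
        acc += pkts[i]->hdr[0];
    }
    return acc;
}
```

The prefetch is only a hint, so the loop's result is identical with or without it; only the latency profile changes.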

Minimal C Demo — Prefetch-Ahead Loop

Cache optimization — prefetch-ahead burst loop — C Demo

10. § 9.9 — ACL & LPM Classification Libraries

Overview — Classify Packets Before the Slow Path

Fast packet processing is mostly classification: decide which route, tenant, ACL rule, or session a packet belongs to without touching a long chain of branches. DPDK provides three core libraries for that job: rte_lpm for IPv4 longest-prefix route lookup, rte_acl for multi-field packet rules, and rte_hash for exact-match flow state.

Key Data Structures — LPM, ACL, Hash

| Library | Important fields / shape | Purpose |
| --- | --- | --- |
| rte_lpm | tbl24[2^24] plus optional tbl8 groups; entry has valid, ext_entry, depth, next_hop | IPv4 route lookup in one or two memory accesses |
| rte_lpm6 | multi-stride trie for 128-bit IPv6 prefixes | IPv6 route lookup without a 2^64 direct table |
| rte_acl | compiled DFA/trie from rules over src/dst IP, ports, protocol, priority | Batch ACL lookup using SIMD state transitions |
| rte_hash | cuckoo buckets, signatures, key store, optional data pointer | Exact 5-tuple/session lookup with two candidate buckets |

Core Mechanism — DIR-24-8 Longest Prefix Match

Background: A router must map every destination IP to the most specific route. A trie works but costs several dependent memory loads per packet.

Plan: 1) use the top 24 bits as a direct array index, 2) return immediately for /0 through /24 routes, 3) follow one extra 256-entry table only for /25 through /32 routes, 4) store the selected next hop in the matching entry.

Example: route 203.0.113.0/24 has next hop 3. A more specific 203.0.113.200/32 has next hop 7. The first 24 bits index tbl24; the ext_entry bit tells lookup to read the low 8 bits from tbl8 and return 7 for that one host.
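The two-level lookup can be sketched as below. This is a deliberately reduced model, not rte_lpm: the 16-bit entry encoding is hypothetical, only /24 and /32 routes are supported, and a /24 must be added before any /32 inside it because the sketch does not track prefix depth.

```c
#include <stdint.h>
#include <stdlib.h>

#define LPM_VALID 0x8000u       /* entry holds a route */
#define LPM_EXT   0x4000u       /* value is a tbl8 group index, not a next hop */
#define LPM_VALUE 0x3fffu
#define TBL24_ENTRIES (1u << 24)
#define TBL8_GROUP 256u
#define MAX_GROUPS 16u
#define IPV4(a, b, c, d) (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
                          ((uint32_t)(c) << 8) | (uint32_t)(d))

struct sim_lpm {
    uint16_t *tbl24;                            /* indexed by the top 24 bits */
    uint16_t tbl8[MAX_GROUPS * TBL8_GROUP];     /* 256-entry groups for /25../32 */
    uint32_t groups_used;
};

static struct sim_lpm *lpm_create(void)
{
    struct sim_lpm *lpm = calloc(1, sizeof(*lpm));
    if (lpm == NULL)
        return NULL;
    lpm->tbl24 = calloc(TBL24_ENTRIES, sizeof(uint16_t));
    if (lpm->tbl24 == NULL) {
        free(lpm);
        return NULL;
    }
    return lpm;
}

static void lpm_free(struct sim_lpm *lpm)
{
    if (lpm != NULL) {
        free(lpm->tbl24);
        free(lpm);
    }
}

/* Add a /24 route. Sketch limitation: fails if a /32 already extended this entry. */
static int lpm_add24(struct sim_lpm *lpm, uint32_t ip, uint16_t next_hop)
{
    uint16_t *e = &lpm->tbl24[ip >> 8];
    if (*e & LPM_EXT)
        return -1;
    *e = (uint16_t)(LPM_VALID | (next_hop & LPM_VALUE));
    return 0;
}

/* Add a /32 route: extend the tbl24 entry into a tbl8 group on first use. */
static int lpm_add32(struct sim_lpm *lpm, uint32_t ip, uint16_t next_hop)
{
    uint16_t *e = &lpm->tbl24[ip >> 8];
    if (!(*e & LPM_EXT)) {
        if (lpm->groups_used == MAX_GROUPS)
            return -1;
        uint32_t g = lpm->groups_used++;
        for (uint32_t i = 0; i < TBL8_GROUP; i++)
            lpm->tbl8[g * TBL8_GROUP + i] = *e;     /* inherit the covering /24, if any */
        *e = (uint16_t)(LPM_VALID | LPM_EXT | g);
    }
    lpm->tbl8[(*e & LPM_VALUE) * TBL8_GROUP + (ip & 0xffu)] =
        (uint16_t)(LPM_VALID | (next_hop & LPM_VALUE));
    return 0;
}

/* One load for /24 and shorter, two loads when the entry is extended. */
static int lpm_lookup(const struct sim_lpm *lpm, uint32_t ip, uint16_t *next_hop)
{
    uint16_t e = lpm->tbl24[ip >> 8];
    if ((e & LPM_VALID) && (e & LPM_EXT))
        e = lpm->tbl8[(e & LPM_VALUE) * TBL8_GROUP + (ip & 0xffu)];
    if (!(e & LPM_VALID))
        return -1;
    *next_hop = (uint16_t)(e & LPM_VALUE);
    return 0;
}
```

After adding 203.0.113.0/24 with next hop 3 and 203.0.113.200/32 with next hop 7, a lookup of 203.0.113.200 takes the tbl8 path and returns 7, while 203.0.113.5 inherits 3 from the covering /24.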

Cuckoo Hashing — Exact Flow Lookup

rte_hash uses cuckoo hashing: every key has two candidate buckets, each bucket stores compact signatures for SIMD-friendly comparison, and insertions relocate existing keys only when both buckets are full. This keeps lookup predictable: compute two hashes, compare bucket signatures, then verify the full key on a signature hit.

Minimal C Demo — DIR-24-8 Lookup

LPM DIR-24-8 — direct table plus /32 exception — C Demo

11. § 9.10 — SR-IOV in DPDK

Overview — Split One NIC into Hardware Tenants

SR-IOV exposes one physical NIC as a Physical Function (PF) plus many Virtual Functions (VFs). The PF owns full device control: VF creation, MAC/VLAN filters, link settings, and policy. Each VF appears as its own PCIe function with private RX/TX queues and can be bound to vfio-pci for a DPDK app, container, or VM passthrough datapath.

Key Data Structures — PF, VF, Queue, IOMMU Domain

| Object | Fields / ownership | Purpose |
| --- | --- | --- |
| PF | full PCIe function, admin queues, VF control registers | Creates VFs and programs per-VF policy |
| VF | separate BDF, queue pairs, MAC/VLAN filters, limited registers | Tenant-facing datapath with near-native DMA |
| IOMMU group | VFIO container, group fd, DMA mappings | Prevents one VF from DMA-writing another tenant's memory |
| PF flow policy | MAC, VLAN, ethertype, queue, VF id | Steers hardware-classified traffic to the right VF |

Core Mechanism — VF Passthrough Walkthrough

Background: A VM needs low-latency networking, but assigning the whole PF would give it control over every tenant on the NIC.

Plan: 1) enable VFs from the PF, 2) bind a VF to vfio-pci, 3) map guest or DPDK hugepage memory through the IOMMU, 4) configure VF MAC/VLAN rules from the PF, 5) let the VF poll its private queues directly.

Example: VF 3 is assigned to tenant A with MAC 52:54:00:aa:00:03 and VLAN 120. The PF programs filters so only those packets enter VF 3 queues; VF 3 DMA is restricted to tenant A memory by the IOMMU.

12. § 9.11 — Hardware Offload

Overview — Move Mechanical Packet Work into the NIC

Hardware offload removes repetitive per-packet work from the CPU: checksum generation, checksum verification, TCP segmentation, tunnel checksum handling, timestamping, and flow steering. In DPDK, the application still owns the packet; it marks the mbuf with ol_flags and length fields so the PMD can describe the operation in the TX descriptor.

Key Data Structures — Offload Contract

| Field / flag | Direction | Meaning |
| --- | --- | --- |
| RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | TX config | Port capability bit: NIC can compute IPv4 header checksum |
| RTE_MBUF_F_TX_TCP_CKSUM | mbuf TX | This packet needs TCP checksum computed by NIC |
| RTE_MBUF_F_TX_TCP_SEG | mbuf TX | This packet is a large TCP segment and needs TSO |
| l2_len / l3_len / l4_len | mbuf TX | Header boundaries; NIC needs them to find checksum fields |
| tso_segsz | mbuf TX | MSS used by NIC when splitting a large segment |
| RTE_MBUF_F_RX_IP_CKSUM_GOOD | mbuf RX | NIC verified checksum and reported success |
| RTE_ETH_RX_OFFLOAD_TIMESTAMP | RX config | NIC writes hardware timestamp metadata for latency/PTP |

Core Mechanism — TSO and Checksum Offload

Background: A TCP sender wants to transmit 64 KB of payload. If the CPU splits it into MTU-sized packets and computes every checksum, it burns cycles on work the NIC can do while transmitting.

Plan: 1) build one large mbuf chain, 2) set header lengths and tso_segsz, 3) set TX offload flags, 4) hand the descriptor to the NIC, 5) let hardware segment and compute per-frame checksums.

Example: an mbuf with 64 KB TCP payload, MSS 1460, IPv4 checksum offload and TCP TSO flags becomes roughly 45 Ethernet frames on the wire. The CPU rings one TX doorbell; the NIC emits correctly checksummed frames.
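The "roughly 45 frames" figure is a ceiling division of payload by MSS. A minimal sketch, with hypothetical helper names, that also gives the size of the final short segment:

```c
#include <stdint.h>

/* Number of wire frames the NIC emits for one TSO request. */
static uint32_t tso_frame_count(uint32_t payload_bytes, uint16_t mss)
{
    if (mss == 0)
        return 0;
    return (payload_bytes + mss - 1u) / mss;    /* ceiling division */
}

/* TCP payload carried by the last frame (equals mss when payload divides evenly). */
static uint32_t tso_last_segment_bytes(uint32_t payload_bytes, uint16_t mss)
{
    uint32_t rem;
    if (mss == 0 || payload_bytes == 0)
        return 0;
    rem = payload_bytes % mss;
    return rem != 0 ? rem : mss;
}
```

For 65536 bytes of payload and an MSS of 1460, this gives 45 frames: 44 full segments plus a final 1296-byte segment.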

13. § 9.12 — SIMD in DPDK

Overview — One Instruction, Many Packets

DPDK uses SIMD where the packet path repeats the same operation across a burst: descriptor status checks, ACL state transitions, hash bucket signature comparison, checksum arithmetic, LPM batch lookup, and optimized memory copies. The scalar mental model is still a loop over packets, but the CPU executes several lanes in parallel with SSE4.2, AVX2, or AVX-512.

Key Data Structures — SIMD-Friendly Batches

| Feature | Width | DPDK use case |
| --- | --- | --- |
| SSE4.2 | 128-bit, 4 x 32-bit lanes | CRC32 hash acceleration, smaller ACL/classify batches |
| AVX2 | 256-bit, 8 x 32-bit lanes | Batch compares, rte_memcpy(), hash signatures, descriptor status checks |
| AVX-512 | 512-bit, 16 x 32-bit lanes | Wide ACL and lookup kernels on servers that can absorb the frequency trade-off |
| rte_cpu_get_flag_enabled() | runtime feature probe | Select the fastest safe implementation for this CPU |
| rte_mov16()/rte_mov32() | fixed-size vector copies | Move packet headers and descriptors with predictable codegen |

Core Mechanism — Batch Classification

Background: A firewall receives 32 packets from rte_eth_rx_burst(). Checking every packet through scalar branches wastes instruction bandwidth and mispredicts on mixed traffic.

Plan: 1) gather the same field from multiple mbufs, 2) load them into vector lanes, 3) compare all lanes against the rule value, 4) convert the result to a bit mask, 5) process only the matching packets.

Example: eight destination IPs are compared against 10.0.1.0/24 in one AVX2-style operation. A result mask of 0x12 means lanes 1 and 4 matched.
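The lane-to-bit mapping in this example can be shown with a scalar loop that has the same shape as the vector version: eight independent compares folded into one mask. It is written in plain C (a compiler may or may not vectorize it); the function name and IPV4 macro are illustrative, and a real AVX2 kernel would use a masked compare followed by a movemask-style instruction instead.

```c
#include <stdint.h>

#define IPV4(a, b, c, d) (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
                          ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Compare eight destination IPs against one prefix; bit N of the result
 * is set when lane N matches. */
static uint8_t match_prefix_mask8(const uint32_t dst_ip[8],
                                  uint32_t prefix, uint32_t netmask)
{
    uint8_t mask = 0;

    for (unsigned lane = 0; lane < 8; lane++) {
        unsigned hit = (dst_ip[lane] & netmask) == prefix;  /* 0 or 1, no branch needed */
        mask |= (uint8_t)(hit << lane);
    }
    return mask;
}
```

With 10.0.1.5 in lane 1 and 10.0.1.200 in lane 4, matching against 10.0.1.0/24 yields 0x12, so only those two packets need further processing.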

Minimal C Demo — Offload Flags + SIMD-Style Mask

Offload and SIMD — mbuf flags plus batch match mask — C Demo

14. § 9.13 — Virtio, vhost-user & vDPA

Overview — Paravirtual Networking for VMs

Virtio gives a VM a simple paravirtual NIC: the guest driver writes packet buffers into virtqueues, and a host backend consumes those descriptors. With vhost-user, the backend is a DPDK userspace process such as OVS-DPDK, connected to QEMU over a Unix domain socket for control and shared hugepage memory for data. With vDPA, the NIC itself implements the virtio datapath.

Key Data Structures — Split Virtqueue

| Structure | Fields | Purpose |
| --- | --- | --- |
| Descriptor table | addr, len, flags, next | Guest-owned packet buffers; chains describe scatter-gather packets |
| Available ring | flags, idx, ring[] | Driver-to-device queue of descriptor head indexes ready for backend work |
| Used ring | flags, idx, used_elem{id,len} | Device-to-driver completion queue after packets are consumed or produced |
| Packed virtqueue | single descriptor ring plus wrap counters | Virtio 1.1 format with fewer cache misses than split rings |
| Feature bits | CSUM, MRG_RXBUF, MQ, packed-ring | Negotiated contract between virtio driver and backend |

vhost-user — Control Socket, Shared-Memory Datapath

QEMU remains the frontend, but packet movement goes through a DPDK backend. The Unix socket carries setup messages such as memory tables, vring addresses, feature bits, and eventfd file descriptors. Packet bytes stay in shared hugepage memory, so the backend can walk guest virtqueue descriptors directly and forward bursts without a kernel tap device.

vDPA — Hardware Virtio Datapath

vDPA moves the virtio data path into the NIC. Software still handles control-plane details such as feature negotiation, queue setup, and migration state, but packet DMA goes directly between the NIC and the guest virtio rings. The VM keeps the standard virtio-net driver while reaching near passthrough performance.

Core Mechanism — vhost-user Packet Receive

Background: A VM sends or receives packets through virtio-net, but the host wants DPDK throughput instead of kernel tap networking.

Plan: 1) QEMU sends guest memory and vring addresses to the DPDK backend, 2) the guest publishes descriptors in the available ring, 3) the backend maps descriptor guest physical addresses to host virtual addresses, 4) the backend copies or attaches packet data into mbufs, 5) it updates the used ring and optionally signals the guest.

Example: the guest publishes desc 5 for a 2048-byte RX buffer. OVS-DPDK reads desc 5 through shared memory, writes a packet into that buffer, places desc 5 in the used ring with the received length, then signals the guest through the call eventfd unless the guest has suppressed notifications.
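The avail-to-used handoff can be sketched with a small split-ring model. The structures follow the field names in the table above, but the queue size, the sim_vq wrapper, and both helper functions are hypothetical; memory barriers, address translation, and notifications are left out to keep the data flow visible.

```c
#include <stdint.h>

#define VQ_SIZE 8u

struct vq_desc      { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vq_avail     { uint16_t flags; uint16_t idx; uint16_t ring[VQ_SIZE]; };
struct vq_used_elem { uint32_t id; uint32_t len; };
struct vq_used      { uint16_t flags; uint16_t idx; struct vq_used_elem ring[VQ_SIZE]; };

struct sim_vq {
    struct vq_desc  desc[VQ_SIZE];
    struct vq_avail avail;          /* written by the guest driver */
    struct vq_used  used;           /* written by the backend */
    uint16_t last_avail_idx;        /* backend-private read cursor */
};

/* Guest side: describe an empty RX buffer and publish it in the available ring. */
static int guest_post_rx_buffer(struct sim_vq *vq, uint16_t id, uint64_t gpa, uint32_t len)
{
    if (id >= VQ_SIZE)
        return -1;
    vq->desc[id].addr = gpa;        /* guest-physical address of the buffer */
    vq->desc[id].len = len;
    vq->desc[id].flags = 0;
    vq->desc[id].next = 0;
    vq->avail.ring[vq->avail.idx % VQ_SIZE] = id;
    vq->avail.idx++;                /* a real driver issues a write barrier before this */
    return 0;
}

/* Backend side: take the next available buffer, "fill" it, report it as used.
 * Returns the descriptor id, or -1 if nothing is available or the packet does not fit. */
static int backend_deliver(struct sim_vq *vq, uint32_t pkt_len)
{
    uint16_t id;
    struct vq_used_elem *u;

    if (vq->last_avail_idx == vq->avail.idx)
        return -1;                              /* guest has posted nothing new */
    id = vq->avail.ring[vq->last_avail_idx % VQ_SIZE];
    if (pkt_len > vq->desc[id].len)
        return -1;                              /* buffer too small for this packet */
    vq->last_avail_idx++;

    /* A real backend translates desc[id].addr to a host pointer and copies here. */
    u = &vq->used.ring[vq->used.idx % VQ_SIZE];
    u->id = id;
    u->len = pkt_len;
    vq->used.idx++;
    return id;
}
```

Posting descriptor 5 with a 2048-byte buffer and then delivering a 1500-byte packet leaves used.ring[0] = {5, 1500} and used.idx = 1, which is what the guest driver reads to find its completed buffer.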

Minimal C Demo — Virtqueue Descriptor Handoff

virtio split queue — avail ring to used ring — C Demo

15. § 9.14 — Pipeline & Run-to-Completion Models

Overview — Two Ways to Assign Work to Cores

A DPDK datapath either keeps the full packet lifecycle on one lcore (run-to-completion) or splits work into stages connected by rte_ring queues (pipeline). The right choice is a cache-locality decision: simple L2/L3 forwarding usually prefers R2C; expensive or uneven work such as crypto, DPI, or control-plane punts often needs a pipeline.

Pipeline Model — Specialized Stages

Pipeline mode dedicates lcores to stages: RX, worker, TX, or a richer chain such as parse → ACL → crypto → route → transmit. Each handoff is a pointer enqueue/dequeue through an rte_ring, which costs cache-line traffic but lets bottleneck stages scale independently.

Key Design Trade-Offs

| Model | Strength | Cost | Best fit |
| --- | --- | --- | --- |
| Run-to-completion | No inter-core handoff; packet metadata stays hot in one cache | One lcore must perform every operation; hard to isolate slow work | L2/L3 forwarding, NAT, firewall fast path |
| Pipeline | Specialized stages; add workers only where the bottleneck is | Ring handoff, cache misses, more latency and backpressure logic | Crypto, DPI, complex service chains |
| Hybrid | R2C for common flows, pipeline for slow or exceptional flows | Two code paths and careful state sharing | Virtual switches and gateways with fast/slow path split |

Core Mechanism — Choosing R2C or Pipeline

Background: A gateway receives 32-packet bursts. Most packets only need session lookup and forwarding, but a small fraction need slow ACL and crypto work.

Plan: 1) use R2C for session hits so packet cache lines stay on the polling lcore, 2) enqueue session misses or crypto packets to a worker ring, 3) return processed packets to a TX ring, 4) measure the worker stage and add lcores only there.

Example: packets 0–29 hit the per-lcore session cache and transmit immediately on lcore 2. Packets 30–31 miss and are enqueued to a worker. The worker creates sessions, then sends those two packets to the TX lcore. The hot path avoids ring handoff for 30 out of 32 packets.
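The hybrid split in this example amounts to a per-packet dispatch decision. The sketch below models it with a hit/miss flag per packet and a fixed-size array standing in for the worker rte_ring; all names and the drop-on-full policy are assumptions for illustration.

```c
#include <stdint.h>

#define WORKER_RING_CAP 32u

struct dispatch_result {
    unsigned fast_path_tx;                  /* handled run-to-completion on this lcore */
    unsigned to_worker;                     /* handed off to the slow-path stage */
    unsigned dropped;                       /* worker ring full (backpressure) */
    unsigned worker_ring[WORKER_RING_CAP];  /* packet indexes queued for the worker */
};

/* session_hit[i] != 0 means packet i found a cached session on this lcore. */
static struct dispatch_result dispatch_burst(const uint8_t *session_hit, unsigned n)
{
    struct dispatch_result r = {0};

    for (unsigned i = 0; i < n; i++) {
        if (session_hit[i])
            r.fast_path_tx++;                           /* stays hot in this core's cache */
        else if (r.to_worker < WORKER_RING_CAP)
            r.worker_ring[r.to_worker++] = i;           /* pointer handoff in a real app */
        else
            r.dropped++;
    }
    return r;
}
```

For a 32-packet burst where only packets 30 and 31 miss, the result is 30 fast-path transmissions and 2 handoffs, so the ring cost is paid on the exceptional packets only.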

Minimal C Demo — R2C vs Pipeline Flow

R2C and pipeline — same packets, different ownership — C Demo

16. § 9.15 — DPDK Hot Upgrade — Deep Dive

Overview — Upgrade Without Losing Datapath State

Hot upgrade is hard because a DPDK process owns hugepage memory, NIC queues, timers, session tables, and in-flight packet bursts. The practical design is to separate durable datapath state from process-local execution state: keep sessions and flow tables in named shared memzones, attach a new process to the same hugepages, validate compatibility, then swap traffic ownership with a short drain window.

Shared Memory Persistence

Hugepage files in hugetlbfs can outlive one binary execution when the application controls cleanup. A new binary using the same --file-prefix can locate named memzones, read a schema version and generation number, and import or migrate existing state. This preserves session continuity even when NIC queues must be reconfigured.

Key State Boundaries

| State | Upgrade handling | Risk |
| --- | --- | --- |
| Session table | Store in rte_memzone with schema version and generation | Old and new layout mismatch can corrupt forwarding decisions |
| Flow table / NAT bindings | Persist keys, actions, timeout, counters; rebuild hardware rte_flow rules | Hardware rules may briefly lag software state |
| Descriptor rings | Usually process/NIC-local; drain or reinitialize rather than share blindly | Ownership transfer bugs can duplicate or lose packets |
| Timers | Serialize next expiry or rebuild from session timestamps | Expired sessions may survive too long after restart |
| In-flight bursts | Stop admission, drain rings, then swap active process | Small transition window can still drop packets |

Core Mechanism — Multi-Process Hot Upgrade

Background: A virtual switch holds millions of sessions and must upgrade the binary without rebuilding those sessions from scratch.

Plan: 1) old primary exports sessions into versioned memzones, 2) new secondary attaches to the same hugepages, 3) new process validates schema and warms flow caches, 4) old process stops accepting new sessions and drains rings, 5) traffic ownership switches, 6) old process exits after final counters are merged.

Example: version N stores each NAT session as key, translated tuple, timeout, and counters. Version N+1 reads generation 42, sees schema 1 is supported, rebuilds its per-lcore cache from the shared table, installs required hardware rte_flow rules, and then becomes active.

Minimal C Demo — Shared State Handoff
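
The handoff described above can be sketched in plain C without linking DPDK. This is a simulation under stated assumptions: struct state_zone stands in for a named memzone (in a real deployment the region would be reserved by the old process and found by name in the new one), and ZONE_MAGIC, zone_export(), zone_import(), and the schema-1 session layout are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ZONE_MAGIC       0x53455353u /* hypothetical marker, ASCII "SESS" */
#define SCHEMA_SUPPORTED 1u
#define MAX_SESSIONS     8u

/* Assumed schema-1 layout of one NAT session. */
struct nat_session {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint32_t nat_ip;
    uint16_t nat_port;
    uint16_t timeout_s;
    uint64_t packets;
};

/* Stand-in for a named memzone that both processes can map. */
struct state_zone {
    uint32_t magic;
    uint32_t schema;
    uint64_t generation;
    uint32_t count;
    struct nat_session sessions[MAX_SESSIONS];
};

/* Old process: copy sessions in, then stamp the header fields. */
int zone_export(struct state_zone *z, const struct nat_session *s,
                uint32_t n, uint64_t generation)
{
    if (n > MAX_SESSIONS)
        return -1;
    memcpy(z->sessions, s, n * sizeof(*s));
    z->count = n;
    z->schema = SCHEMA_SUPPORTED;
    z->generation = generation;
    z->magic = ZONE_MAGIC;
    return 0;
}

/* New process: refuse to import unless magic, schema, and count check out. */
int zone_import(const struct state_zone *z, struct nat_session *local,
                uint32_t cap, uint64_t *generation)
{
    if (z->magic != ZONE_MAGIC || z->schema != SCHEMA_SUPPORTED)
        return -1;
    if (z->count > MAX_SESSIONS || z->count > cap)
        return -1;
    memcpy(local, z->sessions, z->count * sizeof(*local));
    *generation = z->generation;
    return (int)z->count;
}

/* Runs the handoff once; returns the number of sessions imported. */
int hot_upgrade_demo(void)
{
    static struct state_zone zone;   /* zero-initialized "memzone" */
    struct nat_session live[2] = {
        /* 10.0.0.1:12345 -> 203.0.113.8:443, translated to 10.0.0.9:40000 */
        { 0x0A000001u, 0xCB007108u, 12345, 443, 0x0A000009u, 40000, 300, 17 },
        { 0x0A000002u, 0xCB007108u, 23456, 443, 0x0A000009u, 40001, 300, 4 },
    };
    struct nat_session imported[MAX_SESSIONS];
    uint64_t gen = 0;

    if (zone_export(&zone, live, 2, 42) != 0)
        return -1;
    int n = zone_import(&zone, imported, MAX_SESSIONS, &gen);
    assert(n < 0 || gen == 42);
    return n;
}
```

Calling hot_upgrade_demo() from main() exercises one export and one import. The sketch leaves out draining rings, rebuilding rte_flow rules, and the memory ordering a real cross-process handoff would need.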


17. § 9.16 — Session Management: Fast Path & Slow Path

Overview — First Packet Pays, Later Packets Hit Cache

Session management turns expensive per-packet decisions into a cached forwarding action. The first packet of a flow misses the table and goes through ACL, routing, NAT, and policy logic. The created session stores the 5-tuple, translated tuple, output port, timeout, and QoS class so later packets run through a per-lcore hash lookup and transmit immediately.

Slow Path — Build the Cached Decision

On a session miss, the packet leaves the hot path. The slow path may run on the same lcore for simple gateways or be enqueued to a control-plane worker through an rte_ring. The important rule is that session publication is controlled: per-lcore tables avoid locks, while global tables need RCU or writer-side locks so readers never see half-initialized state.
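
The "build the entry, then publish" rule for shared tables can be illustrated with C11 atomics. This is a minimal sketch, not DPDK's RCU library: the slot-indexed table, session_publish(), and session_lookup() are hypothetical, and safe reclamation of replaced entries (what a quiescent-state scheme provides) is not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 8u

struct session {
    uint32_t nat_ip;
    uint16_t nat_port;
    uint16_t out_port;
};

static struct session pool[SLOTS];              /* backing storage */
static _Atomic(struct session *) table[SLOTS];  /* published pointers, NULL = empty */

/* Writer: fill every field first, then publish with a release store so a
 * reader that sees the pointer also sees the initialized fields. */
int session_publish(unsigned slot, uint32_t nat_ip,
                    uint16_t nat_port, uint16_t out_port)
{
    if (slot >= SLOTS)
        return -1;
    struct session *s = &pool[slot];
    s->nat_ip = nat_ip;
    s->nat_port = nat_port;
    s->out_port = out_port;
    atomic_store_explicit(&table[slot], s, memory_order_release);
    return 0;
}

/* Reader: the acquire load pairs with the writer's release store. */
const struct session *session_lookup(unsigned slot)
{
    if (slot >= SLOTS)
        return NULL;
    return atomic_load_explicit(&table[slot], memory_order_acquire);
}

/* Single-threaded walk-through: miss, publish, hit. Returns the NAT port. */
int publish_demo(void)
{
    assert(session_lookup(3) == NULL);
    if (session_publish(3, 0x0A000009u, 40000, 1) != 0)
        return -1;
    const struct session *s = session_lookup(3);
    return s ? (int)s->nat_port : -1;
}
```

Republishing into a slot that readers may still hold would mutate an entry under them; a real table would allocate a fresh entry and free the old one only after all readers have passed a quiescent point.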

Key Data Structures — Session Table

Field / Design | Purpose | Hot-path note
5-tuple key | src/dst IP, src/dst port, protocol | Exact key for rte_hash or cuckoo table lookup
cached action | output port, NAT tuple, VLAN/VXLAN rewrite, QoS class | Avoids rerunning ACL and route lookup per packet
timeout / last_seen | idle aging and garbage collection | Update lazily or per-burst to avoid cache-line writes every packet
per-lcore table | one table per PMD lcore | No locks; flow stickiness depends on RSS or software steering
global RCU table | shared state across lcores | Readers are lock-free; writers build entry then publish pointer
timer wheel | bucket sessions by expiry tick | Aging cost is spread across ticks instead of full-table scans
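
The timer wheel row can be made concrete with a small single-level wheel. This is a sketch with assumptions, not the rte_timer implementation: session ids index a fixed array, each id is scheduled at most once at a time, and timeouts must be shorter than one wheel rotation. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define WHEEL_SLOTS 8u
#define MAX_SESS    16

struct wheel {
    int head[WHEEL_SLOTS];  /* first session id in each bucket, -1 if empty */
    int next[MAX_SESS];     /* per-session link to the next id in its bucket */
    unsigned now;           /* current tick */
};

void wheel_init(struct wheel *w)
{
    for (unsigned i = 0; i < WHEEL_SLOTS; i++)
        w->head[i] = -1;
    for (int i = 0; i < MAX_SESS; i++)
        w->next[i] = -1;
    w->now = 0;
}

/* Schedule session `id` to expire `ticks` ticks from now (1..WHEEL_SLOTS-1). */
int wheel_add(struct wheel *w, int id, unsigned ticks)
{
    if (id < 0 || id >= MAX_SESS || ticks == 0 || ticks >= WHEEL_SLOTS)
        return -1;
    unsigned b = (w->now + ticks) % WHEEL_SLOTS;
    w->next[id] = w->head[b];   /* push onto the bucket's list */
    w->head[b] = id;
    return 0;
}

/* Advance one tick and collect the ids in the bucket that just came due.
 * Only that one bucket is touched, so cost follows expirations, not table size. */
int wheel_tick(struct wheel *w, int out[MAX_SESS])
{
    assert(out != NULL);
    w->now++;
    unsigned b = w->now % WHEEL_SLOTS;
    int n = 0;
    int id = w->head[b];
    w->head[b] = -1;
    while (id != -1 && n < MAX_SESS) {
        int nx = w->next[id];
        w->next[id] = -1;
        out[n++] = id;
        id = nx;
    }
    return n;
}
```

A datapath would call wheel_tick() once per tick from a housekeeping step and re-add sessions whose last_seen timestamp shows recent traffic, which is one way to combine the wheel with lazy timestamp updates.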

Core Mechanism — Fast/Slow Path Walkthrough

Background: A NAT gateway receives a new TCP flow. Running ACL, route lookup, and NAT allocation for every packet would waste the line-rate budget.

Plan: 1) hash the packet 5-tuple, 2) on hit apply the cached action, 3) on miss send the first packet through ACL, LPM, and NAT allocation, 4) publish the session entry, 5) send later packets through the fast path.

Example: packet 1 for 10.0.0.1:12345 → 203.0.113.8:443 misses, gets NATed to 10.0.0.9:40000, and creates a session. Packet 2 hashes to the same entry and only applies the cached rewrite and output port.

Minimal C Demo — Session Miss Then Fast-Path Hit
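
The miss-then-hit walkthrough above can be sketched as a small open-addressing table in plain C. Assumptions are labeled: the FNV-style hash and linear probing stand in for rte_hash, the slow path is reduced to "allocate the next NAT port on 10.0.0.9", there is no deletion or aging, and process_packet() and the struct names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64u   /* power of two so the index can be masked */

struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

struct session {
    struct flow_key key;
    uint32_t nat_ip;
    uint16_t nat_port;
    uint16_t out_port;
    int      in_use;
};

struct session_table {
    struct session slots[TABLE_SIZE];
    uint16_t next_nat_port;
    unsigned hits, misses;
};

static uint32_t mix(uint32_t h, uint32_t v) { return (h ^ v) * 16777619u; }

static uint32_t key_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;
    h = mix(h, k->src_ip);
    h = mix(h, k->dst_ip);
    h = mix(h, ((uint32_t)k->src_port << 16) | k->dst_port);
    return mix(h, k->proto);
}

/* Compare field by field so struct padding never affects equality. */
static int key_equal(const struct flow_key *a, const struct flow_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}

/* Linear probe: returns the matching slot, the first free slot, or NULL if full. */
static struct session *probe(struct session_table *t, const struct flow_key *k)
{
    uint32_t start = key_hash(k);
    for (uint32_t n = 0; n < TABLE_SIZE; n++) {
        struct session *s = &t->slots[(start + n) & (TABLE_SIZE - 1)];
        if (!s->in_use || key_equal(&s->key, k))
            return s;
    }
    return NULL;
}

void table_init(struct session_table *t, uint16_t first_nat_port)
{
    memset(t, 0, sizeof(*t));
    t->next_nat_port = first_nat_port;
}

/* Returns 1 on a fast-path hit, 0 when the slow path created the session,
 * and -1 if the table is full. */
int process_packet(struct session_table *t, const struct flow_key *k,
                   const struct session **out)
{
    struct session *s = probe(t, k);
    if (s == NULL)
        return -1;
    if (s->in_use) {                 /* fast path: reuse the cached action */
        t->hits++;
        *out = s;
        return 1;
    }
    t->misses++;                     /* slow path stand-in: ACL, LPM, NAT */
    s->key = *k;
    s->nat_ip = 0x0A000009u;         /* 10.0.0.9, as in the example above */
    s->nat_port = t->next_nat_port++;
    s->out_port = 1;
    s->in_use = 1;
    *out = s;
    return 0;
}
```

With table_init(&t, 40000), the first call for 10.0.0.1:12345 to 203.0.113.8:443 returns 0 and allocates port 40000; the second call with the same key returns 1 and hands back the same entry.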


18. § 9.17 — QoS in DPDK

Overview — Shape, Police, and Schedule Bursts

QoS in a DPDK datapath usually has two layers. rte_meter colors packets with token bucket policers such as srTCM or trTCM. rte_sched then schedules traffic hierarchically: port, subport, pipe, traffic class, and queue. This lets one process enforce tenant, VM, or service-class limits before packets reach the TX ring.

Token Bucket Policing

A token bucket adds tokens at a configured rate and caps them at the burst size. A packet consumes tokens equal to its length. srTCM and trTCM turn that check into colors: green packets conform, yellow packets exceed committed rate but may pass at lower priority, and red packets violate policy and are dropped or remarked.

Key Data Structures — Meter and Scheduler

Object | Fields / API | Purpose
rte_meter_srtcm | CIR, CBS, EBS, color-aware mode | Single-rate three-color marker
rte_meter_trtcm | CIR/PIR plus committed/peak burst sizes | Two-rate policing for committed and peak traffic
rte_sched_port | subports, pipes, traffic classes, queues | Hierarchical scheduler state
pipe profile | rate, TC period, queue sizes, WFQ weights | Per-tenant or per-VM shaping policy
rte_sched_port_enqueue/dequeue | enqueue mbufs, dequeue eligible packets | Backpressure and scheduling before tx_burst

Core Mechanism — Per-VM Rate Limit

Background: A host has many VMs behind OVS-DPDK. One VM must not consume the entire 25G port during a burst.

Plan: 1) classify packet to VM or tenant, 2) run its token bucket, 3) drop or mark red packets, 4) enqueue green/yellow packets into the scheduler pipe, 5) let rte_sched choose eligible packets for TX.

Example: VM A has a 1 Gbps committed rate and 64 KB burst. A 9 KB packet passes while tokens are available. Once the bucket drains, later packets turn yellow or red until tokens refill.

Minimal C Demo — Token Bucket Colors
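
The coloring rule can be sketched as a color-blind single-rate three-color marker in plain C. This follows the usual srTCM description (committed bucket filled first, overflow spills into the excess bucket) but is not the rte_meter code: struct srtcm, srtcm_init(), and srtcm_check() are hypothetical, time is passed in as nanoseconds, and the 16 KB excess burst in the usage note is an assumed value.

```c
#include <assert.h>
#include <stdint.h>

enum color { GREEN, YELLOW, RED };

struct srtcm {
    uint64_t cir_Bps;   /* committed information rate, bytes per second */
    uint64_t cbs, ebs;  /* committed and excess burst sizes, bytes */
    uint64_t tc, te;    /* tokens currently in each bucket */
    uint64_t last_ns;   /* time of the last refill */
};

void srtcm_init(struct srtcm *m, uint64_t cir_Bps, uint64_t cbs,
                uint64_t ebs, uint64_t now_ns)
{
    m->cir_Bps = cir_Bps;
    m->cbs = cbs;
    m->ebs = ebs;
    m->tc = cbs;        /* both buckets start full */
    m->te = ebs;
    m->last_ns = now_ns;
}

/* Add tokens for the elapsed time: fill C up to CBS, spill the rest into E. */
static void srtcm_refill(struct srtcm *m, uint64_t now_ns)
{
    assert(now_ns >= m->last_ns);   /* caller supplies a monotonic clock */
    uint64_t elapsed = now_ns - m->last_ns;
    /* Cap at 1 s to keep the multiply in range; assumes one second of CIR
     * is at least CBS + EBS, so both buckets are full by then anyway. */
    if (elapsed > 1000000000ull)
        elapsed = 1000000000ull;
    uint64_t add = elapsed * m->cir_Bps / 1000000000ull; /* fractions dropped */
    m->last_ns = now_ns;

    uint64_t room = m->cbs - m->tc;
    if (add <= room) {
        m->tc += add;
        return;
    }
    m->tc = m->cbs;
    add -= room;
    m->te = (m->te + add > m->ebs) ? m->ebs : m->te + add;
}

/* Color-blind check: green from C, else yellow from E, else red. */
enum color srtcm_check(struct srtcm *m, uint64_t now_ns, uint32_t pkt_len)
{
    srtcm_refill(m, now_ns);
    if (m->tc >= pkt_len) {
        m->tc -= pkt_len;
        return GREEN;
    }
    if (m->te >= pkt_len) {
        m->te -= pkt_len;
        return YELLOW;
    }
    return RED;
}
```

With the numbers from the example (1 Gbps is 125,000,000 bytes/s, CBS 65,536 bytes) plus the assumed EBS of 16,384 bytes, seven back-to-back 9,000-byte packets are green, the eighth is yellow, and the ninth is red. After 100 µs the committed bucket has regained 12,500 bytes, so the next packet is green again.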


19. § 9.18 — DPDK Bonding & Link Aggregation

Overview — One Logical Port Over Multiple NICs

rte_eth_bond exposes multiple physical ports as one logical ethdev. Applications call normal ethdev APIs on the bond port, while the bonding PMD chooses a slave port according to the configured mode. Active-backup is common for redundancy; 802.3ad/LACP is common when the upstream switch participates in link aggregation.

Key Modes

Mode | Behavior | Use case
round-robin | Transmit packets across slaves in order | Lab aggregation; can reorder flows
active-backup | One active slave; fail over when link goes down | Redundancy without switch hashing
balance XOR | Hash packet fields to select slave | Flow-stable load distribution
802.3ad / LACP | Negotiated aggregation with upstream switch | Production bandwidth plus redundancy
broadcast | Send every packet on every slave | Special redundancy cases, expensive

Core Mechanism — Active-Backup Failover

Background: A virtual switch needs to survive one NIC or cable failure without changing the application datapath.

Plan: 1) create a bond ethdev, 2) add two slave ports, 3) set a primary slave, 4) poll link state, 5) switch active slave on failure, 6) keep the application using the same logical port id.

Example: port 0 is active and port 1 is standby. If port 0 link drops, the bond PMD marks port 1 active. The app still transmits to bond port 7; only the bond PMD changes which physical TX queue receives descriptors.
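
The failover decision can be sketched as a small state update in plain C. This is a simulation, not the bonding PMD: struct bond and both functions are hypothetical, link state is set by hand instead of being polled from the NIC, and returning to the primary as soon as its link is back is a policy assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MEMBERS 4

struct bond {
    int      n_members;
    uint16_t port_id[MAX_MEMBERS];  /* physical port ids behind the bond */
    int      link_up[MAX_MEMBERS];  /* last polled link state per member */
    int      primary;               /* index of the preferred member */
    int      active;                /* index carrying traffic, -1 if none */
};

/* Re-evaluate which member is active: prefer the primary, otherwise the
 * first member whose link is up. Returns the active index or -1. */
int bond_select_active(struct bond *b)
{
    assert(b->n_members > 0 && b->n_members <= MAX_MEMBERS);
    assert(b->primary >= 0 && b->primary < b->n_members);
    if (b->link_up[b->primary]) {
        b->active = b->primary;
        return b->active;
    }
    for (int i = 0; i < b->n_members; i++) {
        if (b->link_up[i]) {
            b->active = i;
            return i;
        }
    }
    b->active = -1;
    return -1;
}

/* Physical port that receives TX descriptors; the caller never needs to
 * know which member it is. Returns -1 when every link is down. */
int bond_tx_port(const struct bond *b)
{
    return b->active < 0 ? -1 : (int)b->port_id[b->active];
}
```

With ports 0 and 1 as members and port 0 as primary, bond_tx_port() yields 0 while both links are up, 1 after port 0 goes down, and -1 if both are down. The application's logical port id never appears in this logic, which is the point of the abstraction.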

20. § 9.19 — OVS-DPDK: Open vSwitch with DPDK Datapath

Overview — OpenFlow Control Plane, DPDK Packet Path

OVS-DPDK keeps the OVS control plane but replaces the kernel datapath with userspace PMD threads. ovs-vswitchd owns bridges and flow programming, while netdev-dpdk ports poll physical NICs and vhost-user sockets with DPDK. The fast path is a cascade of caches: EMC, datapath classifier, and finally the full OpenFlow pipeline.

Port Types — Physical, VM, Patch, Internal, Tunnel

OVS-DPDK bridges combine several port types: dpdk for physical NICs, dpdkvhostuser for VM virtio backends, patch ports between bridges, internal ports for host networking, and tunnel ports for VXLAN or Geneve encap/decap in userspace.

Key Data Structures — OVS Datapath Caches

Layer | Role | Performance note
EMC | Exact Match Cache per PMD thread | Fastest path for repeated 5-tuples; avoids tuple-space search
dpcls | Datapath classifier / tuple-space search | Matches megaflows with masks and cached actions
ofproto | Full OpenFlow pipeline | Slow path for first packet or cache miss; installs megaflow
megaflow | Wildcarded cached flow | Compresses many exact flows under one masked rule
PMD thread | Polls RX queues and vhost-user ports | Pin with pmd-cpu-mask and align queues to cores

Core Mechanism — OVS-DPDK Flow Lookup

Background: A VM packet enters a vhost-user port. OVS must decide whether to forward to another VM, a physical NIC, a tunnel, or the OpenFlow slow path.

Plan: 1) PMD receives a burst, 2) check EMC for exact-flow hit, 3) on miss check dpcls megaflows, 4) on miss run ofproto, 5) install or update a megaflow, 6) execute actions such as output, drop, recirculate, or VXLAN encap.

Example: the first packet from VM A to VM B misses EMC and dpcls, so ofproto runs the OpenFlow table and installs a megaflow. The next packet with matching masked fields hits dpcls; repeated packets from the same exact tuple hit EMC.
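
The three-level cascade can be sketched as a toy lookup in plain C. This is not OVS code and makes several simplifying assumptions: the key is reduced to three fields, dpcls is a linear scan over masked rules instead of per-mask subtables, the EMC is filled on every dpcls hit (OVS inserts probabilistically), and the slow path is a hypothetical policy that forwards on destination IP alone.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EMC_SIZE      16u  /* power of two, direct-mapped */
#define MAX_MEGAFLOWS 8

struct pkt_key { uint32_t src_ip, dst_ip; uint16_t dst_port; };
struct emc_entry { struct pkt_key key; int out_port; int used; };
struct megaflow  { struct pkt_key mask, match; int out_port; };

struct datapath {
    struct emc_entry emc[EMC_SIZE];
    struct megaflow  dpcls[MAX_MEGAFLOWS];
    int n_megaflows;
};

enum lookup_result { HIT_EMC, HIT_DPCLS, UPCALL };

static int key_eq(const struct pkt_key *a, const struct pkt_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->dst_port == b->dst_port;
}

static uint32_t key_hash(const struct pkt_key *k)
{
    return (k->src_ip * 2654435761u) ^ (k->dst_ip * 40503u) ^ k->dst_port;
}

static int megaflow_matches(const struct megaflow *m, const struct pkt_key *k)
{
    return (k->src_ip & m->mask.src_ip) == m->match.src_ip &&
           (k->dst_ip & m->mask.dst_ip) == m->match.dst_ip &&
           (k->dst_port & m->mask.dst_port) == m->match.dst_port;
}

/* Hypothetical OpenFlow policy: match destination IP only, wildcard the rest. */
static struct megaflow slow_path(const struct pkt_key *k)
{
    struct megaflow m;
    memset(&m, 0, sizeof(m));
    m.mask.dst_ip = 0xFFFFFFFFu;
    m.match.dst_ip = k->dst_ip;
    m.out_port = (int)(k->dst_ip & 0xFFu);
    return m;
}

enum lookup_result dp_lookup(struct datapath *dp, const struct pkt_key *k,
                             int *out_port)
{
    assert(out_port != NULL);
    /* 1) EMC: exact match on the full key. */
    struct emc_entry *e = &dp->emc[key_hash(k) & (EMC_SIZE - 1)];
    if (e->used && key_eq(&e->key, k)) {
        *out_port = e->out_port;
        return HIT_EMC;
    }
    /* 2) dpcls: masked match; on a hit, cache the exact key in the EMC. */
    for (int i = 0; i < dp->n_megaflows; i++) {
        if (megaflow_matches(&dp->dpcls[i], k)) {
            *out_port = dp->dpcls[i].out_port;
            e->key = *k;
            e->out_port = *out_port;
            e->used = 1;
            return HIT_DPCLS;
        }
    }
    /* 3) Upcall: run the slow path and install a megaflow if there is room. */
    struct megaflow m = slow_path(k);
    if (dp->n_megaflows < MAX_MEGAFLOWS)
        dp->dpcls[dp->n_megaflows++] = m;
    *out_port = m.out_port;
    return UPCALL;
}
```

Sending the same key three times through a zeroed datapath returns UPCALL, then HIT_DPCLS, then HIT_EMC, matching the example above. A different source talking to the same destination skips the upcall and hits the megaflow directly.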

21. Kernel Source Pointers

File / Symbol | What it contains
lib/eal/linux/eal.c: rte_eal_init() | The 10-step EAL initialization sequence; calls sub-functions below
lib/eal/linux/eal_memory.c: rte_eal_hugepage_init() | mmap() hugetlbfs pages, build memseg list per NUMA socket
lib/eal/common/eal_common_lcore.c | Lcore thread creation, CPU affinity setup (pthread_setaffinity_np)
drivers/bus/pci/linux/pci.c: rte_pci_scan() | Enumerate /sys/bus/pci/devices/, match PCI IDs to PMD table
drivers/net/ixgbe/ixgbe_rxtx.c: ixgbe_recv_pkts() | ixgbe PMD rx_burst: DD bit scan, mbuf harvest, ring refill, doorbell
drivers/net/ixgbe/ixgbe_rxtx.c: ixgbe_xmit_pkts() | ixgbe PMD tx_burst: fill TX descs, TDT doorbell, tx_free_thresh cleanup
drivers/net/mlx5/mlx5_rx.c: mlx5_rx_burst() | Mellanox ConnectX PMD rx_burst using Completion Queue (CQ) model
lib/ethdev/rte_ethdev.h: rte_eth_rx_burst() (inline) | Dispatch: calls port->rx_pkt_burst function pointer (per-PMD hot path)
lib/ethdev/rte_ethdev.c: rte_eth_dev_rss_hash_update() | RSS hash configuration and RETA interaction through ethdev API
lib/ethdev/rte_flow.c: rte_flow_create() | Generic flow rule validation, creation, destruction API
lib/mempool/rte_mempool.h: rte_mempool_get_bulk() (inline) | Mempool cache refill/drain and central pool operations
lib/mbuf/rte_mbuf.c: rte_pktmbuf_pool_create() | Packet mbuf pool creation and mbuf object initialization
lib/eal/common/rte_malloc.c: rte_malloc_socket() | NUMA-aware memory allocation helpers for per-socket data
lib/eal/common/eal_common_thread.c: rte_socket_id() | Current lcore socket lookup used for locality validation
lib/ethdev/rte_ethdev.c: rte_eth_dev_socket_id() | NIC port NUMA socket lookup from ethdev metadata
lib/ring/rte_ring.c: rte_ring_create() | Ring allocation, size/mask setup, producer/consumer sync mode
lib/ring/rte_ring_elem_pvt.h: __rte_ring_do_enqueue_elem() | MP enqueue logic: reserve head, copy objects, publish tail
lib/eal/x86/include/rte_prefetch.h | Architecture-specific prefetch helpers mapping to CPU instructions
lib/lpm/rte_lpm.c: rte_lpm_add(), rte_lpm_lookup() | DIR-24-8 IPv4 longest prefix match table management and lookup
lib/lpm/rte_lpm6.c: rte_lpm6_lookup() | IPv6 multi-stride longest prefix match implementation
lib/acl/rte_acl.c: rte_acl_build(), rte_acl_classify() | ACL rule compilation and batch packet classification dispatch
lib/hash/rte_cuckoo_hash.c: rte_hash_lookup_data() | Cuckoo hash exact-match lookup for flow/session keys
lib/ethdev/rte_flow.c: rte_flow_create() | Generic hardware flow rule validation and creation
drivers/net/ixgbe/rte_pmd_ixgbe.c | Intel ixgbe PF helper APIs for VF MAC/VLAN controls
lib/ethdev/rte_ethdev.c: rte_eth_dev_info_get() | Advertised RX/TX offload capability discovery
lib/mbuf/rte_mbuf_core.h | mbuf offload flags, header length fields, packet metadata
lib/eal/x86/include/rte_vect.h | x86 SIMD vector typedefs and runtime CPU feature integration
lib/eal/x86/include/rte_memcpy.h | SIMD-optimized memory copy routines used by DPDK fast paths
drivers/net/virtio/virtqueue.h | Virtio ring data structures and queue helpers used by the virtio PMD
drivers/net/virtio/virtio_rxtx.c: virtio_recv_pkts() | Virtio PMD RX path walking virtqueue descriptors
lib/vhost/vhost_user.c | vhost-user protocol messages, memory table setup, vring configuration
lib/vhost/virtio_net.c: rte_vhost_dequeue_burst(), rte_vhost_enqueue_burst() | DPDK vhost datapath burst APIs for VM packet I/O
drivers/vdpa/* | vDPA device drivers that offload virtio datapath to hardware
lib/pipeline/rte_pipeline.c: rte_pipeline_run() | DPDK pipeline framework input ports, tables, and output ports
examples/l2fwd/main.c | Canonical run-to-completion forwarding loop
examples/ip_pipeline/ | DPDK pipeline-style datapath example
lib/eal/common/eal_common_proc.c | Multi-process communication primitives used by primary and secondary processes
lib/eal/common/eal_common_memzone.c: rte_memzone_lookup() | Named shared memory lookup for persistent state across processes
lib/hash/rte_cuckoo_hash.c | Exact-match session table mechanics used by fast-path flow caches
lib/rcu/rte_rcu_qsbr.c | Quiescent-state based RCU used for lock-free reader / controlled writer patterns
lib/timer/rte_timer.c | Timer management useful for session timeout and aging wheels
lib/meter/rte_meter.c | srTCM and trTCM token bucket color marking
lib/sched/rte_sched.c | Hierarchical QoS scheduler: port, subport, pipe, traffic class, queue
drivers/net/bonding/rte_eth_bond_pmd.c | Bonding PMD implementation and active-backup / balance mode datapath
lib/netdev-dpdk.c (OVS source) | OVS-DPDK netdev provider, PMD RX/TX, vhost-user integration
lib/dpif-netdev.c (OVS source) | OVS userspace datapath, EMC, dpcls lookup, PMD thread loop
lib/dpif-netdev-private*.h (OVS source) | OVS datapath cache and classifier data structures
drivers/bus/pci/linux/pci_uio.c | igb_uio BAR mmap: maps NIC register space into userspace process
drivers/bus/pci/linux/pci_vfio.c | VFIO device open, IOMMU group handling, DMA mapping via ioctl

22. Interview Prep

Question | Concise Answer
Why does kernel networking fail above ~1 Mpps? Name 5 bottlenecks. | 1) IRQ per packet (~2–5 µs handler + softirq). 2) sk_buff slab alloc per packet (~200 ns, cache miss). 3) copy_to_user() memcpy polluting cache. 4) recv() syscall context switch (~100–200 ns). 5) Lock contention in netfilter / routing table / socket hash under high PPS. At 10 GbE with 64-byte frames, each packet gets only 67 ns, and the kernel stack exceeds that budget.
Walk through rte_eal_init(): all 10 steps. | 1) Parse CLI (--lcores, --socket-mem). 2) Load PMD plugin .so files. 3) Read CPU topology from /proc/cpuinfo, build core+NUMA map. 4) mmap()+mlock() huge pages, build memseg list per NUMA socket. 5) rte_memzone_init(): named regions in huge pages. 6) Create one pthread per lcore, pin via pthread_setaffinity_np(). 7) Enumerate /sys/bus/pci/devices/, match PCI IDs to PMD table. 8) rte_pci_probe() → PMD eth_dev_init() for each matched NIC. 9) Start service cores (timer, interrupt). 10) rte_eal_mp_remote_launch(): worker functions start polling.
What is the DD bit and what is the doorbell in DPDK PMD? | DD (Descriptor Done): a status bit in each RX or TX descriptor that the NIC sets to 1 when it has finished with that slot (RX: packet DMA-written; TX: packet transmitted). The PMD polls DD instead of waiting for an interrupt. Doorbell: a write to a NIC BAR register (RDT for RX, TDT for TX) telling the NIC the new tail pointer, i.e., how many new buffers the PMD has made available. In DPDK, the doorbell is batched once per rx_burst/tx_burst call to amortize the PCIe transaction cost.
What is the difference between UIO and VFIO? When should you use VFIO? | UIO (igb_uio.ko): exposes /dev/uioN; DPDK mmap()s BAR directly. No IOMMU: the DMA address space is physical memory, so a bug lets the NIC DMA anywhere. Requires root. VFIO (vfio-pci.ko): groups the device by IOMMU domain; DPDK registers hugepage memory with the IOMMU via VFIO_MAP_DMA ioctl; the NIC can only DMA into those regions. Preferred for production (SR-IOV VFs, containers, secure multi-tenant environments). Use VFIO whenever the system has an IOMMU and you care about isolation.
Why does rte_eth_rx_burst() process packets in batches of 32? | Three reasons: 1) Amortize doorbell write: one PCIe RDT update per 32 packets costs ~3 ns/pkt vs ~100 ns/pkt for per-packet updates. 2) Prefetch pipeline: prefetching 3–4 mbufs ahead while processing the current one hides DRAM latency; 32 fits without L1 overflow. 3) SIMD alignment: 32 × 64-bit descriptors = 256 bytes, which is eight 256-bit AVX2 registers (or four 512-bit AVX-512 registers) to scan for the DD bit in one pass.
How does RSS map a packet to an RX queue? | The NIC hashes selected header fields, usually the 5-tuple, with a Toeplitz RSS key. The low bits of that hash index the RETA, and the RETA entry contains the RX queue id. Software can rebalance by changing RETA entries. The same flow maps to the same queue, preserving packet order inside the flow.
When do you use rte_flow instead of RSS? | Use RSS for broad load distribution across queues. Use rte_flow when you need exact steering or actions: send tenant X to queue 7, drop a port, mark packets, count matches, or offload tunnel encap/decap. rte_flow rules are validated against NIC capabilities; unsupported patterns may fail or fall back to software depending on PMD.
Why does DPDK require huge pages and mempools? | Huge pages reduce TLB pressure and provide pinned DMA memory registered with VFIO/IOMMU. Mempools avoid malloc/free on the hot path by pre-allocating fixed-size mbufs. Per-lcore caches serve most allocations without atomics; the central ring is touched only on bulk refill/drain.
Explain rte_mbuf fields that matter in RX/TX. | buf_addr is the virtual base, buf_iova is the DMA address used in descriptors, data_off points to packet start after headroom, data_len is bytes in this segment, pkt_len is total packet length across chained segments, next links multi-segment packets, ol_flags carries checksum/TSO/VLAN offload state, and packet_type carries parser results from the NIC or PMD.
What is NUMA-aware programming in DPDK? What happens if you violate it? | Place the NIC, mempool, descriptor rings, polling lcore, and hot per-lcore data on the same socket. Use rte_eth_dev_socket_id(port), rte_socket_id(), and socket_id arguments to enforce it. If a socket-0 NIC is polled by a socket-1 lcore, packet bytes and mbuf metadata cross UPI/QPI every packet, often causing a 30–50% throughput drop.
Explain the rte_ring multi-producer enqueue algorithm. | A producer loads prod.head and cons.tail, checks free space, computes new_head, then CAS-es prod.head from old_head to new_head to reserve slots. It copies objects into ring[index & mask], issues a write barrier, spins until prod.tail equals old_head so predecessors publish first, then stores prod.tail = new_head. That split between head reservation and tail publication preserves FIFO order without a mutex.
Why are prod and cons fields in rte_ring cache-line aligned? | Producers frequently write prod.head/prod.tail and consumers frequently write cons.head/cons.tail. If those fields share one cache line, every enqueue/dequeue bounces the line between cores. __rte_cache_aligned separates them so producer writes do not invalidate consumer-owned cache lines.
How does DPDK use prefetch and DDIO in the receive path? | DDIO lets the NIC DMA packet bytes into LLC instead of only DRAM. The PMD then uses rte_prefetch0() a few mbufs ahead, commonly 3–4, so metadata and packet data are pulled into L1 before parsing. The goal is to overlap cache miss latency with useful work on current packets.
What is DIR-24-8 in rte_lpm, and why is it fast? | DIR-24-8 uses the top 24 IPv4 bits as a direct table index. Prefixes /0 through /24 return in one memory access. More specific /25 through /32 prefixes use a second 256-entry tbl8 group indexed by the low 8 bits. That makes common route lookup one or two dependent loads instead of walking a long trie.
Compare rte_acl, rte_lpm, and rte_hash. | rte_lpm is longest-prefix route lookup, usually destination IP to next hop. rte_acl is ordered multi-field rule classification over fields like IPs, ports, protocol, and priority. rte_hash is exact-match lookup for full keys such as a 5-tuple session. In a datapath, a miss may go ACL -> LPM -> NAT -> create exact session in rte_hash.
How does SR-IOV differ from multi-queue on one PF? | Multi-queue gives one PCI function many RX/TX queues, usually owned by one DPDK process. SR-IOV creates separate PCI Virtual Functions, each with its own BDF, queues, and VFIO/IOMMU isolation. The PF configures policy such as VF MAC/VLAN filters, while each VF can be passed to a VM or container for near-native DMA.
How do checksum and TSO offloads work in DPDK? | The application enables port offload capabilities, then marks each mbuf with ol_flags and sets l2_len/l3_len/l4_len. For TSO it also sets tso_segsz. The PMD writes TX descriptors containing those fields, and the NIC segments large TCP payloads into MTU-sized frames and computes IPv4/TCP checksums per frame.
Where does SIMD help in DPDK? | SIMD helps wherever the same operation is repeated across a burst: ACL DFA state transitions, hash bucket signature comparison, LPM batch lookup, checksum arithmetic, descriptor DD-bit checks, and rte_memcpy. DPDK probes CPU flags at runtime and selects SSE4.2, AVX2, AVX-512, or scalar implementations.
Explain virtio split virtqueue and vhost-user. | A split virtqueue has a descriptor table, an available ring written by the guest driver, and a used ring written by the backend. vhost-user moves the backend into userspace: QEMU sends vring addresses, memory tables, and eventfds over a Unix socket, while DPDK reads and writes guest buffers through shared hugepage memory using rte_vhost_dequeue_burst() and rte_vhost_enqueue_burst().
How is vDPA different from vhost-user? | vhost-user uses a software DPDK backend to walk virtio descriptors. vDPA keeps the guest-facing virtio interface but offloads the data path to hardware: the NIC DMA-reads and writes guest virtio rings directly, while software handles control plane, feature negotiation, and migration hooks.
Compare run-to-completion and pipeline models. | R2C keeps rx_burst, parsing, classification, modification, and tx_burst on one lcore. It is simple and cache-local, best for forwarding/NAT/firewall fast paths. Pipeline splits stages across lcores connected by rte_ring; it adds handoff latency and cache misses but lets expensive stages like crypto or DPI scale independently.
How would you implement a DPDK hot upgrade? | Store durable state such as sessions and flow tables in named shared memzones with schema versions and generation numbers. Start the new binary as a secondary attached to the same --file-prefix hugepages, validate and import state, warm caches and hardware flow rules, stop admission on the old process, drain rings, swap traffic ownership, merge counters, then exit the old process.
Explain DPDK session fast path and slow path. | The fast path hashes the packet 5-tuple into a per-lcore or RCU-protected session table. On hit it applies cached actions such as NAT rewrite, output port, and QoS class, then transmits. On miss the first packet goes through ACL, route lookup, NAT allocation, and policy; the result is published as a session so later packets avoid the expensive work.
How would you age sessions without hurting packet throughput? | Avoid writing shared timeout fields on every packet. Use lazy refresh, per-lcore counters, or timestamp updates once per burst/window. Put sessions into a timer wheel or expiry buckets so cleanup scans only the current bucket. With global tables, delete through RCU or a quiescent-state scheme so readers never see freed entries.
How do rte_meter and rte_sched differ? | rte_meter is a policer/marker: it uses token bucket logic such as srTCM or trTCM to color packets green, yellow, or red. rte_sched is a hierarchical scheduler: packets are enqueued into port/subport/pipe/traffic-class/queue levels, then dequeued according to rate, priority, and WFQ rules before tx_burst.
When would you use DPDK bonding active-backup versus 802.3ad? | Use active-backup when you need simple NIC redundancy without depending on switch-side aggregation; only one slave carries traffic and failover switches to the standby. Use 802.3ad/LACP when the upstream switch participates and you want both redundancy and aggregate bandwidth with flow-stable hashing.
Describe the OVS-DPDK lookup pipeline. | A PMD thread receives a burst from a dpdk or dpdkvhostuser port. It first checks EMC, the per-PMD exact match cache. On miss it checks dpcls tuple-space megaflows. On dpcls miss it runs the ofproto/OpenFlow slow path, computes actions, and installs or updates a megaflow. Later exact repeated tuples hit EMC; related wildcarded traffic hits dpcls.
What OVS-DPDK tuning knobs matter in production? | Pin PMD threads with other_config:pmd-cpu-mask, align RX queues with options:n_rxq and NUMA locality, use huge pages and isolated CPUs, tune EMC insertion probability for flow churn, keep vhost-user ports on local NUMA nodes, and verify queue-to-PMD assignment with ovs-appctl dpif-netdev/pmd-rxq-show.