Part XIII — Hardware

§ 13 PCIe, BAR, DMA/IOMMU, and NUMA Topology

How Linux talks to PCIe devices: packets on the PCIe fabric, MMIO register access, safe DMA, VFIO, and NUMA placement.

1. Overview

Hardware communication sits below drivers and high-performance frameworks such as DPDK. The CPU reaches devices through PCIe BARs, devices reach memory through DMA, and the IOMMU decides which DMA addresses are valid.

2. Key Data Structures

PCIe Topology

A PCIe system is a tree rooted at a CPU root complex. Switches add downstream ports, and endpoints such as NICs, GPUs, and NVMe drives consume configuration space, BAR windows, interrupts, and DMA transactions.

Transaction Layer Packet

Every PCIe memory read, write, and completion is carried as a Transaction Layer Packet (TLP). The header tells switches and root ports where to route the packet and lets the requester match each completion to the request that caused it.

| Field | Type / Size | Purpose |
|---|---|---|
| Fmt/Type | header bits | Encodes MRd, MWr, Cpl, CplD, config, or message request. |
| Requester ID | bus:device:function | Identifies the source function so completions can return correctly. |
| Tag | 5, 8, or 10 bits, depending on the enabled tag width | Matches a completion to one outstanding read request. |
| Address | 32 or 64 bits | Target MMIO or memory address. Configuration requests are routed by bus:device:function and register number instead. |
| Length | DW count | Number of 4-byte words (DWs) transferred in the payload. |

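Link training is managed by the Link Training and Status State Machine (LTSSM). A device does not send normal TLPs until the link has detected a partner, trained its lanes, negotiated width, and reached L0. Links that support more than Gen1 speed then step up through the Recovery state.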

| PCIe generation | x1 usable bandwidth | x4 | x8 | x16 |
|---|---|---|---|---|
| Gen1 | ~250 MB/s | ~1 GB/s | ~2 GB/s | ~4 GB/s |
| Gen2 | ~500 MB/s | ~2 GB/s | ~4 GB/s | ~8 GB/s |
| Gen3 | ~985 MB/s | ~3.94 GB/s | ~7.88 GB/s | ~15.75 GB/s |
| Gen4 | ~1.97 GB/s | ~7.88 GB/s | ~15.75 GB/s | ~31.5 GB/s |
| Gen5 | ~3.94 GB/s | ~15.75 GB/s | ~31.5 GB/s | ~63 GB/s |

BAR Register Layout

A Base Address Register exposes device registers as a CPU physical address range. Linux maps that range into kernel virtual space, then the driver uses readl() and writel() to touch registers.

| Kernel object or API | What it represents | Fast-path concern |
|---|---|---|
| struct pci_dev | One PCI function discovered by bus enumeration. | Holds BAR resources, MSI-X vectors, NUMA node, and DMA mask. |
| pci_ioremap_bar() | Maps a BAR into a kernel virtual MMIO pointer. | Use only for registers, not normal RAM access. |
| dma_addr_t | Device-visible bus address returned by the DMA API. | Never substitute CPU virtual or physical addresses. |
| iommu_group | Set of devices sharing one DMA isolation boundary. | VFIO assignment must include the whole group. |

3. Core Mechanism

MMIO Writes and Doorbells

MMIO uses normal-looking CPU load/store instructions, but the target cache policy is device memory. A NIC doorbell write is usually a posted PCIe Memory Write TLP, so the CPU can continue without waiting for a completion.

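Write-combining matters when a driver writes descriptors or other multi-word data into a BAR window and then rings a queue tail register. A write-combining mapping lets the CPU merge several adjacent stores into one PCIe burst, which reduces per-transaction overhead. Keeping the fast path to posted writes also avoids MMIO reads, which stall the CPU until a completion returns.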

Background: A NIC driver has filled four TX descriptors in host memory and must tell the device that queue tail moved from 64 to 68.

Plan: First write packet descriptors into coherent DMA memory. Then issue a write memory barrier so the descriptors are visible before the doorbell. Finally write the tail index to the BAR doorbell register.

| Step | State change | Why it is correct |
|---|---|---|
| 1 | Driver stores descriptor addresses, lengths, and flags into DMA ring slots 64-67. | The device will later DMA-read those cache-coherent ring entries. |
| 2 | Driver orders descriptor writes before the MMIO doorbell. | The NIC must not observe the new tail before descriptors are valid. |
| 3 | Driver writes 68 to the tail register in BAR0. | The posted MWr tells the NIC there are four new descriptors to consume. |
| 4 | NIC DMA-reads descriptors and packet bytes, then transmits packets. | No syscall or CPU copy is needed after the queue is armed. |

DMA Address Translation

A buffer has three identities: CPU virtual address for software, CPU physical address for DRAM, and DMA address for the device. With an IOMMU enabled, the DMA address is an IOVA translated through IOMMU page tables.

The IOMMU blocks devices from reaching memory they were not explicitly mapped to. That protection is why VFIO can safely hand a PCIe function to userspace.

Devices in the same IOMMU group cannot be isolated from each other, often because a bridge between them lacks PCIe Access Control Services (ACS). Assigning only one device from such a group would leave a DMA escape path.

VFIO exposes the group, device regions, IRQ setup, and DMA map calls to userspace while the kernel keeps the IOMMU domain in control.

Example: A DPDK process maps 2 MB of huge-page memory at IOVA 0x10000000. The NIC RX descriptor receives that IOVA, not a C pointer. When a packet arrives, the NIC writes to IOVA 0x10000400, VT-d translates it to the pinned huge-page frame, and invalid IOVAs fault instead of corrupting arbitrary RAM.

NUMA Placement

NUMA topology decides whether a CPU core, memory buffer, and PCIe device are local to the same socket. Packet processing suffers when a core on socket 1 repeatedly touches a mempool allocated on socket 0 for a NIC attached to socket 0.

Linux exposes CPU topology as packages, dies, cores, and logical CPUs. Performance tuning starts by matching IRQs, polling threads, and memory allocation to the PCIe device locality.

| Access | Typical latency | Implication |
|---|---|---|
| L3 cache hit | ~10 ns | Keep hot queue metadata and descriptors cache-resident. |
| Local DRAM | ~60 ns | Allocate RX/TX rings and packet pools on the NIC socket. |
| Remote DRAM | ~120 ns | Remote mempools double load latency and burn inter-socket bandwidth. |

4. Minimal C Demo

PCIe TLP Types — C Demo
DMA Address Spaces — C Demo
NUMA Locality Cost — C Demo

5. Kernel Source Pointers

| Topic | Files / functions | What to inspect |
|---|---|---|
| PCI enumeration | drivers/pci/probe.c, pci_scan_device(), pci_setup_device() | How config space becomes struct pci_dev. |
| BAR mapping | drivers/pci/pci.c, pci_ioremap_bar(); drivers/pci/iomap.c, pci_iomap() | How drivers map device register windows. |
| DMA API | kernel/dma/mapping.c, dma_alloc_attrs(), dma_map_page_attrs() | Coherent allocation versus streaming maps. |
| IOMMU | drivers/iommu/iommu.c, iommu_map(), iommu_attach_device() | Domain attachment and IOVA to physical mappings. |
| VFIO | drivers/vfio/vfio_main.c, drivers/vfio/pci/vfio_pci_core.c | Container, group, device fd, BAR mmap, and IRQ setup. |
| NUMA topology | drivers/base/node.c, include/linux/topology.h | How node and CPU locality are exposed to drivers and sysfs. |

6. Interview Prep

| Question | Concise answer |
|---|---|
| What is a TLP in PCIe? | A Transaction Layer Packet is the PCIe unit for reads, writes, completions, config access, and messages. MRd requests data, MWr posts data, and CplD returns read data. |
| What are BAR registers? | BARs advertise device address windows. Linux assigns physical address ranges and maps them with pci_ioremap_bar() so the driver can access MMIO registers. |
| Why use write-combining for NIC doorbells? | Doorbells are small posted writes. WC can merge adjacent writes into fewer PCIe bursts and avoid expensive read completions on the fast path. |
| Virtual, physical, and bus address: what is the difference? | Virtual is what CPU software dereferences, physical is the DRAM frame, and bus or DMA address is what the device uses. The IOMMU may translate bus address to physical address. |
| Why does DPDK VFIO need the IOMMU? | VFIO lets userspace drive the device, so the kernel needs IOMMU mappings to restrict DMA to pinned memory owned by that process. |
| What is an IOMMU group? | It is the smallest set of devices that can be isolated for DMA. All devices in a group must be assigned together because they may reach each other without IOMMU separation. |
| How do you enforce NUMA locality in DPDK? | Pin lcores to the NIC socket, allocate mempools on the same socket, and verify topology with lscpu, sysfs, or EAL logs. |
| What does numactl --cpunodebind=0 --membind=0 do? | It runs the process on CPUs from node 0 and asks Linux to allocate memory from node 0, keeping compute and memory local. |