§ 13 PCIe, BAR, DMA/IOMMU, and NUMA Topology
How Linux talks to PCIe devices: packets on the PCIe fabric, MMIO register access, safe DMA, VFIO, and NUMA placement.
1. Overview
Hardware communication sits below drivers and high-performance frameworks such as DPDK. The CPU reaches devices through PCIe BARs, devices reach memory through DMA, and the IOMMU decides which DMA addresses are valid.
2. Key Data Structures
PCIe Topology
A PCIe system is a tree rooted at a CPU root complex. Switches fan out into downstream ports, and endpoints such as NICs, GPUs, and NVMe drives expose configuration space and BAR windows, raise interrupts, and issue DMA transactions.
Transaction Layer Packet
Every PCIe memory read, write, and completion is carried as a Transaction Layer Packet (TLP). The header tells switches how to route the packet and lets the requester match each completion to its outstanding request.
| Field | Type / Size | Purpose |
|---|---|---|
| Fmt/Type | header bits | Encodes MRd, MWr, Cpl, CplD, config, or message request. |
| Requester ID | bus:device:function | Identifies the source function so completions can return correctly. |
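| Tag | 5, 8, or 10 bits | Matches a completion to one outstanding non-posted request, such as a read. The base width is 5 bits; Extended Tag gives 8 and 10-Bit Tag gives 10. |
| Address | 32 or 64 bits | Target address for memory and I/O requests. Configuration requests and completions are routed by ID instead. |
| Length | DW count | Payload size, or requested read size, in 4-byte doublewords (DW). |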
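The Fmt/Type and Length fields in the table above can be pulled out of the first header doubleword with a few shifts and masks. The sketch below is illustrative only: the names `tlp_kind`, `tlp_dw0`, and `tlp_decode_dw0` are made up for this example, and it assumes the caller has already assembled header bytes 0 to 3 into a big-endian 32-bit value. It recognizes only memory reads, memory writes, and completions.

```c
#include <stdint.h>

/* Hypothetical classification used only in this sketch. */
enum tlp_kind { TLP_OTHER, TLP_MRD, TLP_MWR, TLP_CPL, TLP_CPLD };

struct tlp_dw0 {
    enum tlp_kind kind;
    unsigned length_dw; /* Length in doublewords; an encoded 0 means 1024 */
};

/* dw0 = header bytes 0..3 with byte 0 in bits 31:24 (assumption of this sketch). */
struct tlp_dw0 tlp_decode_dw0(uint32_t dw0)
{
    unsigned fmt  = (dw0 >> 29) & 0x7;  /* Fmt[2:0]: bit 1 = has data, bit 0 = 4DW header */
    unsigned type = (dw0 >> 24) & 0x1f; /* Type[4:0] */
    unsigned len  = dw0 & 0x3ff;        /* Length[9:0] */
    struct tlp_dw0 out = { TLP_OTHER, len ? len : 1024u };

    if (type == 0x00) {                 /* memory request */
        if (fmt == 0 || fmt == 1)
            out.kind = TLP_MRD;         /* no payload: read request */
        else if (fmt == 2 || fmt == 3)
            out.kind = TLP_MWR;         /* payload present: posted write */
    } else if (type == 0x0a) {          /* completion */
        if (fmt == 0)
            out.kind = TLP_CPL;         /* completion without data */
        else if (fmt == 2)
            out.kind = TLP_CPLD;        /* completion with data */
    }
    return out;
}
```

For example, a first doubleword of `0x60000010` decodes as a memory write with a 4DW (64-bit address) header and a 16-DW payload.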
Link training is managed by the LTSSM. A device does not send normal TLPs until the link has detected a partner, trained lanes, negotiated width and speed, and reached L0.
| PCIe generation | x1 (approx. usable, per direction) | x4 | x8 | x16 |
|---|---|---|---|---|
| Gen1 | ~250 MB/s | ~1 GB/s | ~2 GB/s | ~4 GB/s |
| Gen2 | ~500 MB/s | ~2 GB/s | ~4 GB/s | ~8 GB/s |
| Gen3 | ~985 MB/s | ~3.94 GB/s | ~7.88 GB/s | ~15.75 GB/s |
| Gen4 | ~1.97 GB/s | ~7.88 GB/s | ~15.75 GB/s | ~31.5 GB/s |
| Gen5 | ~3.94 GB/s | ~15.75 GB/s | ~31.5 GB/s | ~63 GB/s |
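The figures in the table follow from the per-lane signaling rate and the line encoding: Gen1 and Gen2 use 8b/10b, Gen3 and later use 128b/130b. The sketch below reproduces that arithmetic. The function names are invented for this example, and the result is an upper bound per direction because it ignores TLP header, DLLP, and flow-control overhead.

```c
/* Per-lane raw signaling rate in GT/s for Gen1..Gen5. */
static const double gen_gts[] = { 2.5, 5.0, 8.0, 16.0, 32.0 };

/* Approximate MB/s for one lane in one direction, after line encoding only. */
double pcie_lane_mbps(int gen)
{
    if (gen < 1 || gen > 5)
        return 0.0;
    /* 8b/10b carries 8 data bits per 10 line bits; 128b/130b carries 128 per 130. */
    double efficiency = (gen <= 2) ? 8.0 / 10.0 : 128.0 / 130.0;
    /* GT/s -> Mbit/s of payload bits -> MB/s. */
    return gen_gts[gen - 1] * 1000.0 * efficiency / 8.0;
}

/* Approximate MB/s for a link of the given width. */
double pcie_link_mbps(int gen, int lanes)
{
    return pcie_lane_mbps(gen) * lanes;
}
```

With these definitions, `pcie_link_mbps(3, 16)` evaluates to about 15754 MB/s, which is the ~15.75 GB/s entry for Gen3 x16.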
BAR Register Layout
A Base Address Register exposes device registers as a CPU physical address range. Linux maps that range into kernel virtual space, then the driver uses readl() and writel() to touch registers.
| Kernel object or API | What it represents | Fast-path concern |
|---|---|---|
| struct pci_dev | One PCI function discovered by bus enumeration. | Holds BAR resources, MSI-X vectors, NUMA node, and DMA mask. |
| pci_ioremap_bar() | Maps a BAR into a kernel virtual MMIO pointer. | Use only for registers, not normal RAM access. |
| dma_addr_t | Device-visible bus address returned by the DMA API. | Never substitute CPU virtual or physical addresses. |
| iommu_group | Set of devices sharing one DMA isolation boundary. | VFIO assignment must include the whole group. |
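At the configuration-space level, the low bits of a BAR are flags rather than address bits: bit 0 selects I/O or memory space, and for memory BARs bits 2:1 give the type (0b10 means 64-bit) and bit 3 marks the range prefetchable. The sketch below decodes those bits and derives a 32-bit memory BAR's size from the value read back after writing all ones. The struct and function names are made up for illustration. Linux drivers normally do not do this themselves; the PCI core probes BARs during enumeration and drivers read the results from struct pci_dev.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of one 32-bit BAR register. */
struct bar_info {
    bool     is_io;        /* bit 0: 1 = I/O space, 0 = memory space */
    bool     is_64bit;     /* memory BAR type field (bits 2:1) == 0b10 */
    bool     prefetchable; /* memory BAR bit 3 */
    uint32_t base_lo;      /* low 32 bits of the base, flag bits cleared */
};

struct bar_info bar_decode(uint32_t bar)
{
    struct bar_info b = { false, false, false, 0 };

    b.is_io = (bar & 0x1u) != 0;
    if (b.is_io) {
        b.base_lo = bar & ~0x3u;            /* I/O BARs reserve bits 1:0 */
    } else {
        b.is_64bit = ((bar >> 1) & 0x3u) == 0x2u;
        b.prefetchable = ((bar >> 3) & 0x1u) != 0;
        b.base_lo = bar & ~0xfu;            /* memory BARs reserve bits 3:0 */
    }
    return b;
}

/* Size in bytes of a 32-bit memory BAR, given the value read back after
 * writing 0xffffffff to it. Returns 0 for an unimplemented BAR. */
uint32_t bar_size_from_probe(uint32_t readback)
{
    uint32_t mask = readback & ~0xfu;       /* drop the flag bits */
    return mask ? (~mask + 1u) : 0u;        /* lowest writable bit = size */
}
```

A readback of `0xffff000c` therefore describes a 64 KiB, prefetchable, 64-bit memory BAR (the upper half lives in the next BAR register).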
3. Core Mechanism
MMIO Writes and Doorbells
MMIO uses ordinary CPU load and store instructions, but the BAR is mapped uncached (or write-combining) rather than write-back like normal RAM. A NIC doorbell write is usually a posted PCIe Memory Write TLP, so the CPU can continue without waiting for a completion.
Write-combining matters when a driver writes descriptors or other multi-word data through a BAR mapping and then rings a queue tail register. Combining several adjacent stores into one PCIe burst reduces per-transaction overhead. The fast path should also avoid MMIO reads after writes, because each read stalls the CPU until its completion returns.
Background: A NIC driver has filled four TX descriptors in host memory and must tell the device that queue tail moved from 64 to 68.
Plan: First write packet descriptors into coherent DMA memory. Then issue the memory barrier that makes those descriptor writes visible before the doorbell. Finally write the tail index to the BAR doorbell register.
| Step | State change | Why it is correct |
|---|---|---|
| 1 | Driver stores descriptor addresses, lengths, and flags into DMA ring slots 64-67. | The device will later DMA-read those cache-coherent ring entries. |
| 2 | Driver orders descriptor writes before the MMIO doorbell. | The NIC must not observe the new tail before descriptors are valid. |
| 3 | Driver writes 68 to the tail register in BAR0. | The posted MWr tells the NIC there are four new descriptors to consume. |
| 4 | NIC DMA-reads descriptors and packet bytes, then transmits packets. | No syscall or CPU copy is needed after the queue is armed. |
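The first three steps in the table can be sketched in userspace C. This is a simulation under stated assumptions, not driver code: the descriptor layout, `RING_SIZE`, and `tx_post` are hypothetical, an ordinary array stands in for coherent DMA memory, a `volatile` variable stands in for the BAR0 tail register, and a C11 release fence stands in for the barrier a kernel driver would get from `dma_wmb()`, `wmb()`, or the ordering that `writel()` provides.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 256u              /* hypothetical ring depth */

struct tx_desc {                    /* hypothetical descriptor layout */
    uint64_t dma_addr;              /* device-visible buffer address */
    uint16_t len;                   /* packet length in bytes */
    uint16_t flags;                 /* bit 0: descriptor valid */
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE]; /* stands in for coherent DMA memory */
    volatile uint32_t *tail_reg;    /* stands in for the BAR0 tail doorbell */
    uint32_t tail;                  /* driver's software copy of the tail */
};

/* Queue n packets, then publish all of them with one doorbell write. */
void tx_post(struct tx_ring *r, const uint64_t *addrs,
             const uint16_t *lens, uint32_t n)
{
    /* Step 1: fill the descriptors the device will later DMA-read. */
    for (uint32_t i = 0; i < n; i++) {
        struct tx_desc *d = &r->desc[(r->tail + i) % RING_SIZE];
        d->dma_addr = addrs[i];
        d->len = lens[i];
        d->flags = 1;
    }

    /* Step 2: order descriptor stores before the doorbell store. */
    atomic_thread_fence(memory_order_release);

    /* Step 3: advance the tail and ring the doorbell once. */
    r->tail = (r->tail + n) % RING_SIZE;
    *r->tail_reg = r->tail;
}
```

Starting from a tail of 64, posting four packets fills slots 64 to 67 and leaves 68 in the simulated tail register, matching the scenario above. Ringing the doorbell once per batch rather than once per packet keeps the number of posted MMIO writes low.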
DMA Address Translation
A buffer has three identities: CPU virtual address for software, CPU physical address for DRAM, and DMA address for the device. With an IOMMU enabled, the DMA address is an IOVA translated through IOMMU page tables.
The IOMMU blocks a device from reaching any memory that has not been explicitly mapped for it. That protection is why VFIO can safely hand a PCIe function to userspace.
Devices in the same IOMMU group cannot be isolated from each other, often because a bridge between them lacks Access Control Services (ACS). Assigning only one device from such a group would leave a DMA escape path.
VFIO exposes the group, device regions, IRQ setup, and DMA map calls to userspace while the kernel keeps the IOMMU domain in control.
Example: A DPDK process maps 2 MB of huge-page memory at IOVA 0x10000000. The NIC RX descriptor receives that IOVA, not a C pointer. When a packet arrives, the NIC writes to IOVA 0x10000400, VT-d translates it to the pinned huge-page frame, and invalid IOVAs fault instead of corrupting arbitrary RAM.
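The translate-or-fault behavior in this example can be modeled with a flat lookup table. The sketch below is a toy: `iova_mapping` and `toy_iommu_translate` are hypothetical names, the physical address used in the usage note is invented, and a real IOMMU walks multi-level page tables in hardware rather than scanning an array. It only illustrates that an access inside a mapped range is translated by offset and anything else is rejected.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous mapping, for example a pinned 2 MB huge page. */
struct iova_mapping {
    uint64_t iova; /* device-visible start address */
    uint64_t phys; /* CPU physical start address */
    uint64_t size; /* length in bytes */
};

/* Translate a device access of len bytes at iova.
 * Returns false to model an IOMMU fault; *phys_out is then untouched. */
bool toy_iommu_translate(const struct iova_mapping *maps, size_t n,
                         uint64_t iova, uint64_t len, uint64_t *phys_out)
{
    for (size_t i = 0; i < n; i++) {
        const struct iova_mapping *m = &maps[i];

        /* The whole access must fit inside one mapping; the comparisons
         * are arranged to avoid unsigned overflow. */
        if (iova >= m->iova && len <= m->size &&
            iova - m->iova <= m->size - len) {
            *phys_out = m->phys + (iova - m->iova);
            return true;
        }
    }
    return false; /* unmapped IOVA: the write never reaches RAM */
}
```

With a single 2 MB mapping at IOVA 0x10000000 backed by a made-up physical address of 0x240000000, a 1500-byte write at IOVA 0x10000400 translates to 0x240000400, while an access at 0x10200000 falls just past the mapping and faults.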
NUMA Placement
NUMA topology decides whether a CPU core, memory buffer, and PCIe device are local to the same socket. Packet processing suffers when a core on socket 1 repeatedly touches a mempool allocated on socket 0 for a NIC attached to socket 0.
Linux exposes CPU topology as packages, dies, cores, and logical CPUs. Performance tuning starts by matching IRQs, polling threads, and memory allocation to the PCIe device locality.
| Access | Typical latency | Implication |
|---|---|---|
| L3 cache hit | ~10 ns | Keep hot queue metadata and descriptors cache-resident. |
| Local DRAM | ~60 ns | Allocate RX/TX rings and packet pools on the NIC socket. |
| Remote DRAM | ~120 ns | Remote mempools double load latency and burn inter-socket bandwidth. |
4. Minimal C Demo
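A minimal sketch, assuming a Linux host with sysfs mounted at /sys, of checking where a PCIe function sits in the NUMA topology before pinning threads and allocating memory. It reads the `numa_node` attribute that the kernel exposes under `/sys/bus/pci/devices/<BDF>/`. The helper names `read_sysfs_int` and `pci_numa_node` are invented for this example.

```c
#include <stdio.h>

/* Read one integer from a sysfs-style text file.
 * Returns 0 on success, -1 if the file is missing or does not parse. */
int read_sysfs_int(const char *path, int *value)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int parsed = fscanf(f, "%d", value);
    fclose(f);
    return parsed == 1 ? 0 : -1;
}

/* Look up the NUMA node of a PCI function given its address in
 * "dddd:bb:dd.f" form. The kernel reports -1 when the node is unknown. */
int pci_numa_node(const char *bdf, int *node)
{
    char path[128];
    int n = snprintf(path, sizeof(path),
                     "/sys/bus/pci/devices/%s/numa_node", bdf);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1; /* address string too long for the buffer */
    return read_sysfs_int(path, node);
}
```

A caller might pass a placeholder address such as "0000:3b:00.0", then use the returned node to choose the cores and the memory node for that NIC's polling threads and packet pools. A value of -1 usually means the platform did not report locality, in which case there is nothing to match against.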
5. Kernel Source Pointers
| Topic | Files / functions | What to inspect |
|---|---|---|
| PCI enumeration | drivers/pci/probe.c, pci_scan_device(), pci_setup_device() | How config space becomes struct pci_dev. |
| BAR mapping | drivers/pci/pci.c, pci_ioremap_bar(); drivers/pci/iomap.c, pci_iomap() | How drivers map device register windows. |
| DMA API | kernel/dma/mapping.c, dma_alloc_attrs(), dma_map_page_attrs() | Coherent allocation versus streaming maps. |
| IOMMU | drivers/iommu/iommu.c, iommu_map(), iommu_attach_device() | Domain attachment and IOVA to physical mappings. |
| VFIO | drivers/vfio/vfio_main.c, drivers/vfio/pci/vfio_pci_core.c | Container, group, device fd, BAR mmap, and IRQ setup. |
| NUMA topology | drivers/base/node.c, include/linux/topology.h | How node and CPU locality are exposed to drivers and sysfs. |
6. Interview Prep
| Question | Concise answer |
|---|---|
| What is a TLP in PCIe? | A Transaction Layer Packet is the PCIe unit for reads, writes, completions, config access, and messages. MRd requests data, MWr posts data, and CplD returns read data. |
| What are BAR registers? | BARs advertise device address windows. Linux assigns physical address ranges and maps them with pci_ioremap_bar() so the driver can access MMIO registers. |
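| Why use write-combining for NIC doorbells? | A doorbell is a small posted write. WC pays off when descriptors or other adjacent data are written through the BAR as well, because the stores merge into fewer PCIe bursts. The fast path should also avoid MMIO reads, which stall until a completion returns. |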
| Virtual, physical, and bus address: what is the difference? | Virtual is what CPU software dereferences, physical is the DRAM frame, and bus or DMA address is what the device uses. The IOMMU may translate bus address to physical address. |
| Why does DPDK VFIO need the IOMMU? | VFIO lets userspace drive the device, so the kernel needs IOMMU mappings to restrict DMA to pinned memory owned by that process. |
| What is an IOMMU group? | It is the smallest set of devices that can be isolated for DMA. All devices in a group must be assigned together because they may reach each other without IOMMU separation. |
| How do you enforce NUMA locality in DPDK? | Pin lcores to the NIC socket, allocate mempools on the same socket, and verify topology with lscpu, sysfs, or EAL logs. |
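| What does numactl --cpunodebind=0 --membind=0 do? | It runs the process only on CPUs of node 0 and restricts its memory allocations to node 0, keeping compute and memory local. Because --membind is strict, allocations fail rather than spill to another node when node 0 runs out. |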