Part XII - Virtualization

12.1-12.6 Virtualization, Containers, and CNI

KVM and EPT, live migration, Linux namespaces, cgroup v2, container networking, and storage virtualization.

1. Overview

Linux virtualization splits the VM into a userspace device model and a kernel execution engine. QEMU owns VM construction, emulated devices, and migration streams, while KVM enters guest mode, handles VM exits, and maintains the nested MMU state that maps guest memory onto host pages.

Containers use a different boundary. Instead of emulating a machine, Linux gives ordinary processes separate kernel views through namespaces, then limits shared host resources through cgroup v2 controllers.

2. Key Data Structures

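A KVM virtual machine starts at /dev/kvm. QEMU opens that device and uses ioctls to have the kernel create a struct kvm (KVM_CREATE_VM), register guest RAM as memslots (KVM_SET_USER_MEMORY_REGION), and create one struct kvm_vcpu per virtual CPU (KVM_CREATE_VCPU). It then maps a struct kvm_run page from each vCPU file descriptor, which the kernel uses to describe exits that need userspace emulation.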

Structure | Key fields | Purpose
struct kvm | memslots, irqchip, mmu notifier state | Per-VM kernel object that owns guest memory layout and virtual platform state.
struct kvm_vcpu | register cache, requests, architecture state | One runnable virtual CPU, normally driven by one QEMU thread.
struct kvm_run | exit_reason, I/O fields, MMIO fields | Shared page used to report exits from KVM to userspace.
VMCS | guest state, host state, controls, exit info | Hardware control block loaded by VMX on VM entry and VM exit.

VMCS fields are grouped by responsibility: some fields are restored into the guest on entry, some restore the host on exit, and the control fields decide which guest operations trap back to KVM.

VMEXIT reason | Typical handler | Why it exits
I/O port access | Often QEMU userspace | Legacy devices such as serial ports and PIT are emulated by the device model.
MMIO access | KVM fast path or QEMU | Guest touches a device register range rather than normal RAM.
CPUID | KVM | Hypervisor filters CPU features exposed to the guest.
HLT | KVM scheduler path | Idle guest vCPU can sleep until an interrupt is injected.
EPT violation | KVM MMU | Guest physical access has no valid nested translation or violates permissions.
Hypercall | KVM paravirt path | Guest explicitly requests hypervisor service.

EPT is a second page table owned by the hypervisor. Guest page tables translate guest virtual addresses to guest physical addresses; EPT then translates those guest physical addresses to host physical pages.

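Ballooning and KSM are memory overcommit tools. A balloon driver inside the guest allocates guest pages and reports them to the host, which can then reclaim their backing pages. KSM scans for identical anonymous pages, merges them into one shared host page, and maps that page copy-on-write so a later write gets a private copy.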

A network namespace owns a separate struct net. That object points at namespace-local routing tables, socket tables, protocol state, and network devices, so a container can have an eth0 that is unrelated to the host physical eth0.

Structure | Key fields | Purpose
struct pid_namespace | child_reaper, level, parent | Gives each container its own PID tree and PID 1 reaper semantics.
struct net | dev_base_head, ipv4, ipv6, netfilter state | Holds per-network-namespace interfaces, routes, sockets, and protocol state.
struct cgroup | kn, root, subtree_control, css_sets | Node in the unified cgroup v2 hierarchy that groups processes for resource control.
struct cgroup_subsys_state | cgroup, refcnt, parent, id | Controller-specific state for CPU, memory, I/O, and other cgroup subsystems.
struct virtqueue | descriptor table, avail ring, used ring | Shared ring used by virtio devices to submit and complete guest I/O buffers.

Cgroup v2 is a single hierarchy. Controllers such as CPU, memory, and I/O attach to the same tree, so orchestration systems can place a pod or container in one cgroup and apply all resource limits there.

Virtio block storage uses the same split as virtio networking: the guest driver fills virtqueue descriptors, notifies the backend, and receives completions through the used ring.

3. Core Mechanism

The hardest point is the VM run loop: guest execution is not interpretation. KVM asks the CPU to enter guest mode; hardware runs guest instructions directly until a configured event forces a VM exit.

Background

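Suppose the guest writes to a virtio device register. A normal RAM write should stay in guest mode, but a device register write must be emulated by KVM or QEMU. KVM arranges for only that register range to exit, for example by leaving an MMIO range without a backing memslot so the access faults in EPT, or by enabling port I/O exiting in the VMX controls, so ordinary instructions never trap.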

Plan

  1. QEMU issues the KVM_RUN ioctl on the vCPU file descriptor.
  2. KVM loads pending interrupt, register, and MMU state into VMCS fields.
  3. The CPU executes the guest until a VMEXIT condition occurs.
  4. KVM handles exits that are pure kernel work, such as EPT faults or idle HLT.
  5. If a device model is needed, KVM fills struct kvm_run and returns to QEMU.

Walkthrough

For an EPT violation, the CPU has already translated through the guest page table. The missing part is the second translation from GPA to HPA, so the VM exits to KVM, KVM installs an EPT entry, and the same guest instruction is retried.

Step | State Change | Result
1 | Guest touches GPA 0x4000. | Guest page table walk succeeds.
2 | EPT walk finds no present entry. | Hardware records an EPT violation exit qualification.
3 | KVM resolves the GPA through the VM memslot. | A host page is allocated, pinned, or located.
4 | KVM installs the EPT PTE and resumes the vCPU. | The original instruction completes without guest-visible fault.

Live migration uses the same separation of responsibilities: KVM tracks which guest pages changed, QEMU copies memory and device state, and the final stop-and-copy window transfers the last dirty pages plus vCPU state.

The downtime window is intentionally small and late. Most RAM moves while the source VM is still running; only the final dirty set and CPU/device state require pausing the source.

In a network-heavy migration, memory is not the whole state. If the VM's datapath is implemented in XDP, in OVS, or in a mix of both, the migration controller must preserve map entries, flow tables, tunnel state, and packet steering so the destination host sees the same traffic contract.

Container Isolation Walkthrough

The hardest container concept is that isolation is assembled from multiple kernel views. A process can be root inside a user namespace, PID 1 inside a PID namespace, and attached to a veth device inside a network namespace while still sharing the same host kernel.

Background

Suppose Kubernetes starts a pod. The container runtime creates namespaces for process and filesystem isolation, then the CNI plugin creates a veth pair so the pod has a normal-looking eth0.

Plan

  1. Create the container process with clone() flags such as CLONE_NEWPID, CLONE_NEWNET, and CLONE_NEWNS.
  2. Attach the process to a cgroup v2 node and write controller files such as memory.max, cpu.weight, and io.max.
  3. Create a veth pair, move one endpoint into the pod network namespace, and attach the host endpoint to a bridge or eBPF datapath.
  4. Program routing, policy, service load balancing, and overlays through the CNI plugin.

Example

For two pods on the same node, the packet never needs the physical NIC. It leaves pod A through its veth pair, the Linux bridge looks up the destination MAC in the forwarding table it has learned from earlier traffic, and the frame enters pod B through the second veth pair.

Across nodes, the CNI plugin encapsulates the pod packet in an outer packet that the underlay network can route between hosts, commonly using VXLAN or IPIP. The remote node decapsulates it and delivers the original pod packet to the destination veth.

Cilium replaces much of the iptables and kube-proxy path with eBPF programs and maps. Service load balancing, policy, and connection tracking are map lookups in the datapath rather than long rule-chain traversal.

Storage virtualization has a similar choice: emulate a block device through QEMU for compatibility, move the backend to userspace polling with vhost-user-blk and SPDK for throughput, or pass a physical NVMe device directly to the VM with VFIO.

NVMe pass-through removes most host storage virtualization overhead, but it also gives the guest ownership of the device function. The IOMMU is the safety boundary that restricts DMA to the guest memory assigned by VFIO.

4. Minimal C Demo

This demo models the KVM ioctl lifecycle and the shared kvm_run exit buffer. It does not use real kernel headers; it keeps only the control flow that matters.

KVM ioctl Run Loop — C Demo

This demo models dirty page logging during pre-copy migration. Each round copies the current dirty bitmap, clears it, and lets the next round capture newly written pages.

Dirty Page Tracking for Pre-copy Migration — C Demo

This demo models the cgroup v2 file interface. Real runtimes write controller files under /sys/fs/cgroup; the principle is that limits are normal kernel control files, not a separate daemon API.

Cgroup v2 Limit Files — C Demo

This demo models how a veth pair gives a pod a private eth0 while keeping the peer endpoint in the host network namespace for bridge, routing, or eBPF handling.

Veth Pair Placement — C Demo

5. Kernel Source Pointers

Path / Function | What to Read
virt/kvm/kvm_main.c | VM and vCPU file descriptors, memory slots, KVM_RUN dispatch.
arch/x86/kvm/vmx/vmx.c | VMX entry/exit handling, VMCS setup, common VMEXIT handlers.
arch/x86/kvm/mmu/mmu.c | KVM MMU, nested page fault handling, shadow/EPT page table management.
include/uapi/linux/kvm.h | Userspace ABI: struct kvm_run, ioctls, exit reasons, dirty log APIs.
Documentation/virt/kvm/api.rst | KVM userspace API, vCPU lifecycle, memory slots, dirty logging.
mm/ksm.c | Kernel Samepage Merging scanner and copy-on-write behavior.
drivers/virtio/virtio_balloon.c | Guest balloon driver inflate/deflate path.
kernel/nsproxy.c / copy_namespaces() | Namespace creation during clone and unshare operations.
net/core/net_namespace.c | struct net lifecycle, per-net operations, network namespace cleanup.
kernel/cgroup/cgroup.c | Cgroup v2 hierarchy, controller enablement, process migration between cgroups.
drivers/net/veth.c | Veth peer forwarding path used by container network namespaces.
drivers/block/virtio_blk.c | Virtio block driver request submission, virtqueue usage, multi-queue support.
drivers/vfio/ | VFIO device assignment and IOMMU-backed DMA isolation for pass-through devices.
migration/ram.c in QEMU | Pre-copy RAM migration, dirty bitmap iteration, stop-and-copy logic.

6. Interview Prep

Question | Concise Answer
What is VMCS and what fields does it contain? | VMCS is Intel VMX hardware state for a vCPU. It contains guest state, host state, execution controls, entry controls, exit controls, and exit information.
Walk through a VMEXIT. | The guest runs directly on the CPU until a configured event occurs. Hardware saves guest state, restores host state, records the exit reason, and KVM either handles it or returns to QEMU through struct kvm_run.
What is EPT? | EPT is nested paging. The guest page table maps GVA to GPA, and EPT maps GPA to HPA. It avoids trapping every guest page table update, unlike old shadow page table designs.
What happens on an EPT violation? | The CPU exits because the GPA has no valid EPT translation or violates permissions. KVM resolves the memslot, installs or updates the EPT entry, and resumes the guest instruction.
Explain pre-copy live migration. | QEMU copies all RAM while the VM runs, repeatedly copies pages dirtied since the previous round, then briefly pauses the source to copy final dirty pages plus CPU and device state before resuming on the destination.
Why is network state hard in live migration? | Packet steering and policy can live outside guest RAM in XDP maps, OVS flow tables, tap/vhost queues, and tunnel state. The destination must receive equivalent datapath state before traffic is cut over.
What are Linux namespaces? | Namespaces give processes private kernel views: PID, network, mount, UTS, IPC, and user namespaces isolate process IDs, network stacks, mount tables, hostnames, IPC objects, and UID/GID mappings.
What is struct net? | It is the per-network-namespace kernel object. It owns namespace-local devices, routes, sockets, netfilter state, and protocol tables.
How does cgroup v2 limit resources? | Processes are placed in a unified cgroup tree. Controllers expose files such as cpu.weight, memory.max, and io.max that the scheduler, reclaim path, and block layer enforce.
Explain veth pairs. | A veth pair is two linked virtual Ethernet devices. One endpoint usually lives inside the container network namespace as eth0, and the peer stays on the host for bridge, routing, or eBPF processing.
How does Cilium replace kube-proxy? | Cilium attaches eBPF programs and uses BPF maps for services, backends, policy, and conntrack, so service translation and policy happen in the datapath instead of iptables chains.
What is virtio-blk versus NVMe pass-through? | Virtio-blk is a paravirtual block device using virtqueues and a host backend. NVMe pass-through assigns a physical NVMe PCI function to the guest through VFIO, with the IOMMU enforcing DMA isolation.