12.1-12.6 Virtualization, Containers, and CNI
KVM and EPT, live migration, Linux namespaces, cgroup v2, container networking, and storage virtualization.
1. Overview
Linux virtualization splits the VM into a userspace device model and a kernel execution engine. QEMU owns VM construction, emulated devices, and migration streams, while KVM enters guest mode, handles VM exits, and maintains the nested MMU state that maps guest memory onto host pages.
Containers use a different boundary. Instead of emulating a machine, Linux gives ordinary processes separate kernel views through namespaces, then limits shared host resources through cgroup v2 controllers.
2. Key Data Structures
A KVM virtual machine starts at /dev/kvm. QEMU issues ioctls that make the kernel create a struct kvm, registers guest RAM as memslots, creates one struct kvm_vcpu per virtual CPU, and maps a struct kvm_run page that it shares with the kernel for exits that need userspace emulation.
| Structure | Key fields | Purpose |
|---|---|---|
| struct kvm | memslots, irqchip, mmu notifier state | Per-VM kernel object that owns guest memory layout and virtual platform state. |
| struct kvm_vcpu | register cache, requests, architecture state | One runnable virtual CPU, normally driven by one QEMU thread. |
| struct kvm_run | exit_reason, I/O fields, MMIO fields | Shared page used to report exits from KVM to userspace. |
| VMCS | guest state, host state, controls, exit info | Hardware control block loaded by VMX on VM entry and VM exit. |
VMCS fields are grouped by responsibility: some fields are restored into the guest on entry, some restore the host on exit, and the control fields decide which guest operations trap back to KVM.
| VMEXIT reason | Typical handler | Why it exits |
|---|---|---|
| I/O port access | Often QEMU userspace | Legacy devices such as serial ports and PIT are emulated by the device model. |
| MMIO access | KVM fast path or QEMU | Guest touches a device register range rather than normal RAM. |
| CPUID | KVM | Hypervisor filters CPU features exposed to the guest. |
| HLT | KVM scheduler path | Idle guest vCPU can sleep until an interrupt is injected. |
| EPT violation | KVM MMU | Guest physical access has no valid nested translation or violates permissions. |
| Hypercall | KVM paravirt path | Guest explicitly requests hypervisor service. |
EPT is a second page table owned by the hypervisor. Guest page tables translate guest virtual addresses to guest physical addresses; EPT then translates those guest physical addresses to host physical pages.
Ballooning and KSM are memory overcommit tools. A balloon driver allocates pages inside the guest and reports them to the host, which can then reclaim their backing pages, while KSM scans for identical anonymous pages and merges them into a single copy-on-write page.
A network namespace owns a separate struct net. That object points at namespace-local routing tables, socket tables, protocol state, and network devices, so a container can have an eth0 that is unrelated to the host physical eth0.
| Structure | Key fields | Purpose |
|---|---|---|
| struct pid_namespace | child_reaper, level, parent | Gives each container its own PID tree and PID 1 reaper semantics. |
| struct net | dev_base_head, ipv4, ipv6, netfilter state | Holds per-network-namespace interfaces, routes, sockets, and protocol state. |
| struct cgroup | kn, root, subtree_control, css_sets | Node in the unified cgroup v2 hierarchy that groups processes for resource control. |
| struct cgroup_subsys_state | cgroup, refcnt, parent, id | Controller-specific state for CPU, memory, I/O, and other cgroup subsystems. |
| struct virtqueue | descriptor table, avail ring, used ring | Shared ring used by virtio devices to submit and complete guest I/O buffers. |
Cgroup v2 is a single hierarchy. Controllers such as CPU, memory, and I/O attach to the same tree, so orchestration systems can place a pod or container in one cgroup and apply all resource limits there.
Virtio block storage uses the same split as virtio networking: the guest driver fills virtqueue descriptors, notifies the backend, and receives completions through the used ring.
3. Core Mechanism
The hardest point is the VM run loop: guest execution is not interpretation. KVM asks the CPU to enter guest mode; hardware runs guest instructions directly until a configured event forces a VM exit.
Background
Suppose the guest writes to a virtio device register. A normal RAM write should stay in guest mode, but a device register write must be emulated by KVM or QEMU. Because the register range is not backed by ordinary RAM in the nested page tables, only accesses to that range exit; the rest of the guest's instructions run untrapped.
Plan
- QEMU calls KVM_RUN on the vCPU file descriptor.
- KVM loads pending interrupt, register, and MMU state into VMCS fields.
- The CPU executes the guest until a VMEXIT condition occurs.
- KVM handles exits that are pure kernel work, such as EPT faults or idle HLT.
- If a device model is needed, KVM fills struct kvm_run and returns to QEMU.
Walkthrough
For an EPT violation, the CPU has already translated through the guest page table. The missing part is the second translation from GPA to HPA, so the VM exits to KVM, KVM installs an EPT entry, and the same guest instruction is retried.
| Step | State Change | Result |
|---|---|---|
| 1 | Guest touches GPA 0x4000. | Guest page table walk succeeds. |
| 2 | EPT walk finds no present entry. | Hardware records an EPT violation exit qualification. |
| 3 | KVM resolves the GPA through the VM memslot. | A host page is allocated, pinned, or located. |
| 4 | KVM installs the EPT PTE and resumes the vCPU. | The original instruction completes without guest-visible fault. |
Live migration uses the same separation of responsibilities: KVM tracks which guest pages changed, QEMU copies memory and device state, and the final stop-and-copy window transfers the last dirty pages plus vCPU state.
The downtime window is intentionally small and late. Most RAM moves while the source VM is still running; only the final dirty set and CPU/device state require pausing the source.
In a network-heavy migration, memory is not the whole state. If the VM datapath runs through XDP, OVS, or a mix of the two during a transition, the migration controller must preserve map entries, flow tables, tunnel state, and packet steering so the destination host honors the same traffic contract.
Container Isolation Walkthrough
The hardest container concept is that isolation is assembled from multiple kernel views. A process can be root inside a user namespace, PID 1 inside a PID namespace, and attached to a veth device inside a network namespace while still sharing the same host kernel.
Background
Suppose Kubernetes starts a pod. The container runtime creates namespaces for process and filesystem isolation, then the CNI plugin creates a veth pair so the pod has a normal-looking eth0.
Plan
- Create the container process with clone() flags such as CLONE_NEWPID, CLONE_NEWNET, and CLONE_NEWNS.
- Attach the process to a cgroup v2 node and write controller files such as memory.max, cpu.weight, and io.max.
- Create a veth pair, move one endpoint into the pod network namespace, and attach the host endpoint to a bridge or eBPF datapath.
- Program routing, policy, service load balancing, and overlays through the CNI plugin.
Example
For two pods on the same node, the packet never needs the physical NIC. The frame leaves pod A through its veth pair, the Linux bridge learns pod A's source MAC, looks up the destination MAC in its forwarding table, and forwards the frame into pod B through the second veth pair.
Across nodes, a CNI plugin running in overlay mode encapsulates the pod packet in an outer packet addressed between node IPs, commonly with VXLAN or IPIP. The remote node decapsulates it and delivers the original pod packet to the destination veth.
Cilium replaces much of the iptables and kube-proxy path with eBPF programs and maps. Service load balancing, policy, and connection tracking are map lookups in the datapath rather than long rule-chain traversal.
Storage virtualization has a similar choice: emulate a block device through QEMU for compatibility, move the backend to userspace polling with vhost-user-blk and SPDK for throughput, or pass a physical NVMe device directly to the VM with VFIO.
NVMe pass-through removes most host storage virtualization overhead, but it also gives the guest ownership of the device function. The IOMMU is the safety boundary that restricts DMA to the guest memory assigned by VFIO.
4. Minimal C Demo
This demo models the KVM ioctl lifecycle and the shared kvm_run exit buffer. It does not use real kernel headers; it keeps only the control flow that matters.
This demo models dirty page logging during pre-copy migration. Each round copies the current dirty bitmap, clears it, and lets the next round capture newly written pages.
This demo models the cgroup v2 file interface. Real runtimes write controller files under /sys/fs/cgroup; the principle is that limits are normal kernel control files, not a separate daemon API.
This demo models how a veth pair gives a pod a private eth0 while keeping the peer endpoint in the host network namespace for bridge, routing, or eBPF handling.
5. Kernel Source Pointers
| Path / Function | What to Read |
|---|---|
| virt/kvm/kvm_main.c | VM and vCPU file descriptors, memory slots, KVM_RUN dispatch. |
| arch/x86/kvm/vmx/vmx.c | VMX entry/exit handling, VMCS setup, common VMEXIT handlers. |
| arch/x86/kvm/mmu/mmu.c | KVM MMU, nested page fault handling, shadow/EPT page table management. |
| include/uapi/linux/kvm.h | Userspace ABI: struct kvm_run, ioctls, exit reasons, dirty log APIs. |
| Documentation/virt/kvm/api.rst | KVM userspace API, vCPU lifecycle, memory slots, dirty logging. |
| mm/ksm.c | Kernel Samepage Merging scanner and copy-on-write behavior. |
| drivers/virtio/virtio_balloon.c | Guest balloon driver inflate/deflate path. |
| kernel/nsproxy.c / copy_namespaces() | Namespace creation during clone and unshare operations. |
| net/core/net_namespace.c | struct net lifecycle, per-net operations, network namespace cleanup. |
| kernel/cgroup/cgroup.c | Cgroup v2 hierarchy, controller enablement, process migration between cgroups. |
| drivers/net/veth.c | Veth peer forwarding path used by container network namespaces. |
| drivers/block/virtio_blk.c | Virtio block driver request submission, virtqueue usage, multi-queue support. |
| drivers/vfio/ | VFIO device assignment and IOMMU-backed DMA isolation for pass-through devices. |
| migration/ram.c in QEMU | Pre-copy RAM migration, dirty bitmap iteration, stop-and-copy logic. |
6. Interview Prep
| Question | Concise Answer |
|---|---|
| What is VMCS and what fields does it contain? | VMCS is Intel VMX hardware state for a vCPU. It contains guest state, host state, execution controls, entry controls, exit controls, and exit information. |
| Walk through a VMEXIT. | The guest runs directly on the CPU until a configured event occurs. Hardware saves guest state, restores host state, records the exit reason, and KVM either handles it or returns to QEMU through struct kvm_run. |
| What is EPT? | EPT is nested paging. The guest page table maps GVA to GPA, and EPT maps GPA to HPA. It avoids trapping every guest page table update, unlike old shadow page table designs. |
| What happens on an EPT violation? | The CPU exits because the GPA has no valid EPT translation or violates permissions. KVM resolves the memslot, installs or updates the EPT entry, and resumes the guest instruction. |
| Explain pre-copy live migration. | QEMU copies all RAM while the VM runs, repeatedly copies pages dirtied since the previous round, then briefly pauses the source to copy final dirty pages plus CPU and device state before resuming on the destination. |
| Why is network state hard in live migration? | Packet steering and policy can live outside guest RAM in XDP maps, OVS flow tables, tap/vhost queues, and tunnel state. The destination must receive equivalent datapath state before traffic is cut over. |
| What are Linux namespaces? | Namespaces give processes private kernel views: PID, network, mount, UTS, IPC, and user namespaces isolate process IDs, network stacks, mount tables, hostnames, IPC objects, and UID/GID mappings. |
| What is struct net? | It is the per-network-namespace kernel object. It owns namespace-local devices, routes, sockets, netfilter state, and protocol tables. |
| How does cgroup v2 limit resources? | Processes are placed in a unified cgroup tree. Controllers expose files such as cpu.weight, memory.max, and io.max that the scheduler, reclaim path, and block layer enforce. |
| Explain veth pairs. | A veth pair is two linked virtual Ethernet devices. One endpoint usually lives inside the container network namespace as eth0, and the peer stays on the host for bridge, routing, or eBPF processing. |
| How does Cilium replace kube-proxy? | Cilium attaches eBPF programs and uses BPF maps for services, backends, policy, and conntrack, so service translation and policy happen in the datapath instead of iptables chains. |
| What is virtio-blk versus NVMe pass-through? | Virtio-blk is a paravirtual block device using virtqueues and a host backend. NVMe pass-through assigns a physical NVMe PCI function to the guest through VFIO, with the IOMMU enforcing DMA isolation. |