§ 1.2 Kernel Boot Process
From the first C instruction in start_kernel() to userspace PID 1: early memory, CPU init, allocators, VFS, scheduler, and network stack bootstrap.
1. Overview
After the bootloader loads the compressed image and the kernel's self-extracting stub unpacks it, execution reaches startup_64 (x86-64) or the head.S entry code (ARM64). A short stretch of assembly sets up initial page tables, enables paging/the MMU, and branches to start_kernel() — the first C function of the kernel proper. From here the kernel must bootstrap everything from scratch: memory allocators, interrupt handling, scheduling, filesystems, and the network stack, all in a carefully ordered sequence where each subsystem may use only services initialized before it.
The entire sequence runs on a single CPU with interrupts disabled until local_irq_enable() is explicitly called. When rest_init() is reached, the kernel spawns PID 1 (kernel_init) and PID 2 (kthreadd), then becomes the idle thread (PID 0).
2. Key Data Structures
struct boot_params — x86 Boot Protocol
On x86, the bootloader fills the 4 KB boot_params structure (the "zero page") and, with the 64-bit boot protocol, passes its physical address in register rsi. setup_arch() parses it in the very first lines of kernel C code. The most important sub-structure is the e820_table — the firmware-provided physical memory map.
struct e820_entry — Physical Memory Map
Each entry describes one contiguous physical region. BIOS firmware fills these via INT 0x15, EAX=0xE820; on UEFI systems the bootloader or EFI stub converts the EFI memory map into the same format. The kernel uses this map to know which RAM is free, which is reserved (MMIO, ACPI tables), and which is unusable. It is the foundation for memblock.
| Type value | Meaning | Kernel action |
|---|---|---|
| 1 — E820_RAM | Usable system RAM | Add to memblock; later given to buddy allocator |
| 2 — E820_RESERVED | Reserved by firmware / MMIO | Never touched; ioremap() maps these for drivers |
| 3 — E820_ACPI | ACPI tables — readable RAM | Freed after ACPI init; reclaimed as normal RAM |
| 4 — E820_NVS | ACPI NVS — firmware needs it across S3 | Never freed; survives suspend/resume cycle |
| 5 — E820_UNUSABLE | Broken / bad RAM | Excluded from all allocators permanently |
memblock — The Boot-Time Allocator
Before the buddy allocator exists, something simpler must manage physical memory. memblock is that allocator: it keeps two sorted arrays of physical ranges — memory[] (all RAM the firmware reported) and reserved[] (ranges already in use); free memory is whatever lies in memory[] but not in reserved[]. Allocation is a simple range split, and individual frees are rare — nearly everything is released in one pass when mem_init() hands the remaining ranges to the buddy allocator.
| Field (in struct memblock) | Type | Purpose |
|---|---|---|
| bottom_up | bool | Allocate from low addresses when true (default is top-down) |
| current_limit | phys_addr_t | Upper bound for allocations (lowmem cap) |
| memory | struct memblock_type | Array of available physical ranges from e820 |
| reserved | struct memblock_type | Array of ranges already allocated (kernel, initrd…) |
| memory.regions[] | struct memblock_region | Each region: base, size, flags, NUMA node id |
3. Core Mechanism — From memblock to Buddy
Background: The kernel needs to allocate memory for its own data structures before the buddy allocator exists (page tables, IRQ descriptors, CPU stacks…). But it also needs to give the buddy allocator the correct list of free physical pages once it is ready. memblock is the bridge: it answers early allocation requests, then transfers ownership to the buddy allocator in a single pass during mem_init().
Plan:
- setup_arch() calls e820__memblock_setup(): every E820_RAM range becomes a memblock.memory entry; the kernel image, initrd, and ACPI tables are immediately reserved.
- Subsystems allocate from memblock via memblock_alloc(size, align). The allocator finds the last free range that fits and splits it, adding the allocated portion to memblock.reserved.
- mm_init() calls free_area_init() to set up zone descriptors (DMA, Normal, HighMem on 32-bit) without touching actual pages yet.
- mem_init() iterates every memblock.memory range, skips reserved sub-ranges, and calls __free_pages_memory() on the gaps — this puts pages into the buddy free lists.
- After mem_init() returns, the buddy allocator is live. memblock data can still be read (for NUMA queries) but is no longer used for allocation.
Example — 512 MB RAM, kernel at 0x1000000:
| Step | memblock.memory | memblock.reserved |
|---|---|---|
| After e820__memblock_setup() | [0x0–0x9FFFF] + [0x100000–0x1FFFFFFF] | (empty) |
| Kernel image reserved | (unchanged) | [0x1000000–0x1FFFFFF] kernel text+data+bss |
| initrd reserved | (unchanged) | + [0x4000000–0x4200000] initramfs |
| memblock_alloc(4 KB) for IDT | (unchanged) | + [0x1FF000–0x1FFFFF] IDT table |
| mem_init() transfers to buddy | (read-only from now on) | gaps between reserved ranges → buddy free lists |
After mem_init(), the buddy allocator owns roughly 493 MB of free pages (512 MB minus the kernel image, initrd, and firmware holes), split across order-0 through order-10 free lists. The kernel image and initrd are permanently reserved and never returned to the allocator.
4. Minimal C Demos
Demo A — memblock Range Allocator
memblock keeps a sorted array of free ranges and a sorted array of reserved ranges. Allocation is a top-down scan: find the last free range that fits, mark the tail as reserved, return its address.
Demo B — e820 Memory Map Parser
setup_arch() walks the e820 table from boot_params.e820_table and calls memblock_add() for every E820_RAM entry. The demo below simulates this scan, computing total usable RAM and the largest contiguous free region — the same logic as e820__memblock_setup().
5. Kernel Source Pointers
| File / Function | What it does |
|---|---|
init/main.c :: start_kernel() | First C function; calls every subsystem init in order |
arch/x86/kernel/setup.c :: setup_arch() | x86 arch init: parse boot_params, e820, KASLR, NUMA |
arch/arm64/kernel/setup.c :: setup_arch() | ARM64 arch init: unflatten DTB, map memory, setup CPU |
mm/memblock.c :: memblock_alloc() | Boot-time allocator; split free range, add to reserved list |
arch/x86/kernel/e820.c :: e820__memblock_setup() | Walk e820 table and call memblock_add() for each RAM range |
mm/page_alloc.c :: free_area_init() | Initialise per-zone free_area[MAX_ORDER] buddy lists |
mm/memblock.c :: memblock_free_all() | Transfer memblock's remaining free ranges to the buddy allocator; called from each arch's mem_init()
mm/slab_common.c :: kmem_cache_init() | Bootstrap slab/slub allocator using early buddy pages |
kernel/sched/core.c :: sched_init() | Create per-CPU run-queues; init CFS, RT, deadline classes |
fs/dcache.c :: vfs_caches_init() | Allocate dentry/inode hash tables; register bdev/char filesystems |
init/main.c :: rest_init() | Spawn kernel_init (PID 1) + kthreadd (PID 2); become idle loop |
init/init_task.c :: init_task | Static compile-time definition of PID 0 (swapper/idle) |
6. Interview Prep
| # | Question | Concise Answer |
|---|---|---|
| Q1 | What is the first C function the kernel runs after decompression? | start_kernel() in init/main.c. The assembly stubs in startup_32/startup_64 (x86) or head.S (ARM64) set up page tables and stack, then branch to it. It runs with interrupts off on a single CPU. |
| Q2 | What does memblock do and why is it needed before the buddy allocator? | memblock is a static boot-time allocator: two sorted arrays of physical ranges (available and reserved). It answers allocation requests before the buddy page allocator is live, then transfers all remaining free ranges to the buddy allocator in a single pass during mem_init(). |
| Q3 | Walk me through the full boot sequence from BIOS to PID 1. | BIOS/UEFI → bootloader → decompress kernel → startup_64/head.S (MMU on) → start_kernel() → setup_arch (e820/DTB, memblock) → mm_init (buddy) → kmem_cache_init (slab) → sched_init → vfs_caches_init → rest_init → kernel_init thread mounts rootfs → exec /sbin/init (PID 1). |
| Q4 | Why does start_kernel() run with interrupts disabled? | Interrupts require a live IDT (x86) or interrupt vector table (ARM64), per-CPU stacks, and a softirq mechanism — none of which exist yet. trap_init() and init_IRQ() set these up during start_kernel(); local_irq_enable() is called only after all interrupt infrastructure is ready. |
| Q5 | What is PID 0 and how does it relate to the idle thread? | PID 0 is init_task, a statically compiled task_struct. It is the bootstrap task that runs start_kernel(). After rest_init() spawns PID 1 and PID 2, init_task becomes the per-CPU idle thread: cpu_idle_loop() runs HLT (x86) / WFI (ARM) whenever no other task is runnable. |