§ 11.1–11.7 Storage Stack, Page Cache, and Async I/O
VFS objects, ext4 journaling, block-layer I/O, B+ trees, page-cache writeback, disk allocation, and coroutine-driven async storage.
1. Overview
Linux storage is a layered contract: userspace speaks file descriptors, VFS normalizes filesystem operations, ext4 maps names and logical blocks to disk blocks, and the block layer turns those mappings into DMA-friendly requests.
2. Key Data Structures
VFS Object Hierarchy
VFS separates mounted filesystem state, persistent file metadata, path-name cache entries, and per-open state into four objects.
| Object | Key fields | Purpose |
|---|---|---|
super_block | s_dev, s_root, s_op | One mounted filesystem instance and its superblock operations. |
inode | i_ino, i_mode, i_size, i_atime, i_mapping | File identity, permissions, timestamps, size, and page-cache mapping. |
dentry | d_name, d_parent, d_inode | Cached name-to-inode binding for fast path lookup. |
file | f_path, f_inode, f_pos, f_op | Per-open state: offset, flags, credentials, and file operations. |
Filesystems plug behavior into VFS through operation tables such as inode_operations for lookup/create/link and file_operations for read/write/mmap/ioctl/poll.
ext4 Extents and Block Groups
ext4 stores contiguous logical-to-physical mappings as extents, so a large sequential file can be described by a few records instead of thousands of indirect block pointers.
| Field | Type | Meaning |
|---|---|---|
ee_block | __le32 | First logical block covered by this extent. |
ee_len | __le16 | Number of contiguous blocks in the run. |
ee_start_hi / ee_start_lo | __le16 / __le32 | Physical starting block, split into high and low bits. |
A block group clusters allocation metadata with inode tables and data blocks, which keeps nearby files and directories physically close when allocation succeeds.
Block I/O Requests
The block layer represents each I/O as a bio: a target sector plus a scatter-gather list of memory ranges.
| Structure | Fields | Role |
|---|---|---|
bio | bi_bdev, bi_iter, bi_opf, bi_io_vec | Describes the disk target, sector cursor, operation flags, and memory segments. |
bio_vec | bv_page, bv_offset, bv_len | One contiguous memory slice inside a page. |
blk-mq avoids one global request queue by mapping per-CPU software queues onto hardware dispatch queues, which matches NVMe devices with many submission queues.
Page Cache, Free Space, and Coroutine State
The page cache keeps recently used file pages in memory with a two-list LRU: repeated references promote pages to the active list, while cold inactive pages become reclaim candidates.
| Object | Key fields | Purpose |
|---|---|---|
address_space | i_pages, a_ops, host | Per-file page-cache mapping from file offset to cached folios and filesystem callbacks. |
folio | flags, mapping, index, _refcount | Physically contiguous cache unit that may contain one or more base pages. |
lruvec | lists, reclaim_stat | Per-node/per-cgroup active and inactive LRU lists scanned by reclaim. |
Sequential scans should not evict the real working set, so database page caches often treat scan pages as MRU-at-the-cold-end entries that can be discarded soon after use.
Disk free space can be tracked by a compact bitmap for single pages or by a buddy allocator when contiguous extents are valuable.
| Technique | Metadata | Best use |
|---|---|---|
| Bitmap | One bit per disk page, where zero means free and one means allocated. | Simple allocation, fast free-space summaries, compact persistent metadata. |
| Disk buddy | Free lists by power-of-two extent order, plus coalescing with the buddy extent on free. | Allocating large contiguous runs for append-heavy files or columnar segments. |
A stackful coroutine stores a register snapshot and a private stack, so async disk code can yield at an I/O wait point and later resume the same call chain.
3. Core Mechanism
B+ Tree Insert with Split
A B+ tree keeps records in linked leaves while internal pages hold separator keys only; inserts stay local until a page overflows.
When a leaf overflows, the tree splits the leaf, repairs sibling pointers, and copies the first key of the right leaf into the parent.
Concurrent B+ tree code uses lock coupling: hold the parent until the child is known safe for the pending insert or delete, then release the parent and descend.
Background: A database index page can fit only a fixed number of keys, so an insert into a full leaf needs a local rewrite without losing sorted scan order.
Plan: 1. Search to the leaf. 2. Insert into a temporary overflow slot. 3. Split the sorted keys in half. 4. Link the new right leaf. 5. Copy the right leaf minimum key into the parent.
| Step | State | Why it is correct |
|---|---|---|
| Initial | Leaf contains 10, 20, 30; capacity is 3. | The leaf is sorted but full. |
| Insert 25 | Temporary leaf becomes 10, 20, 25, 30. | Overflow is allowed briefly so ordering is preserved before split. |
| Split | Left leaf gets 10, 20; right leaf gets 25, 30. | Both leaves are at or near half full. |
| Promote | Parent receives separator 25. | Searches for keys greater than or equal to 25 now choose the right leaf. |
JBD2 and Direct I/O Paths
ext4 journaling makes metadata updates crash-consistent by committing a transaction to the journal before checkpointing it back to home locations.
O_DIRECT skips the page cache and submits aligned user pages directly to the block layer.
Page Cache Writeback and Double-Write Recovery
A dirty cached page is fast for the writer but unsafe after a crash until writeback reaches durable storage; databases often add a double-write buffer to avoid torn page recovery failures.
Background: A table page may be larger than the storage atomic-write unit. If power fails halfway through the home write, the page can contain mixed old and new bytes.
Plan: 1. Write the full page to a sequential double-write area. 2. Flush that safe copy. 3. Write the page to its real location. 4. During startup, use checksums to repair any torn home page.
Recovery checks the table file against the double-write copy; if the home checksum is bad, the safe copy is restored.
| Step | State | Why it is correct |
|---|---|---|
| Dirty | Page 42 in memory contains new rows and a new checksum. | The cache can serve reads, but storage has not caught up. |
| Double-write | Page 42 is written sequentially to the double-write buffer. | A complete copy exists before the random home write starts. |
| Home write tears | The table file page has half old bytes and half new bytes. | The checksum detects that the home page is invalid. |
| Recovery | Startup copies the double-write page back to page 42. | The table file returns to a complete page image. |
Coroutine Scheduling for Async Disk I/O
Coroutines keep the CPU busy while disk requests are in flight: one coroutine submits I/O and yields, the scheduler polls completions, and the coroutine resumes only when its result is ready.
A coroutine switch is cooperative: it saves the current register and stack state in userspace, chooses another ready coroutine, and restores that coroutine without a kernel thread context switch.
Background: SSD reads can take tens or hundreds of microseconds. Blocking one kernel thread per outstanding request wastes stack memory and scheduler time.
Plan: 1. The query coroutine submits an async read. 2. It records the resume point and yields. 3. The scheduler runs other ready queries. 4. Completion polling marks the original query ready. 5. The saved stack resumes at the next statement.
| Moment | Coroutine A | Scheduler |
|---|---|---|
| Submit | Issues read for page 900 and saves its continuation. | Moves A from running to waiting. |
| Overlap | Consumes no CPU while NVMe works. | Runs coroutine B and polls the completion queue. |
| Complete | Becomes ready when the CQ entry arrives. | Restores coroutine A saved stack and registers. |
| Resume | Continues as if the read call returned synchronously. | Keeps thousands of requests in flight with few kernel threads. |
4. Minimal C Demo
This demo models the B+ tree leaf split rule: overflow locally, split into two leaves, and copy the right leaf minimum key up to the parent.
This second demo models how bio_add_page() builds a scatter-gather request before submit_bio() hands it to the block layer.
This demo models a double-write buffer: the safe sequential copy is written before the table-file page, then recovery repairs a simulated torn write.
This final demo models cooperative scheduling with explicit yield points, the same control-flow idea used by stackful coroutine runtimes around async disk I/O.
5. Kernel Source Pointers
fs/open.c:do_sys_openat2(),do_filp_open()fs/namei.c:path_lookupat(),link_path_walk(),lookup_fast()fs/ext4/extents.c:ext4_ext_map_blocks(),ext4_ext_insert_extent()fs/jbd2/commit.c:jbd2_journal_commit_transaction()block/bio.c:bio_alloc_bioset(),bio_add_page(),submit_bio()block/blk-mq.c:blk_mq_submit_bio(),blk_mq_get_tag()mm/filemap.c:filemap_get_pages(),filemap_fault()mm/vmscan.c:shrink_node(),shrink_folio_list()mm/page-writeback.c:balance_dirty_pages(),writeback_sb_inodes()fs/ext4/mballoc.c:ext4_mb_new_blocks(),ext4_mb_free_blocks()io_uring/io_uring.c:io_submit_sqes(),io_cqring_wait()
6. Interview Prep
| Question | Concise answer |
|---|---|
| What are the four VFS objects? | super_block is the mount, inode is file metadata, dentry maps names to inodes, and file is per-open state. |
| How does the dentry cache accelerate lookup? | It caches each path component, including negative lookups, so common opens avoid repeated filesystem directory scans. |
| What is an ext4 extent? | A compact mapping from a logical block range to a contiguous physical block range, replacing long indirect pointer chains for large files. |
| What are the JBD2 data modes? | writeback journals metadata only, ordered writes data before metadata commit, and journal journals both data and metadata. |
| What is lock coupling in a B+ tree? | Acquire the child before releasing the parent; keep the parent if the child might split or merge and affect parent separators. |
| Why use two LRU lists? | The inactive list gives one-time pages a place to die quickly, while the active list protects pages that are referenced repeatedly. |
| Why use MRU behavior for sequential scans? | Scan pages are unlikely to be reused after the cursor passes, so they should not evict hot index or working-set pages. |
| What problem does a double-write buffer solve? | It gives recovery a complete page image when the table-file home write is torn by a crash. |
| Bitmap or buddy for disk allocation? | Use a bitmap for compact single-page tracking; use a buddy-style extent allocator when contiguous runs reduce fragmentation and improve scan I/O. |
| How does a coroutine differ from a thread? | A coroutine yields explicitly in userspace and resumes a saved stack; a thread is scheduled preemptively by the kernel. |