Tech Notes

1. Overview

Linux storage is a layered contract: userspace speaks file descriptors, VFS normalizes filesystem operations, ext4 maps names and logical blocks to disk blocks, and the block layer turns those mappings into DMA-friendly requests.

2. Key Data Structures

VFS Object Hierarchy

VFS separates mounted filesystem state, persistent file metadata, path-name cache entries, and per-open state into four objects.

Object	Key fields	Purpose
`super_block`	`s_dev`, `s_root`, `s_op`	One mounted filesystem instance and its superblock operations.
`inode`	`i_ino`, `i_mode`, `i_size`, `i_atime`, `i_mapping`	File identity, permissions, timestamps, size, and page-cache mapping.
`dentry`	`d_name`, `d_parent`, `d_inode`	Cached name-to-inode binding for fast path lookup.
`file`	`f_path`, `f_inode`, `f_pos`, `f_op`	Per-open state: offset, flags, credentials, and file operations.

Filesystems plug behavior into VFS through operation tables such as inode_operations for lookup/create/link and file_operations for read/write/mmap/ioctl/poll.

ext4 Extents and Block Groups

ext4 stores contiguous logical-to-physical mappings as extents, so a large sequential file can be described by a few records instead of thousands of indirect block pointers.

Field	Type	Meaning
`ee_block`	`__le32`	First logical block covered by this extent.
`ee_len`	`__le16`	Number of contiguous blocks in the run.
`ee_start_hi` / `ee_start_lo`	`__le16` / `__le32`	Physical starting block, split into high and low bits.

A block group clusters allocation metadata with inode tables and data blocks, which keeps nearby files and directories physically close when allocation succeeds.

Block I/O Requests

The block layer represents each I/O as a bio: a target sector plus a scatter-gather list of memory ranges.

Structure	Fields	Role
`bio`	`bi_bdev`, `bi_iter`, `bi_opf`, `bi_io_vec`	Describes the disk target, sector cursor, operation flags, and memory segments.
`bio_vec`	`bv_page`, `bv_offset`, `bv_len`	One contiguous memory slice inside a page.

blk-mq avoids one global request queue by mapping per-CPU software queues onto hardware dispatch queues, which matches NVMe devices with many submission queues.

Page Cache, Free Space, and Coroutine State

The page cache keeps recently used file pages in memory with a two-list LRU: repeated references promote pages to the active list, while cold inactive pages become reclaim candidates.

Object	Key fields	Purpose
`address_space`	`i_pages`, `a_ops`, `host`	Per-file page-cache mapping from file offset to cached folios and filesystem callbacks.
`folio`	`flags`, `mapping`, `index`, `_refcount`	Physically contiguous cache unit that may contain one or more base pages.
`lruvec`	`lists`, `reclaim_stat`	Per-node/per-cgroup active and inactive LRU lists scanned by reclaim.

Sequential scans should not evict the real working set, so database page caches often treat scan pages as MRU-at-the-cold-end entries that can be discarded soon after use.

Disk free space can be tracked by a compact bitmap for single pages or by a buddy allocator when contiguous extents are valuable.

Technique	Metadata	Best use
Bitmap	One bit per disk page, where zero means free and one means allocated.	Simple allocation, fast free-space summaries, compact persistent metadata.
Disk buddy	Free lists by power-of-two extent order, plus coalescing with the buddy extent on free.	Allocating large contiguous runs for append-heavy files or columnar segments.

A stackful coroutine stores a register snapshot and a private stack, so async disk code can yield at an I/O wait point and later resume the same call chain.

3. Core Mechanism

B+ Tree Insert with Split

A B+ tree keeps records in linked leaves while internal pages hold separator keys only; inserts stay local until a page overflows.

When a leaf overflows, the tree splits the leaf, repairs sibling pointers, and copies the first key of the right leaf into the parent.

Concurrent B+ tree code uses lock coupling: hold the parent until the child is known safe for the pending insert or delete, then release the parent and descend.

Background: A database index page can fit only a fixed number of keys, so an insert into a full leaf needs a local rewrite without losing sorted scan order.

Plan: 1. Search to the leaf. 2. Insert into a temporary overflow slot. 3. Split the sorted keys in half. 4. Link the new right leaf. 5. Copy the right leaf minimum key into the parent.

Step	State	Why it is correct
Initial	Leaf contains `10, 20, 30`; capacity is 3.	The leaf is sorted but full.
Insert 25	Temporary leaf becomes `10, 20, 25, 30`.	Overflow is allowed briefly so ordering is preserved before split.
Split	Left leaf gets `10, 20`; right leaf gets `25, 30`.	Both leaves are at or near half full.
Promote	Parent receives separator `25`.	Searches for keys greater than or equal to 25 now choose the right leaf.

JBD2 and Direct I/O Paths

ext4 journaling makes metadata updates crash-consistent by committing a transaction to the journal before checkpointing it back to home locations.

O_DIRECT skips the page cache and submits aligned user pages directly to the block layer.

Page Cache Writeback and Double-Write Recovery

A dirty cached page is fast for the writer but unsafe after a crash until writeback reaches durable storage; databases often add a double-write buffer to avoid torn page recovery failures.

Background: A table page may be larger than the storage atomic-write unit. If power fails halfway through the home write, the page can contain mixed old and new bytes.

Plan: 1. Write the full page to a sequential double-write area. 2. Flush that safe copy. 3. Write the page to its real location. 4. During startup, use checksums to repair any torn home page.

Recovery checks the table file against the double-write copy; if the home checksum is bad, the safe copy is restored.

Step	State	Why it is correct
Dirty	Page 42 in memory contains new rows and a new checksum.	The cache can serve reads, but storage has not caught up.
Double-write	Page 42 is written sequentially to the double-write buffer.	A complete copy exists before the random home write starts.
Home write tears	The table file page has half old bytes and half new bytes.	The checksum detects that the home page is invalid.
Recovery	Startup copies the double-write page back to page 42.	The table file returns to a complete page image.

Coroutine Scheduling for Async Disk I/O

Coroutines keep the CPU busy while disk requests are in flight: one coroutine submits I/O and yields, the scheduler polls completions, and the coroutine resumes only when its result is ready.

A coroutine switch is cooperative: it saves the current register and stack state in userspace, chooses another ready coroutine, and restores that coroutine without a kernel thread context switch.

Background: SSD reads can take tens or hundreds of microseconds. Blocking one kernel thread per outstanding request wastes stack memory and scheduler time.

Plan: 1. The query coroutine submits an async read. 2. It records the resume point and yields. 3. The scheduler runs other ready queries. 4. Completion polling marks the original query ready. 5. The saved stack resumes at the next statement.

Moment	Coroutine A	Scheduler
Submit	Issues read for page 900 and saves its continuation.	Moves A from running to waiting.
Overlap	Consumes no CPU while NVMe works.	Runs coroutine B and polls the completion queue.
Complete	Becomes ready when the CQ entry arrives.	Restores coroutine A saved stack and registers.
Resume	Continues as if the read call returned synchronously.	Keeps thousands of requests in flight with few kernel threads.

4. Minimal C Demo

This demo models the B+ tree leaf split rule: overflow locally, split into two leaves, and copy the right leaf minimum key up to the parent.

B+ Tree Leaf Split — C Demo

stdin (optional)

This second demo models how bio_add_page() builds a scatter-gather request before submit_bio() hands it to the block layer.

bio_vec Scatter-Gather Request — C Demo

stdin (optional)

This demo models a double-write buffer: the safe sequential copy is written before the table-file page, then recovery repairs a simulated torn write.

Double-Write Buffer Recovery — C Demo

stdin (optional)

This final demo models cooperative scheduling with explicit yield points, the same control-flow idea used by stackful coroutine runtimes around async disk I/O.

Coroutine Scheduler Model — C Demo

stdin (optional)

5. Kernel Source Pointers

fs/open.c: do_sys_openat2(), do_filp_open()
fs/namei.c: path_lookupat(), link_path_walk(), lookup_fast()
fs/ext4/extents.c: ext4_ext_map_blocks(), ext4_ext_insert_extent()
fs/jbd2/commit.c: jbd2_journal_commit_transaction()
block/bio.c: bio_alloc_bioset(), bio_add_page(), submit_bio()
block/blk-mq.c: blk_mq_submit_bio(), blk_mq_get_tag()
mm/filemap.c: filemap_get_pages(), filemap_fault()
mm/vmscan.c: shrink_node(), shrink_folio_list()
mm/page-writeback.c: balance_dirty_pages(), writeback_sb_inodes()
fs/ext4/mballoc.c: ext4_mb_new_blocks(), ext4_mb_free_blocks()
io_uring/io_uring.c: io_submit_sqes(), io_cqring_wait()

6. Interview Prep

Question	Concise answer
What are the four VFS objects?	`super_block` is the mount, `inode` is file metadata, `dentry` maps names to inodes, and `file` is per-open state.
How does the dentry cache accelerate lookup?	It caches each path component, including negative lookups, so common opens avoid repeated filesystem directory scans.
What is an ext4 extent?	A compact mapping from a logical block range to a contiguous physical block range, replacing long indirect pointer chains for large files.
What are the JBD2 data modes?	`writeback` journals metadata only, `ordered` writes data before metadata commit, and `journal` journals both data and metadata.
What is lock coupling in a B+ tree?	Acquire the child before releasing the parent; keep the parent if the child might split or merge and affect parent separators.
Why use two LRU lists?	The inactive list gives one-time pages a place to die quickly, while the active list protects pages that are referenced repeatedly.
Why use MRU behavior for sequential scans?	Scan pages are unlikely to be reused after the cursor passes, so they should not evict hot index or working-set pages.
What problem does a double-write buffer solve?	It gives recovery a complete page image when the table-file home write is torn by a crash.
Bitmap or buddy for disk allocation?	Use a bitmap for compact single-page tracking; use a buddy-style extent allocator when contiguous runs reduce fragmentation and improve scan I/O.
How does a coroutine differ from a thread?	A coroutine yields explicitly in userspace and resumes a saved stack; a thread is scheduled preemptively by the kernel.

§ 11.1–11.7 Storage Stack, Page Cache, and Async I/O

1. Overview

2. Key Data Structures

VFS Object Hierarchy

ext4 Extents and Block Groups

Block I/O Requests

Page Cache, Free Space, and Coroutine State

3. Core Mechanism

B+ Tree Insert with Split

JBD2 and Direct I/O Paths

Page Cache Writeback and Double-Write Recovery

Coroutine Scheduling for Async Disk I/O

4. Minimal C Demo

5. Kernel Source Pointers

6. Interview Prep