Tech Notes

1. Overview

File I/O starts with an integer file descriptor, but the kernel work happens below it: a per-process descriptor table points to an open file description, the description points through VFS to an inode, and the inode owns metadata plus page-cache mappings.

2. Key Data Structures

The important distinction is descriptor slot versus open file description. dup() and fork() create new descriptor references to the same description, while each separate open() call creates a fresh description with its own offset and status flags.

Kernel Object	Main Fields	Scope	Why It Matters
FD table	array index, close-on-exec bit, pointer to file	per process, copied on fork	Controls which integer names are open and which survive exec.
Open file description	file offset, status flags, refcount, access mode	shared by dup and fork aliases	`O_APPEND`, `O_NONBLOCK`, and offsets live here.
inode	device, inode number, mode, size, mapping	filesystem object	Identifies the file and connects I/O to metadata and page cache.
page cache	file offset to page, dirty state, writeback state	shared kernel cache	Normal reads and writes are cached here before storage I/O.

fork Copies Descriptor Slots, Not Offsets

After fork(), parent and child have separate FD tables, but each copied slot points at the same open file description, so offset movement is visible to both.
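
The slot-versus-description split can be checked directly. The following is a minimal sketch, assuming Linux and a writable /tmp; the helper name offset_sharing_demo is hypothetical, and the asserts state the expected behavior rather than handle errors.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical demo helper: returns 0 after checking that a dup() alias
 * shares its offset with the original FD while a second open() does not. */
int offset_sharing_demo(void)
{
    char path[] = "/tmp/offset_demo_XXXXXX";   /* assumes /tmp is writable */
    int fd = mkstemp(path);                    /* new open file description */
    assert(fd >= 0);
    ssize_t n = write(fd, "abcdef", 6);        /* shared offset is now 6 */
    assert(n == 6);

    int alias = dup(fd);                       /* new slot, same description */
    int fresh = open(path, O_RDONLY);          /* new description, offset 0 */
    assert(alias >= 0 && fresh >= 0);

    assert(lseek(alias, 0, SEEK_CUR) == 6);    /* alias sees the shared offset */
    assert(lseek(fresh, 0, SEEK_CUR) == 0);    /* fresh open is independent */

    off_t moved = lseek(alias, 2, SEEK_SET);   /* move through the alias... */
    assert(moved == 2);
    assert(lseek(fd, 0, SEEK_CUR) == 2);       /* ...and the original follows */

    close(fresh);
    close(alias);
    close(fd);
    unlink(path);
    return 0;
}
```

A descriptor inherited across fork() behaves like the dup() alias here: the child's slot is private, the offset is not.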

3. Core Mechanism

Background: Most file I/O bugs are scope bugs: the programmer changes a descriptor-local thing when the shared open file description matters, or assumes a shared offset is private.

Plan: First decide whether a call creates a new open file description or aliases an existing one. Then decide whether the operation uses the implicit shared offset or an explicit offset. Finally, set close-on-exec atomically at creation time when a descriptor must not leak into a new program image.

Example: A shell opening out.txt for cmd > out.txt creates one write-only description, duplicates it into descriptor 1, closes the temporary FD, and then executes the command. The command never knows about the temporary descriptor; it just writes to stdout.
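
The redirection steps in the example can be sketched as follows. This is an illustrative sketch, not how any particular shell is written: run_redirected and redirect_demo are hypothetical names, and the demo assumes an echo binary on PATH and a writable /tmp.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv with stdout redirected to path, like
 * `cmd > path`. Returns the child's exit status, or -1 on failure. */
int run_redirected(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            _exit(126);
        if (fd != STDOUT_FILENO) {
            if (dup2(fd, STDOUT_FILENO) < 0)   /* descriptor 1 now aliases fd */
                _exit(126);
            close(fd);                         /* drop the temporary slot */
        }
        execvp(argv[0], argv);                 /* the command only sees stdout */
        _exit(127);                            /* reached only if exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Demo: redirect `echo hello` into a scratch file and read it back. */
int redirect_demo(void)
{
    char path[] = "/tmp/redirect_demo_XXXXXX";
    int tmp = mkstemp(path);
    assert(tmp >= 0);
    close(tmp);

    char *argv[] = { "echo", "hello", NULL };
    int rc = run_redirected(path, argv);
    assert(rc == 0);

    char buf[16] = { 0 };
    int fd = open(path, O_RDONLY);
    assert(fd >= 0);
    ssize_t n = read(fd, buf, sizeof buf - 1);
    assert(n == 6 && strcmp(buf, "hello\n") == 0);
    close(fd);
    unlink(path);
    return 0;
}
```

The child uses _exit() rather than exit() so that it never flushes stdio buffers inherited from the parent.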

open() Flags

open() combines access mode, creation policy, status flags on the open file description, and descriptor hygiene flags like O_CLOEXEC.

Flag	Meaning	Use It When	Hidden Cost or Pitfall
O_RDONLY / O_WRONLY / O_RDWR	Mutually exclusive access mode	Every open call	A read() or write() that the mode does not allow fails later with EBADF.
O_CREAT	Create file if missing	Creating output files	Requires a mode argument.
O_EXCL	Fail if target already exists	Race-free lock or create	Behavior is generally undefined without O_CREAT.
O_TRUNC	Truncate existing regular file	Replace output contents	Data is destroyed at open time.
O_TMPFILE	Create unnamed inode in a directory	Build complete file before publish	Filesystem support varies.
O_APPEND	Each write appends atomically	Concurrent log writers	lseek does not choose the write offset.
O_NONBLOCK	Do not sleep for readiness	Event loops and pipes	Callers must handle EAGAIN.
O_SYNC / O_DSYNC	Wait for storage durability	Database journals, critical logs	Latency rises sharply.
O_DIRECT	Bypass page cache	Database buffer pools	Buffer, length, and offset alignment are strict.
O_CLOEXEC	Set FD_CLOEXEC at creation	Multithreaded programs that fork+exec	Without it, descriptors leak to children.
O_NOFOLLOW	Fail if final path component is symlink	Security-sensitive path opens	Only protects the final component.
O_PATH	Path reference only	*at() syscalls and metadata	No read or write operations.

With O_DIRECT, the kernel avoids filling the page cache and asks the block layer to DMA directly into user pages, but it can reject unaligned buffers with EINVAL.

read/write Family

read() and write() may return fewer bytes than requested. pread() and pwrite() use explicit offsets, so they avoid races around lseek() plus I/O.
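
Because short counts are normal, callers usually wrap both calls in a loop. A minimal sketch follows; full_write, full_read, and full_io_demo are hypothetical names, and the demo pushes a small record through a pipe.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes are
 * accepted. Returns 0 on success, -1 with errno set on failure. */
int full_write(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before any byte: retry */
            return -1;
        }
        p += n;                    /* short write: advance and continue */
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical counterpart: read until len bytes arrive or EOF is seen.
 * Returns the number of bytes read, or -1 on error. */
ssize_t full_read(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, p + got, len - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;                 /* EOF: no writer left */
        got += (size_t)n;
    }
    return (ssize_t)got;
}

/* Demo: send one record through a pipe and read it back in full. */
int full_io_demo(void)
{
    int p[2];
    int rc = pipe(p);
    assert(rc == 0);
    const char msg[] = "short writes are normal";
    rc = full_write(p[1], msg, sizeof msg);
    assert(rc == 0);
    close(p[1]);                   /* lets full_read() stop at EOF */

    char buf[64];
    ssize_t n = full_read(p[0], buf, sizeof buf);
    assert(n == (ssize_t)sizeof msg && strcmp(buf, msg) == 0);
    close(p[0]);
    return 0;
}
```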

readv() and writev() amortize syscall overhead by scattering or gathering bytes across multiple buffers in one kernel entry.

copy_file_range() copies bytes between two file descriptors inside the kernel. On supporting filesystems it can avoid a user-space bounce buffer and may turn into a filesystem-level extent copy.

lseek and Sparse Files

lseek() changes the open file description offset. Seeking beyond EOF followed by a small write creates a sparse file: logical size grows, but the hole consumes no blocks until data is actually written there.

Filesystems that support SEEK_DATA and SEEK_HOLE let tools skip holes efficiently, and fallocate() can punch holes while preserving apparent size.

dup, dup2, dup3

dup() chooses the lowest free descriptor, dup2() targets an exact descriptor and atomically closes it first, and dup3() adds race-free O_CLOEXEC.

Pipes and FIFOs

A pipe is a kernel ring of pipe_buffer entries with two descriptors: one readable end and one writable end. Bytes never live in either process address space unless a reader or writer copies them through a syscall.

Background: Pipes are the default Unix backpressure primitive. Shell pipelines, logging helpers, and parent-child protocols all depend on the same close semantics: no writers means EOF; no readers means SIGPIPE or EPIPE.

Plan: Create the pipe before fork(), close unused ends in both processes, write records no larger than PIPE_BUF when multiple writers share the pipe, and treat a zero-length read as final EOF.

Example: A parent writes two log lines and closes the write end. The child drains the buffer, then its next read returns 0 because the last writer reference is gone.

POSIX requires writes of at most PIPE_BUF bytes to be atomic with respect to other writers (PIPE_BUF is 4096 on Linux), which is why line-oriented logging over a shared pipe works only when each record is no larger than that bound.

A FIFO gives the same pipe behavior a persistent filesystem name. Opening the read side normally blocks until a writer appears, and opening the write side normally blocks until a reader appears.
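
Both close rules can be seen in a single process. The sketch below uses a hypothetical helper name, pipe_close_demo, and ignores SIGPIPE for the duration of the demo so that the failed write reports EPIPE instead of killing the process.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical demo helper: returns 0 after observing EOF and EPIPE. */
int pipe_close_demo(void)
{
    int p[2];
    char buf[8];
    int rc = pipe(p);
    assert(rc == 0);

    /* No writers: buffered bytes drain first, then read() returns 0. */
    ssize_t n = write(p[1], "log\n", 4);
    assert(n == 4);
    close(p[1]);                          /* last writer reference is gone */
    n = read(p[0], buf, sizeof buf);
    assert(n == 4);
    n = read(p[0], buf, sizeof buf);
    assert(n == 0);                       /* EOF */
    close(p[0]);

    /* No readers: write() fails with EPIPE once SIGPIPE is ignored. */
    void (*old)(int) = signal(SIGPIPE, SIG_IGN);
    rc = pipe(p);
    assert(rc == 0);
    close(p[0]);                          /* last reader reference is gone */
    n = write(p[1], "x", 1);
    assert(n == -1 && errno == EPIPE);
    close(p[1]);
    signal(SIGPIPE, old);                 /* restore previous disposition */
    return 0;
}
```

After fork(), the same rules apply across processes, which is why an unused write end left open in the reader prevents EOF from ever arriving.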

Zero-Copy Plumbing

sendfile() lets static servers move file data from the page cache toward a socket without first copying it into a user buffer.

splice(), tee(), and vmsplice() generalize the idea by using a pipe as the page-reference conduit between file descriptors.
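
A minimal sendfile() sketch follows. It assumes Linux, a writable /tmp, and uses an AF_UNIX socketpair in place of a real client connection; sendfile_demo is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo helper: returns 0 after moving a small file into a
 * socket without a user-space read()/write() bounce buffer. */
int sendfile_demo(void)
{
    char path[] = "/tmp/sendfile_demo_XXXXXX";
    int file = mkstemp(path);
    assert(file >= 0);
    const char body[] = "static payload";
    ssize_t n = write(file, body, sizeof body);
    assert(n == (ssize_t)sizeof body);

    int sv[2];                             /* stand-in for a client socket */
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);

    off_t off = 0;                         /* explicit offset, advanced by the call */
    ssize_t sent = sendfile(sv[0], file, &off, sizeof body);
    assert(sent == (ssize_t)sizeof body);
    assert(off == (off_t)sizeof body);

    char buf[64];
    n = read(sv[1], buf, sizeof buf);      /* receiver side */
    assert(n == (ssize_t)sizeof body && strcmp(buf, body) == 0);

    close(sv[0]);
    close(sv[1]);
    close(file);
    unlink(path);
    return 0;
}
```

A real server would loop on sendfile() because it can also return a short count.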

fcntl and File Locks

fcntl() is a descriptor control multiplexer. Some commands modify FD-local state like FD_CLOEXEC, while others modify open-file-description state like O_NONBLOCK.

Command Family	Scope	Typical Use	Pitfall
F_GETFD / F_SETFD	FD slot	Set `FD_CLOEXEC`	Not shared by dup aliases unless set on each descriptor.
F_GETFL / F_SETFL	Open file description	Toggle `O_NONBLOCK` or `O_APPEND`	Affects every dup or fork alias of that description.
F_SETPIPE_SZ	Pipe object	Increase pipe capacity for bursty producers	Capped by `/proc/sys/fs/pipe-max-size` and memory limits.
F_SETLK / F_SETLKW	Process record locks	Advisory byte-range locking	Closing any FD for that file in the process can drop all locks.
F_OFD_SETLK	Open file description locks	Thread-safe byte-range locking	Requires Linux OFD-lock support.
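
The first two rows of the table can be checked with one dup() alias. This is a sketch with a hypothetical helper name, fcntl_scope_demo; a pipe is used only because it needs no filesystem setup.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical demo helper: returns 0 after showing that F_SETFL is
 * visible through a dup() alias while F_SETFD is not. */
int fcntl_scope_demo(void)
{
    int p[2];
    int rc = pipe(p);
    assert(rc == 0);
    int alias = dup(p[0]);                 /* second slot, same description */
    assert(alias >= 0);

    /* F_SETFL changes a status flag on the shared description. */
    int fl = fcntl(p[0], F_GETFL);
    assert(fl >= 0 && !(fl & O_NONBLOCK));
    rc = fcntl(p[0], F_SETFL, fl | O_NONBLOCK);
    assert(rc == 0);
    int alias_fl = fcntl(alias, F_GETFL);
    assert(alias_fl >= 0 && (alias_fl & O_NONBLOCK));   /* alias sees it */

    /* F_SETFD changes only the descriptor slot it is applied to. */
    rc = fcntl(p[0], F_SETFD, FD_CLOEXEC);
    assert(rc == 0);
    int own_fd = fcntl(p[0], F_GETFD);
    int alias_fd = fcntl(alias, F_GETFD);
    assert(own_fd >= 0 && (own_fd & FD_CLOEXEC));
    assert(alias_fd >= 0 && !(alias_fd & FD_CLOEXEC));  /* alias unaffected */

    close(alias);
    close(p[0]);
    close(p[1]);
    return 0;
}
```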

File locks differ sharply in scope: flock() is whole-file and tied to the open file description, classic POSIX record locks are per-process, and OFD locks are byte-range locks tied to the open file description.

Lock Type	Granularity	Ownership	Thread-Safe?
flock	whole file	open file description	Per description: threads using one description share the lock; separate open() calls contend.
POSIX fcntl lock	byte range	process plus inode	No; unrelated closes can release locks.
OFD fcntl lock	byte range	open file description	Yes; survives unrelated closes.

The classic POSIX lock trap is closing a second descriptor for the same file and accidentally releasing locks that were acquired through the first descriptor.
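
The trap can be reproduced with a probe process, since a process cannot observe its own record locks through F_GETLK. The sketch below assumes Linux and a writable /tmp; other_process_can_lock and posix_lock_trap_demo are hypothetical names.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical probe: fork a child that tries a non-blocking write lock
 * on the whole file. Returns 1 if the child got the lock, 0 otherwise. */
static int other_process_can_lock(const char *path)
{
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open(path, O_RDWR);
        if (fd < 0)
            _exit(2);
        _exit(fcntl(fd, F_SETLK, &fl) == 0 ? 0 : 1);
    }
    int status = 0;
    pid_t done = waitpid(pid, &status, 0);
    assert(done == pid);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

/* Hypothetical demo helper: returns 0 after showing that closing an
 * unrelated descriptor releases a classic POSIX record lock. */
int posix_lock_trap_demo(void)
{
    char path[] = "/tmp/lock_demo_XXXXXX";
    int fd1 = mkstemp(path);
    assert(fd1 >= 0);

    /* l_start = 0 and l_len = 0 cover the whole file. */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int rc = fcntl(fd1, F_SETLK, &fl);
    assert(rc == 0);
    int probe = other_process_can_lock(path);
    assert(probe == 0);                    /* lock is held by this process */

    int fd2 = open(path, O_RDONLY);        /* e.g. a library peeking at the file */
    assert(fd2 >= 0);
    close(fd2);                            /* drops ALL our record locks on it */

    probe = other_process_can_lock(path);
    assert(probe == 1);                    /* fd1 is still open, lock is gone */

    close(fd1);
    unlink(path);
    return 0;
}
```

Switching the F_SETLK call to F_OFD_SETLK (Linux-specific, needs _GNU_SOURCE) ties the lock to the description behind fd1, so the unrelated close no longer releases it.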

mmap and Memory-Mapped I/O

mmap() installs a VMA backed by the file; page faults then fill in page table entries that point at file-backed page-cache pages. With MAP_SHARED, stores dirty the shared cache page and can be persisted with msync(); with MAP_PRIVATE, the first write takes a private COW copy.

Background: mmap is attractive when a program wants random access to a file as memory, but the first access to each missing page still has to fault and populate the page cache.

Plan: Create the mapping, touch pages only as needed, use madvise() to describe expected access, and call msync() or fsync() when durability matters.

Example: A parser maps a 200 MB index file, jumps to offset 96 MB, faults one page, and resumes with a normal load instruction after the kernel installs a PTE for that file offset.

The choice between read()/write() and mmap() is a trade-off between explicit syscall/copy control and fault-driven memory access.
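
A MAP_SHARED store followed by msync() can be sketched as follows, assuming Linux and a writable /tmp; mmap_shared_demo is a hypothetical name. The file is sized with ftruncate() first because touching a mapped page beyond EOF raises SIGBUS.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo helper: returns 0 after a store through a shared
 * mapping becomes visible to pread() on the same file. */
int mmap_shared_demo(void)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";
    int fd = mkstemp(path);                /* opened read-write */
    assert(fd >= 0);
    const size_t len = 4096;
    int rc = ftruncate(fd, (off_t)len);    /* give the mapping real file pages */
    assert(rc == 0);

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    assert(map != MAP_FAILED);

    memcpy(map, "mapped", 6);              /* first touch faults the page in */
    rc = msync(map, len, MS_SYNC);         /* request writeback of the dirty page */
    assert(rc == 0);

    char buf[6];
    ssize_t n = pread(fd, buf, sizeof buf, 0);   /* same page cache, other API */
    assert(n == 6 && memcmp(buf, "mapped", 6) == 0);

    munmap(map, len);
    close(fd);
    unlink(path);
    return 0;
}
```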

Directory Operations

readdir() is a libc wrapper over buffered directory records returned by getdents64(). Filesystems store variable-length directory entries, often behind an index such as ext4 htree for large directories.

The *at() family solves a real race: once a directory is open as a stable dirfd, relative operations do not depend on re-resolving a parent path that another process can rename.

File Metadata

stat(), fstat(), and lstat() report identity, type, permissions, size, link count, owner, and timestamps; statx() adds masks and optional birth time.

Time	Updated By	Notes
atime	Successful file data reads	`relatime` reduces updates; `noatime` disables most of them.
mtime	Content modification	Changes after write, truncate, mmap dirty writeback, or similar content updates.
ctime	Inode status changes	Changes after chmod, chown, link count changes, rename, and any content or size change.
btime	File creation	Available through `statx()` only when the filesystem reports it.

Permissions, Ownership, and ACLs

Unix mode uses four octal digits: one digit for setuid, setgid, and sticky bits, then three rwx triplets for user, group, and other.

A setuid binary runs with the file owner as its effective UID from exec onward. Modern systems often replace broad setuid root with file capabilities stored in security.capability, while POSIX ACLs live under system.posix_acl_access and system.posix_acl_default.

Links and Atomic Path Operations

A hard link is another directory entry pointing at the same inode, so both names share metadata and data and the inode survives until the last link and last open reference disappear.

A symbolic link is a separate inode whose payload is a pathname string. It can cross filesystems and can dangle if the target path is removed.

rename() is atomic within one filesystem, which makes the write-temp, fsync, rename, fsync-directory pattern the standard way to publish a complete replacement file. An open-then-unlink temporary file has no directory name but remains usable until the last descriptor closes.

Filesystem Notification

inotify turns filesystem changes into readable records on a file descriptor. A watcher registers paths with masks such as IN_CREATE, IN_MODIFY, and IN_DELETE, then reads struct inotify_event entries from the queue.

Recursive watching is not automatic: every subdirectory needs its own watch descriptor, and a newly-created directory must be detected and registered before events inside it can be observed.
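
A single-directory watch can be sketched as follows, assuming Linux with inotify available and a writable /tmp; inotify_demo is a hypothetical name. Only the first event in the buffer is inspected.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical demo helper: returns 0 after receiving an IN_CREATE
 * record for a file created inside a watched directory. */
int inotify_demo(void)
{
    char dir[] = "/tmp/inotify_demo_XXXXXX";
    char *made = mkdtemp(dir);
    assert(made != NULL);

    int ifd = inotify_init1(IN_CLOEXEC);
    assert(ifd >= 0);
    int wd = inotify_add_watch(ifd, dir, IN_CREATE | IN_DELETE);
    assert(wd >= 0);                       /* watch covers this directory only */

    char path[64];
    snprintf(path, sizeof path, "%s/new.txt", dir);
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    assert(fd >= 0);
    close(fd);

    /* Records are variable length: a fixed header plus a padded name. */
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));
    ssize_t n = read(ifd, buf, sizeof buf);
    assert(n >= (ssize_t)sizeof(struct inotify_event));
    const struct inotify_event *ev = (const struct inotify_event *)buf;
    assert(ev->wd == wd && (ev->mask & IN_CREATE));
    assert(ev->len > 0 && strcmp(ev->name, "new.txt") == 0);

    close(ifd);
    unlink(path);
    rmdir(dir);
    return 0;
}
```

A recursive watcher would react to IN_CREATE events that carry IN_ISDIR by adding a watch for the new subdirectory.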

fanotify sits at a broader filesystem layer and can deliver permission events, which is why antivirus and audit tools can inspect or deny opens before the target process continues. Large trees watched through inotify hit limits such as /proc/sys/fs/inotify/max_user_watches.

Special Files for Event Loops

Linux exposes many kernel services as ordinary FDs: eventfd for counters and wakeups, signalfd for signals as records, timerfd for readable timer expirations, and memfd_create for anonymous sealable files.

A sealed memfd is useful when one process wants to share bytes but prevent later mutation: write the payload, add seals such as F_SEAL_WRITE, pass the FD over a Unix socket with SCM_RIGHTS, and let the receiver map it read-only.
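
The sealing steps can be sketched in one process, leaving out the SCM_RIGHTS hand-off. This assumes Linux with a glibc that exposes memfd_create() under _GNU_SOURCE; sealed_memfd_demo is a hypothetical name. The read-only view uses MAP_PRIVATE as a conservative choice, on the assumption that some kernels refuse shared mappings of a write-sealed memfd.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo helper: returns 0 after sealing a memfd and
 * confirming that later writes are rejected. */
int sealed_memfd_demo(void)
{
    /* MFD_ALLOW_SEALING is required, or F_ADD_SEALS fails. */
    int fd = memfd_create("payload", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    assert(fd >= 0);
    ssize_t n = write(fd, "frozen", 6);    /* fill in the payload first */
    assert(n == 6);

    int rc = fcntl(fd, F_ADD_SEALS,
                   F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL);
    assert(rc == 0);

    n = write(fd, "x", 1);                 /* mutation is now refused */
    assert(n == -1 && errno == EPERM);

    /* Receiver side: a read-only view of the immutable bytes. */
    void *view = mmap(NULL, 6, PROT_READ, MAP_PRIVATE, fd, 0);
    assert(view != MAP_FAILED);
    assert(memcmp(view, "frozen", 6) == 0);

    munmap(view, 6);
    close(fd);
    return 0;
}
```

A receiver can call fcntl(fd, F_GET_SEALS) to verify the seals before trusting the contents.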

Filesystem Containment

Container startup changes the process view of the filesystem with a private mount namespace and pivot_root(): the new root is mounted, the old root is moved under put_old, then the old mount is detached before executing the target program.

chroot() only changes path lookup root. It is not a complete sandbox because a process that kept a directory FD to the old root can climb back out; hardened path opens use directory FDs plus openat2() constraints such as RESOLVE_BENEATH.

Extended Attributes

Extended attributes attach small named byte strings to an inode. The namespace prefix controls who owns the meaning: applications commonly use user.*, ACLs use system.*, and LSMs or capabilities use security.*.

Namespace	Common Entries	Who Uses It	Notes
user.*	user.tag, user.comment	Applications and users	Controlled by normal file permissions and mount support.
system.*	system.posix_acl_access	Kernel and filesystem helpers	Stores ACLs and filesystem-managed metadata.
security.*	security.selinux, security.capability	LSMs, file capabilities, IMA	Often requires privilege or LSM policy permission.
trusted.*	trusted.overlay.*	Root-only tools and filesystems	Requires `CAP_SYS_ADMIN`.

stdio vs Raw File Descriptors

FILE* is a libc buffer and formatting layer around an underlying file descriptor. fileno() exposes that FD, fdopen() wraps an FD as a stream, and freopen() redirects an existing stream such as stdout.

Background: stdio buffering improves throughput, but it means data can sit in user memory while raw write() calls on the same FD pass it and reach the kernel first.

Plan: Pick one abstraction per descriptor, flush before mixing layers, and flush all inherited streams before fork() when both parent and child might exit through libc.

Example: A process prints the string "pending" without a newline, forks, and both processes later call exit(). The unflushed bytes were copied along with the rest of the user-space buffer state, so both processes flush the same text and it appears twice.

Buffer Mode	Common Default	Flush Trigger	Risk
_IOFBF	regular files and redirected stdout	buffer full, fflush, fclose, exit	Raw writes can appear before earlier printf bytes.
_IOLBF	stdout connected to a tty	newline, buffer full, explicit flush	Prompt text without newline can remain hidden.
_IONBF	stderr on many systems	each stdio call writes promptly	More syscalls and lower throughput.

tty and pty

A tty is a terminal device with a kernel line discipline that can echo input, edit cooked-mode lines, and translate control characters such as Ctrl-C into signals. A pty pair splits that terminal into a controller-facing master FD and a program-facing slave FD.

Terminal hosts such as sshd, tmux, IDE panes, and expect use forkpty() or posix_openpt() plus grantpt(), unlockpt(), and ptsname(). The child sees the slave as a real controlling terminal while the host reads and writes the master.

/proc and /sys File Interfaces

/proc is a virtual filesystem generated by the kernel at read time. /proc/self/maps exposes VMAs, /proc/self/fd exposes open descriptors as symlinks, and /proc/sys exposes sysctl tunables.

/sys is sysfs: a kobject attribute tree. Device drivers publish small attributes as files, so reading /sys/class/net/eth0/operstate samples link state and writing mtu calls the driver setter path.

epoll Deep Dive

epoll keeps persistent kernel state: an interest set for registered FDs and a ready list for events already observed. epoll_wait() returns ready entries instead of rescanning every FD like select() or poll().

Background: Edge-triggered epoll is fast because it wakes on readiness transitions, but that means a half-drained FD may never produce another edge.

Plan: Put every ET FD in nonblocking mode, handle the readiness event, then accept or read in a loop until EAGAIN. Use EPOLLONESHOT when worker threads need explicit re-arming and EPOLLEXCLUSIVE to avoid waking many accept waiters for one connection.

Example: A socket receives 6 KB. The event loop reads only 1 KB and returns to epoll_wait(). Because the FD is still ready and no new not-ready-to-ready transition happened, the remaining 5 KB sits unread until more data arrives, and stalls forever if the peer is waiting for a reply.
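
The stall and its fix can be sketched with a socketpair standing in for a network connection. This assumes Linux; epoll_et_demo is a hypothetical name, and the 100 ms timeout is an arbitrary value used only to show that no second event arrives.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo helper: returns 0 after showing that a half-drained
 * edge-triggered FD produces no further event, then draining to EAGAIN. */
int epoll_et_demo(void)
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    int fl = fcntl(sv[0], F_GETFL);
    rc = fcntl(sv[0], F_SETFL, fl | O_NONBLOCK);   /* ET needs nonblocking I/O */
    assert(rc == 0);

    int ep = epoll_create1(EPOLL_CLOEXEC);
    assert(ep >= 0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = sv[0] };
    rc = epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev);
    assert(rc == 0);

    char payload[6144];                    /* the 6 KB from the example */
    memset(payload, 'x', sizeof payload);
    ssize_t n = write(sv[1], payload, sizeof payload);
    assert(n == (ssize_t)sizeof payload);

    struct epoll_event got;
    rc = epoll_wait(ep, &got, 1, 1000);
    assert(rc == 1);                       /* the not-ready-to-ready edge */

    char buf[1024];
    n = read(sv[0], buf, sizeof buf);      /* the bug: read only 1 KB */
    assert(n == 1024);
    rc = epoll_wait(ep, &got, 1, 100);
    assert(rc == 0);                       /* 5 KB still queued, no new edge */

    size_t drained = 0;                    /* the fix: loop until EAGAIN */
    for (;;) {
        n = read(sv[0], buf, sizeof buf);
        if (n > 0) {
            drained += (size_t)n;
            continue;
        }
        assert(n == -1 && errno == EAGAIN);
        break;
    }
    assert(drained == 5120);

    close(ep);
    close(sv[0]);
    close(sv[1]);
    return 0;
}
```

Dropping EPOLLET from the registration makes the second epoll_wait() report the FD again, which is the level-triggered behavior.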

API	Per-Wait Cost	Scaling	Best Use
select	copy and scan bitsets	limited by fd set size	Small portable programs.
poll	copy and scan array	O(N) per wait	Moderate FD counts with portable semantics.
epoll LT	returns ready list	O(ready)	Safe high-FD-count event loops.
epoll ET	returns readiness edges	O(ready transitions)	High-performance loops that drain to EAGAIN.

io_uring Deep Dive

io_uring maps two rings into user space: the submission queue carries SQEs from user to kernel, and the completion queue carries CQEs from kernel to user. In the steady state, many operations can be submitted and completed with fewer syscalls than epoll plus separate reads or writes.

With IORING_SETUP_SQPOLL, a kernel polling thread watches the submission queue and starts work without the application entering the kernel for every batch. The trade-off is a CPU-consuming poller and stricter setup/permission constraints.

Registered buffers skip repeated page pinning, registered files skip repeated FD lookup, multishot operations can produce many completions from one submission, and IOSQE_IO_LINK expresses ordered chains such as open, read, and close.

Model	What Is Submitted	Syscall Pattern	Where It Wins
epoll + read/write	readiness interest, then synchronous operations	wait syscall plus I/O syscalls	Socket servers with simple operations and broad portability.
io_uring	actual operations: read, write, accept, fsync, openat, splice	batched enter, or SQPOLL hot path	High-throughput mixed file/network I/O with batching and fixed resources.
libaio / KAIO	direct-I/O requests	submit and reap syscalls	Legacy database O_DIRECT workloads; poor fit for buffered I/O.

4. Minimal C Demo

These demos isolate the interview-critical behavior: shared offsets, atomic temporary creation, short-write loops, sparse files, shell-style redirection, mmap persistence, notification FDs, stdio buffering, ptys, epoll drain rules, and io_uring ring mechanics.

dup Shares Offset, open Does Not — C Demo

O_TMPFILE then linkat Publish — C Demo

full_write Loop for Short Writes — C Demo

Create a 1 GB Sparse File — C Demo

Implement cmd > file with dup2 — C Demo

pipe, EOF, and Broken Pipe — C Demo

FIFO Producer and Consumer — C Demo

sendfile to a Socket — C Demo

Toggle O_NONBLOCK with fcntl — C Demo

flock Exclusive Lock Contention — C Demo

MAP_SHARED mmap then msync — C Demo

Recursive opendir and readdir Walk — C Demo

statx Metadata and Birth Time — C Demo

setgid Directory Group Inheritance — C Demo

Atomic Update with fsync and rename — C Demo

inotify Events from /tmp — C Demo

epoll with socket, timerfd, signalfd, eventfd — C Demo

chroot Jail Setup and Root-Gated Switch — C Demo

xattr user.tag Round Trip — C Demo

stdio Buffering vs Raw write — C Demo

forkpty Minimal Terminal Automation — C Demo

Parse /proc/self/maps — C Demo

epoll ET Drain to EAGAIN — C Demo

Raw io_uring SQ/CQ Completion — C Demo

5. Kernel Source Pointers

Topic	Files and Functions	What to Read For
FD table and open	`fs/open.c`, `do_sys_openat2()`; `fs/namei.c`, `do_filp_open()`	How path lookup creates a `struct file` and installs it into the descriptor table.
Descriptor allocation	`fs/file.c`, `alloc_fd()`, `fd_install()`, `do_close_on_exec()`	FD bitmaps, close-on-exec state, and fork/exec descriptor handling.
read/write syscalls	`fs/read_write.c`, `ksys_read()`, `ksys_write()`, `vfs_read()`, `vfs_write()`	Short count behavior, position updates, and vector I/O entry points.
dup family	`fs/file.c`, `do_dup2()`, `replace_fd()`	How a descriptor slot is made to reference an existing open file description.
Sparse files	`fs/read_write.c`, `vfs_llseek()`; filesystem `llseek` methods	Where `SEEK_DATA` and `SEEK_HOLE` are delegated to filesystem code.
Pipes and FIFOs	`fs/pipe.c`, `do_pipe2()`, `pipe_read()`, `pipe_write()`	Pipe ring accounting, EOF and broken-pipe rules, and pipe capacity changes.
Zero-copy plumbing	`fs/read_write.c`, `do_sendfile()`; `fs/splice.c`, `do_splice()`, `do_tee()`	How page references move between files, pipes, and sockets without user-space buffers.
fcntl and locks	`fs/fcntl.c`, `do_fcntl()`; `fs/locks.c`, `fcntl_setlk()`, `flock_lock_inode()`	Command dispatch, FD versus file-description scope, and lock ownership semantics.
mmap	`mm/mmap.c`, `do_mmap()`; `mm/filemap.c`, `filemap_fault()`	How VMAs are installed and how file-backed page faults populate page cache pages.
Directory reads and path lookup	`fs/readdir.c`, `iterate_dir()`; `fs/namei.c`, `path_openat()`	Directory iteration, `getdents64`, and relative lookup through `openat`.
Metadata and permissions	`fs/stat.c`, `vfs_statx()`; `fs/attr.c`, `notify_change()`	How stat/statx fields are collected and how chmod/chown update inode attributes.
Links and rename	`fs/namei.c`, `vfs_link()`, `vfs_symlink()`, `vfs_rename()`	Hard link counts, symlink creation, and same-filesystem atomic rename behavior.
Filesystem notification	`fs/notify/`, `fsnotify()`; `fs/notify/inotify/`; `fs/notify/fanotify/`	How VFS events become queued records and how fanotify permission events can block opens.
Special event FDs	`fs/eventfd.c`, `fs/timerfd.c`, `fs/signalfd.c`, `mm/memfd.c`	Counter wakeups, timer expiration reads, signal records, and memfd sealing rules.
Root switching and path containment	`fs/open.c`, `ksys_chroot()`, `openat2` syscall entry; `fs/namespace.c`, `pivot_root()`; `fs/namei.c`, `RESOLVE_*` handling	Why chroot changes lookup state, how pivot_root moves mounts, and how openat2 resolver flags reject escapes.
Extended attributes	`fs/xattr.c`, `vfs_setxattr()`, `vfs_getxattr()`, `listxattr()`	Namespace permission checks, filesystem callbacks, and ACL/capability storage.
stdio wrappers	`glibc/libio/`, `_IO_file_xsputn()`, `_IO_new_file_write()`	How libc buffering batches user writes before calling the kernel `write` path.
tty and pty	`drivers/tty/tty_io.c`, `drivers/tty/pty.c`, `drivers/tty/n_tty.c`	TTY allocation, pseudo-terminal master/slave plumbing, and line discipline behavior.
/proc and /sys	`fs/proc/`, `proc_pid_make_inode()`; `fs/sysfs/`, `sysfs_create_file_ns()`	How virtual process files and kobject attributes are generated on demand.
epoll	`fs/eventpoll.c`, `do_epoll_create()`, `ep_insert()`, `ep_poll()`	Interest tree management, ready-list wakeups, edge-triggered delivery, and exclusive waits.
io_uring	`io_uring/io_uring.c`, `io_uring_setup()`, `io_submit_sqes()`, `io_cqring_ev_posted()`	Shared SQ/CQ ring setup, SQE consumption, CQE publication, SQPOLL, and registered resources.

6. Interview Prep

Question	Concise Answer
What is shared by `dup()`?	The new descriptor points to the same open file description, so file offset and status flags are shared.
Why is `pread()` safer than `lseek()` plus `read()`?	It performs I/O at an explicit offset without changing the shared file offset, avoiding races between threads or forked children.
What does `O_APPEND` guarantee?	For regular files, the kernel moves each write to EOF atomically with the write operation, so concurrent writers do not overwrite each other.
Why use `O_CLOEXEC` instead of `fcntl(F_SETFD)` after open?	It closes the race where another thread forks and execs between open and the later fcntl call.
What is a sparse file?	A file whose logical size includes holes that read as zeros but have no allocated disk blocks until real data is written.
What happens when the last pipe writer closes?	After buffered bytes are drained, readers get a zero-length read, which is EOF.
What does `PIPE_BUF` guarantee?	Concurrent writes of at most `PIPE_BUF` bytes to a pipe are atomic; larger writes may interleave.
Why does `sendfile()` help static file servers?	It avoids copying file bytes into a user-space buffer before sending them to the socket.
Which `fcntl()` flags are FD-local versus description-local?	`FD_CLOEXEC` is FD-local; status flags like `O_NONBLOCK` and `O_APPEND` live on the open file description.
Why are POSIX record locks dangerous in multithreaded code?	They are process-owned, so closing any descriptor for the same file in that process can release all its record locks.
How does `MAP_SHARED` differ from `MAP_PRIVATE`?	`MAP_SHARED` stores modify shared file-backed pages; `MAP_PRIVATE` writes trigger copy-on-write and do not update the file.
Why use `openat()` with a directory FD?	It anchors lookup to an already-open directory, avoiding races where the parent path is renamed or replaced.
What updates atime, mtime, and ctime?	Reads update atime, content changes update mtime, and inode metadata or size changes update ctime.
Hard link versus symlink?	A hard link is another name for the same inode on the same filesystem; a symlink is a separate inode containing a path string and can cross filesystems.
Why does atomic file update fsync both file and directory?	The file fsync persists contents; the directory fsync persists the renamed directory entry after publication.
inotify versus fanotify?	inotify watches paths and reports events after they happen; fanotify can observe broader filesystem activity and can issue permission events that allow or deny opens.
Why are `eventfd`, `signalfd`, and `timerfd` useful?	They convert wakeups, signals, and timers into readable FDs, so one epoll loop can handle them uniformly with sockets and pipes.
What does a sealed `memfd` buy you?	It gives processes shared anonymous file-backed memory that can be made immutable before the FD is handed to another process.
How can `chroot()` be escaped?	If a process keeps a directory FD outside the jail, it can `fchdir()` back and walk out; mount namespaces, pivot_root cleanup, and constrained openat2 lookups close that class of bug.
What belongs in `security.*` xattrs?	Security labels and enforcement metadata such as SELinux labels, IMA state, and file capabilities.
Why can mixing `printf()` and `write()` reorder output?	`printf()` writes into a user-space `FILE*` buffer, while `write()` enters the kernel immediately unless the stdio buffer is flushed first.
What is the fork double-flush pitfall?	Unflushed stdio buffers are copied into the child, so both parent and child can flush the same pending bytes during `exit()`.
What is a pty master versus slave?	The slave behaves like a terminal for the child program; the master is held by the controller that feeds input and reads terminal output.
How does `/proc/self/fd` help debug FD leaks?	It exposes each live descriptor as a symlink to its target, making leaked files, sockets, pipes, and deleted-but-open files visible.
epoll LT versus ET?	Level-triggered epoll keeps reporting an FD while it remains ready; edge-triggered epoll reports readiness transitions, so nonblocking handlers must drain to `EAGAIN`.
What does `EPOLLONESHOT` solve?	It disables the FD after one event so a worker can process it exclusively, then re-arm it with `EPOLL_CTL_MOD`.
What does `EPOLLEXCLUSIVE` solve?	It prevents a thundering herd by waking only one epoll waiter for a shared ready source such as a listening socket.
Why did io_uring replace libaio for many workloads?	io_uring supports a broader operation set, buffered I/O, sockets, batching, linked operations, fixed buffers/files, and shared-ring completions instead of the narrow KAIO O_DIRECT focus.
Why is SQPOLL called a zero-syscall hot path?	The application advances the shared SQ tail and a kernel polling thread consumes submissions without requiring `io_uring_enter()` for each batch.

§ 19.1-19.25 File Descriptors, mmap, epoll, io_uring, and Unix Plumbing
