§ 19.1-19.25 File Descriptors, mmap, epoll, io_uring, and Unix Plumbing
The Unix file model is a small integer in user space backed by shared kernel objects, virtual memory mappings, metadata, path operations, notification queues, terminal devices, and high-performance event rings.
1. Overview
File I/O starts with an integer file descriptor, but the kernel work happens below it: a per-process descriptor table points to an open file description, the description points through VFS to an inode, and the inode owns metadata plus page-cache mappings.
2. Key Data Structures
The important distinction is descriptor slot versus open file description. dup() and fork() create new descriptor references to the same description, while every open() call creates a fresh description, even when it names the same path.
| Kernel Object | Main Fields | Scope | Why It Matters |
|---|---|---|---|
| FD table | array index, close-on-exec bit, pointer to file | per process, copied on fork | Controls which integer names are open and which survive exec. |
| Open file description | file offset, status flags, refcount, access mode | shared by dup and fork aliases | O_APPEND, O_NONBLOCK, and offsets live here. |
| inode | device, inode number, mode, size, mapping | filesystem object | Identifies the file and connects I/O to metadata and page cache. |
| page cache | file offset to page, dirty state, writeback state | shared kernel cache | Normal reads and writes are cached here before storage I/O. |
fork Copies Descriptor Slots, Not Offsets
After fork(), parent and child have separate FD tables, but each copied slot points at the same open file description, so offset movement is visible to both.
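The shared offset can be observed directly. The sketch below assumes Linux with glibc defaults; the function name and the /tmp template are hypothetical. The child writes through its copied descriptor, and the parent, which never writes, then asks for its own offset.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's offset after the child wrote 5 bytes, or -1 on error. */
off_t offset_after_child_write(void)
{
    char path[] = "/tmp/fork_offset_XXXXXX";   /* hypothetical scratch path */
    int fd = mkstemp(path);                     /* creates one open file description */
    if (fd < 0)
        return -1;
    unlink(path);                               /* the inode lives until fd is closed */

    pid_t pid = fork();                         /* child: new FD slot, same description */
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        ssize_t n = write(fd, "child", 5);      /* advances the shared offset */
        _exit(n == 5 ? 0 : 1);
    }

    int status = 0;
    off_t pos = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0)
        pos = lseek(fd, 0, SEEK_CUR);           /* parent never wrote, yet this is 5 */
    close(fd);
    return pos;
}
```

Calling offset_after_child_write() should report 5, because the child's write moved the one offset both slots refer to.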
3. Core Mechanism
Background: Most file I/O bugs are scope bugs: the programmer changes a descriptor-local thing when the shared open file description matters, or assumes a shared offset is private.
Plan: First decide whether a call creates a new open file description or aliases an existing one. Then decide whether the operation uses the implicit shared offset or an explicit offset. Finally, set close-on-exec atomically at creation time when a descriptor must not leak into a new program image.
Example: A shell opening out.txt for cmd > out.txt creates one write-only description, duplicates it into descriptor 1, closes the temporary FD, and then executes the command. The command never knows about the temporary descriptor; it just writes to stdout.
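The redirection sequence can be sketched as follows. This is an illustration rather than how any particular shell is written: the function names and the /tmp template are hypothetical, and it assumes an echo binary is reachable through PATH.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs `echo hello > path`; returns the command's exit status or -1. */
static int run_echo_redirected(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            _exit(126);
        if (fd != STDOUT_FILENO) {
            if (dup2(fd, STDOUT_FILENO) < 0)    /* FD 1 now aliases the description */
                _exit(126);
            close(fd);                          /* drop the temporary descriptor */
        }
        execlp("echo", "echo", "hello", (char *)NULL);
        _exit(127);                             /* only reached if exec failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Returns 0 if the redirected command's output landed in the file. */
int redirect_demo(void)
{
    char path[] = "/tmp/redirect_demo_XXXXXX";  /* hypothetical scratch path */
    char buf[16] = {0};
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (run_echo_redirected(path) == 0 &&
        pread(fd, buf, sizeof buf - 1, 0) == 6 &&
        strcmp(buf, "hello\n") == 0)
        rc = 0;
    close(fd);
    unlink(path);
    return rc;
}
```

The descriptor produced by dup2() does not carry FD_CLOEXEC, which is why descriptor 1 survives the exec while the temporary one is already gone.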
open() Flags
open() combines access mode, creation policy, status flags on the open file description, and descriptor hygiene flags like O_CLOEXEC.
| Flag | Meaning | Use It When | Hidden Cost or Pitfall |
|---|---|---|---|
| O_RDONLY / O_WRONLY / O_RDWR | Mutually exclusive access mode | Every open call | A later read() or write() that the mode does not permit fails with EBADF. |
| O_CREAT | Create file if missing | Creating output files | Requires a mode argument; omitting it yields arbitrary permission bits. |
| O_EXCL | Fail if target already exists | Race-free lock or create | Usually meaningful with O_CREAT. |
| O_TRUNC | Truncate existing regular file | Replace output contents | Data is destroyed at open time. |
| O_TMPFILE | Create unnamed inode in a directory | Build complete file before publish | Filesystem support varies. |
| O_APPEND | Each write appends atomically | Concurrent log writers | lseek does not choose the write offset. |
| O_NONBLOCK | Do not sleep for readiness | Event loops and pipes | Callers must handle EAGAIN. |
| O_SYNC / O_DSYNC | Wait for storage durability | Database journals, critical logs | Latency rises sharply. |
| O_DIRECT | Bypass page cache | Database buffer pools | Buffer, length, and offset alignment are strict. |
| O_CLOEXEC | Set FD_CLOEXEC at creation | Multithreaded programs that fork+exec | Without it, descriptors leak to children. |
| O_NOFOLLOW | Fail if final path component is symlink | Security-sensitive path opens | Only protects the final component. |
| O_PATH | Path reference only | *at() syscalls and metadata | No read or write operations. |
With O_DIRECT, the kernel avoids filling the page cache and asks the block layer to DMA directly into user pages, but it can reject unaligned buffers with EINVAL.
read/write Family
read() and write() may return fewer bytes than requested. pread() and pwrite() use explicit offsets, so they avoid races around lseek() plus I/O.
readv() and writev() amortize syscall overhead by scattering or gathering bytes across multiple buffers in one kernel entry.
copy_file_range() copies bytes between two file descriptors inside the kernel. On supporting filesystems it can avoid a user-space bounce buffer and may turn into a filesystem-level extent copy.
lseek and Sparse Files
lseek() changes the open file description offset. Seeking beyond EOF followed by a small write creates a sparse file: logical size grows, but the hole consumes no blocks until data is actually written there.
Filesystems that support SEEK_DATA and SEEK_HOLE let tools skip holes efficiently, and fallocate() can punch holes while preserving apparent size.
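The seek-then-write pattern is short enough to show in full. This sketch uses a hypothetical function name and /tmp template; it leaves a 1 MiB hole in front of a single byte and hands back the resulting stat data.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates a file with a 1 MiB hole followed by one byte and fills *st.
 * Returns 0 on success, -1 on error. */
int make_sparse(struct stat *st)
{
    char path[] = "/tmp/sparse_demo_XXXXXX";    /* hypothetical scratch path */
    const off_t hole = 1024 * 1024;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    int rc = -1;
    if (lseek(fd, hole, SEEK_SET) == hole &&    /* move past EOF, nothing allocated yet */
        write(fd, "x", 1) == 1 &&               /* logical size becomes hole + 1 */
        fstat(fd, st) == 0)
        rc = 0;
    close(fd);
    return rc;
}
```

After a successful call, st_size is 1048577. On a filesystem that supports holes, st_blocks * 512 should be far smaller than that, but the exact block count depends on the filesystem.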
dup, dup2, dup3
dup() chooses the lowest free descriptor, dup2() targets an exact descriptor and atomically closes it first, and dup3() adds race-free O_CLOEXEC.
Pipes and FIFOs
A pipe is a kernel ring of pipe_buffer entries with two descriptors: one readable end and one writable end. Bytes never live in either process address space unless a reader or writer copies them through a syscall.
Background: Pipes are the default Unix backpressure primitive. Shell pipelines, logging helpers, and parent-child protocols all depend on the same close semantics: no writers means EOF; no readers means SIGPIPE or EPIPE.
Plan: Create the pipe before fork(), close unused ends in both processes, write records no larger than PIPE_BUF when multiple writers share the pipe, and treat a zero-length read as final EOF.
Example: A parent writes two log lines and closes the write end. The child drains the buffer, then its next read returns 0 because the last writer reference is gone.
POSIX requires writes of at most PIPE_BUF bytes (4096 on Linux) to be atomic with respect to other writers, which is why line-oriented logging over a shared pipe works only when each record is no larger than that bound.
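The close rules from the plan above can be sketched as a parent and child pair. The function name is hypothetical; the important detail is that the child closes its inherited copy of the write end, otherwise its own reference would keep EOF from ever arriving.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent writes two log lines and closes; child counts bytes until EOF.
 * Returns 0 if the child saw exactly the bytes written, then a zero-length read. */
int pipe_eof_demo(void)
{
    static const char lines[] = "first line\nsecond line\n";
    const size_t total = sizeof lines - 1;
    int p[2];
    if (pipe(p) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[1]);                            /* drop the child's writer reference */
        char buf[64];
        size_t seen = 0;
        ssize_t n;
        while ((n = read(p[0], buf, sizeof buf)) > 0)
            seen += (size_t)n;
        _exit(n == 0 && seen == total ? 0 : 1); /* n == 0 is EOF */
    }

    close(p[0]);                                /* parent does not read */
    ssize_t w = write(p[1], lines, total);      /* well under PIPE_BUF */
    close(p[1]);                                /* last writer gone: reader will see EOF */

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return (w == (ssize_t)total && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```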
A FIFO gives the same pipe behavior a persistent filesystem name. Opening the read side normally blocks until a writer appears, and opening the write side normally blocks until a reader appears.
Zero-Copy Plumbing
sendfile() lets static servers move file data from the page cache toward a socket without first copying it into a user buffer.
splice(), tee(), and vmsplice() generalize the idea by using a pipe as the page-reference conduit between file descriptors.
fcntl and File Locks
fcntl() is a descriptor control multiplexer. Some commands modify FD-local state like FD_CLOEXEC, while others modify open-file-description state like O_NONBLOCK.
| Command Family | Scope | Typical Use | Pitfall |
|---|---|---|---|
| F_GETFD / F_SETFD | FD slot | Set FD_CLOEXEC | Not shared by dup aliases unless set on each descriptor. |
| F_GETFL / F_SETFL | Open file description | Toggle O_NONBLOCK or O_APPEND | Affects every dup or fork alias of that description. |
| F_SETPIPE_SZ | Pipe object | Increase pipe capacity for bursty producers | Capped by /proc/sys/fs/pipe-max-size and memory limits. |
| F_SETLK / F_SETLKW | Process record locks | Advisory byte-range locking | Closing any FD for that file in the process can drop all locks. |
| F_OFD_SETLK | Open file description locks | Thread-safe byte-range locking | Requires Linux OFD-lock support. |
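The two scopes in the table can be checked with a dup() alias. In this sketch (hypothetical function name), O_NONBLOCK set through one descriptor shows up on the alias, while FD_CLOEXEC set on the same descriptor does not.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 if O_NONBLOCK is shared across dup aliases and FD_CLOEXEC is not. */
int fcntl_scope_demo(void)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;
    int alias = dup(p[0]);                      /* second slot, same description */
    if (alias < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    int rc = -1;
    int fl = fcntl(p[0], F_GETFL);
    if (fl >= 0 &&
        fcntl(p[0], F_SETFL, fl | O_NONBLOCK) == 0 &&   /* description-level flag */
        fcntl(p[0], F_SETFD, FD_CLOEXEC) == 0) {        /* slot-level flag */
        int alias_fl = fcntl(alias, F_GETFL);
        int alias_fd = fcntl(alias, F_GETFD);
        if (alias_fl >= 0 && alias_fd >= 0 &&
            (alias_fl & O_NONBLOCK) && !(alias_fd & FD_CLOEXEC))
            rc = 0;
    }
    close(alias);
    close(p[0]);
    close(p[1]);
    return rc;
}
```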
File locks differ sharply in scope: flock() is whole-file and tied to the open file description, classic POSIX record locks are per-process, and OFD locks are byte-range locks tied to the open file description.
| Lock Type | Granularity | Ownership | Thread-Safe? |
|---|---|---|---|
| flock | whole file | open file description | Yes across separate open() calls; dup and fork aliases share one lock. |
| POSIX fcntl lock | byte range | process plus inode | No; unrelated closes can release locks. |
| OFD fcntl lock | byte range | open file description | Yes; survives unrelated closes. |
The classic POSIX lock trap is closing a second descriptor for the same file and accidentally releasing locks that were acquired through the first descriptor.
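The trap can be reproduced in a few lines. This sketch uses hypothetical names and a /tmp scratch file: the process locks the file through one descriptor, opens and closes a second one, and then a forked child tests whether the lock is still in force.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Non-blocking attempt to take a whole-file POSIX write lock. */
static int try_write_lock(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;     /* l_start = 0 and l_len = 0 cover the whole file */
    return fcntl(fd, F_SETLK, &fl);
}

/* Returns 1 if another process could take the lock after an unrelated close(),
 * 0 if the original lock still held, -1 on error. */
int posix_lock_lost_after_close(void)
{
    char path[] = "/tmp/lock_demo_XXXXXX";      /* hypothetical scratch path */
    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;

    int rc = -1;
    int fd2 = -1;
    if (try_write_lock(fd1) == 0 && (fd2 = open(path, O_RDONLY)) >= 0) {
        close(fd2);             /* releases every POSIX lock this process holds on the file */

        pid_t pid = fork();     /* POSIX locks are not inherited by the child */
        if (pid == 0) {
            int cfd = open(path, O_RDWR);
            if (cfd < 0)
                _exit(2);
            _exit(try_write_lock(cfd) == 0 ? 1 : 0);
        }
        int status = 0;
        if (pid > 0 && waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) <= 1)
            rc = WEXITSTATUS(status);
    }
    close(fd1);
    unlink(path);
    return rc;
}
```

On Linux the function is expected to return 1: fd1 is still open, yet the lock taken through it disappeared when fd2 was closed.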
mmap and Memory-Mapped I/O
mmap() installs a VMA whose page table entries point at file-backed page-cache pages. With MAP_SHARED, stores dirty the shared cache page and can be persisted with msync(); with MAP_PRIVATE, the first write takes a private COW copy.
Background: mmap is attractive when a program wants random access to a file as memory, but the first access to each missing page still has to fault and populate the page cache.
Plan: Create the mapping, touch pages only as needed, use madvise() to describe expected access, and call msync() or fsync() when durability matters.
Example: A parser maps a 200 MB index file, jumps to offset 96 MB, faults one page, and resumes with a normal load instruction after the kernel installs a PTE for that file offset.
The choice between read()/write() and mmap() is a trade-off between explicit syscall/copy control and fault-driven memory access.
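A minimal MAP_SHARED round trip, assuming a 4096-byte scratch file under /tmp (hypothetical name): a plain store goes into the mapping and pread() on the same file reads it back.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if a store through a MAP_SHARED mapping is visible to pread(). */
int mmap_shared_demo(void)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";      /* hypothetical scratch path */
    const size_t len = 4096;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    int rc = -1;
    if (ftruncate(fd, (off_t)len) == 0) {       /* touching pages past EOF would raise SIGBUS */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            memcpy(p + 100, "mapped", 6);       /* first touch faults the page in */
            if (msync(p, len, MS_SYNC) == 0) {  /* request writeback of the dirty page */
                char back[6];
                if (pread(fd, back, 6, 100) == 6 && memcmp(back, "mapped", 6) == 0)
                    rc = 0;
            }
            munmap(p, len);
        }
    }
    close(fd);
    return rc;
}
```

With MAP_PRIVATE in place of MAP_SHARED, the pread() would return the file's original zero bytes instead, because the store would land in a private copy.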
Directory Operations
readdir() is a libc wrapper over buffered directory records returned by getdents64(). Filesystems store variable-length directory entries, often behind an index such as ext4 htree for large directories.
The *at() family solves a real race: once a directory is open as a stable dirfd, relative operations do not depend on re-resolving a parent path that another process can rename.
File Metadata
stat(), fstat(), and lstat() report identity, type, permissions, size, link count, owner, and timestamps; statx() adds masks and optional birth time.
| Time | Updated By | Notes |
|---|---|---|
| atime | Successful file data reads | relatime reduces updates; noatime disables most of them. |
| mtime | Content modification | Changes after write, truncate, mmap dirty writeback, or similar content updates. |
| ctime | Inode status changes | Changes after chmod, chown, link count changes, rename, and any write or truncate that also updates mtime. |
| btime | File creation | Available through statx() only when the filesystem reports it. |
Permissions, Ownership, and ACLs
Unix mode uses four octal digits: one digit for setuid, setgid, and sticky bits, then three rwx triplets for user, group, and other.
A setuid executable runs with the file owner as its effective UID until the process changes it. Modern systems often replace broad setuid root with file capabilities stored in security.capability, while POSIX ACLs live under system.posix_acl_access.
Links and Atomic Path Operations
A hard link is another directory entry pointing at the same inode, so both names share metadata and data and the inode survives until the last link and last open reference disappear.
A symbolic link is a separate inode whose payload is a pathname string. It can cross filesystems and can dangle if the target path is removed.
rename() is atomic within one filesystem, which makes the write-temp, fsync, rename, fsync-directory pattern the standard way to publish a complete replacement file. An open-then-unlink temporary file has no directory name but remains usable until the last descriptor closes.
Filesystem Notification
inotify turns filesystem changes into readable records on a file descriptor. A watcher registers paths with masks such as IN_CREATE, IN_MODIFY, and IN_DELETE, then reads struct inotify_event entries from the queue.
Recursive watching is not automatic: every subdirectory needs its own watch descriptor, and a newly created directory must be detected and registered before events inside it can be observed. Large trees therefore run into limits such as /proc/sys/fs/inotify/max_user_watches.
fanotify sits at a broader filesystem layer and can deliver permission events, which is why antivirus and audit tools can inspect or deny opens before the target process continues.
Special Files for Event Loops
Linux exposes many kernel services as ordinary FDs: eventfd for counters and wakeups, signalfd for signals as records, timerfd for readable timer expirations, and memfd_create for anonymous sealable files.
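As one concrete case, an eventfd in its default (non-semaphore) mode behaves as a 64-bit counter: writes add to it and a read returns the accumulated value and resets it. A minimal sketch with a hypothetical function name:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Posts two wakeups and returns the value a single read() collects, or 0 on error. */
uint64_t eventfd_demo(void)
{
    int efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    if (efd < 0)
        return 0;

    uint64_t add = 3, got = 0;
    if (write(efd, &add, sizeof add) == (ssize_t)sizeof add) {
        add = 4;
        if (write(efd, &add, sizeof add) == (ssize_t)sizeof add &&
            read(efd, &got, sizeof got) != (ssize_t)sizeof got)
            got = 0;                            /* read failed: report error */
    }
    close(efd);
    return got;                                 /* 3 + 4 coalesced into one wakeup */
}
```

Because the descriptor is readable whenever the counter is nonzero, it can be registered with epoll like any socket, and several producer wakeups collapse into one read.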
A sealed memfd is useful when one process wants to share bytes but prevent later mutation: write the payload, add seals such as F_SEAL_WRITE, pass the FD over a Unix socket with SCM_RIGHTS, and let the receiver map it read-only.
Filesystem Containment
Container startup changes the process view of the filesystem with a private mount namespace and pivot_root(): the new root is mounted, the old root is moved under put_old, then the old mount is detached before executing the target program.
chroot() only changes path lookup root. It is not a complete sandbox because a process that kept a directory FD to the old root can climb back out; hardened path opens use directory FDs plus openat2() constraints such as RESOLVE_BENEATH.
Extended Attributes
Extended attributes attach small named byte strings to an inode. The namespace prefix controls who owns the meaning: applications commonly use user.*, ACLs use system.*, and LSMs or capabilities use security.*.
| Namespace | Common Entries | Who Uses It | Notes |
|---|---|---|---|
| user.* | user.tag, user.comment | Applications and users | Controlled by normal file permissions and mount support. |
| system.* | system.posix_acl_access | Kernel and filesystem helpers | Stores ACLs and filesystem-managed metadata. |
| security.* | security.selinux, security.capability | LSMs, file capabilities, IMA | Often requires privilege or LSM policy permission. |
| trusted.* | trusted.overlay.* | Root-only tools and filesystems | Requires CAP_SYS_ADMIN. |
stdio vs Raw File Descriptors
FILE* is a libc buffer and formatting layer around an underlying file descriptor. fileno() exposes that FD, fdopen() wraps an FD as a stream, and freopen() redirects an existing stream such as stdout.
Background: stdio buffering improves throughput, but it means data can sit in user memory while raw write() calls on the same FD pass it and reach the kernel first.
Plan: Pick one abstraction per descriptor, flush before mixing layers, and flush all inherited streams before fork() when both parent and child might exit through libc.
Example: A process prints "pending" without a newline, forks, and both processes later call exit(). The fork copied the unflushed user-space buffer along with the rest of the address space, so both processes flush the same text and it appears twice.
| Buffer Mode | Common Default | Flush Trigger | Risk |
|---|---|---|---|
| _IOFBF | regular files and redirected stdout | buffer full, fflush, fclose, exit | Raw writes can appear before earlier printf bytes. |
| _IOLBF | stdout connected to a tty | newline, buffer full, explicit flush | Prompt text without newline can remain hidden. |
| _IONBF | stderr on many systems | each stdio call writes promptly | More syscalls and lower throughput. |
tty and pty
A tty is a terminal device with a kernel line discipline that can echo input, edit cooked-mode lines, and translate control characters such as Ctrl-C into signals. A pty pair splits that terminal into a controller-facing master FD and a program-facing slave FD.
Terminal hosts such as sshd, tmux, IDE panes, and expect use forkpty() or posix_openpt() plus grantpt(), unlockpt(), and ptsname(). The child sees the slave as a real controlling terminal while the host reads and writes the master.
/proc and /sys File Interfaces
/proc is a virtual filesystem generated by the kernel at read time. /proc/self/maps exposes VMAs, /proc/self/fd exposes open descriptors as symlinks, and /proc/sys exposes sysctl tunables.
/sys is sysfs: a kobject attribute tree. Device drivers publish small attributes as files, so reading /sys/class/net/eth0/operstate samples link state and writing mtu calls the driver setter path.
epoll Deep Dive
epoll keeps persistent kernel state: an interest set for registered FDs and a ready list for events already observed. epoll_wait() returns ready entries instead of rescanning every FD like select() or poll().
Background: Edge-triggered epoll is fast because it wakes on readiness transitions, but that means a half-drained FD may never produce another edge.
Plan: Put every ET FD in nonblocking mode, handle the readiness event, then accept or read in a loop until EAGAIN. Use EPOLLONESHOT when worker threads need explicit re-arming and EPOLLEXCLUSIVE to avoid waking many accept waiters for one connection.
Example: A socket receives 6 KB. The event loop reads only 1 KB and returns to epoll_wait(). Because the FD is still ready and no new not-ready-to-ready transition happened, the remaining 5 KB can stall forever.
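The stall can be reproduced with a Unix socket pair standing in for the network connection. The function name is hypothetical; the sketch reads 1 KB after the first event, polls again without blocking, and only then drains to EAGAIN.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the event count of the second epoll_wait() after a partial read
 * (expected 0 under EPOLLET) and stores the bytes drained afterwards.
 * Returns -1 on error. */
int epoll_et_demo(size_t *drained)
{
    static char payload[6144];
    char buf[1024];
    struct epoll_event ev = {0}, got;
    int sv[2], ep = -1, rc = -1, fl;

    *drained = 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    fl = fcntl(sv[0], F_GETFL);
    if (fl < 0 || fcntl(sv[0], F_SETFL, fl | O_NONBLOCK) < 0)   /* ET needs nonblocking */
        goto done;
    if ((ep = epoll_create1(0)) < 0)
        goto done;
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = sv[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev) < 0)
        goto done;

    memset(payload, 'x', sizeof payload);
    if (write(sv[1], payload, sizeof payload) != (ssize_t)sizeof payload)
        goto done;
    if (epoll_wait(ep, &got, 1, 1000) != 1)                     /* the one edge */
        goto done;
    if (read(sv[0], buf, sizeof buf) != (ssize_t)sizeof buf)    /* take only 1 KB */
        goto done;

    rc = epoll_wait(ep, &got, 1, 0);            /* data remains, but no new edge */

    for (;;) {                                  /* the correct handler: drain to EAGAIN */
        ssize_t n = read(sv[0], buf, sizeof buf);
        if (n > 0) {
            *drained += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n == 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            rc = -1;                            /* unexpected EOF or error */
        break;
    }
done:
    if (ep >= 0)
        close(ep);
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

With EPOLLET removed from the event mask, the second epoll_wait() would report the descriptor again, which is the level-triggered behavior.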
| API | Per-Wait Cost | Scaling | Best Use |
|---|---|---|---|
| select | copy and scan bitsets | limited by FD_SETSIZE (1024 on glibc) | Small portable programs. |
| poll | copy and scan array | O(N) per wait | Moderate FD counts with portable semantics. |
| epoll LT | returns ready list | O(ready) | Safe high-FD-count event loops. |
| epoll ET | returns readiness edges | O(ready transitions) | High-performance loops that drain to EAGAIN. |
io_uring Deep Dive
io_uring maps two rings into user space: the submission queue carries SQEs from user to kernel, and the completion queue carries CQEs from kernel to user. In the steady state, many operations can be submitted and completed with fewer syscalls than epoll plus separate reads or writes.
With IORING_SETUP_SQPOLL, a kernel polling thread watches the submission queue and starts work without the application entering the kernel for every batch. The trade-off is a CPU-consuming poller and stricter setup/permission constraints.
Registered buffers skip repeated page pinning, registered files skip repeated FD lookup, multishot operations can produce many completions from one submission, and IOSQE_IO_LINK expresses ordered chains such as open, read, and close.
| Model | What Is Submitted | Syscall Pattern | Where It Wins |
|---|---|---|---|
| epoll + read/write | readiness interest, then synchronous operations | wait syscall plus I/O syscalls | Socket servers with simple operations and broad portability. |
| io_uring | actual operations: read, write, accept, fsync, openat, splice | batched enter, or SQPOLL hot path | High-throughput mixed file/network I/O with batching and fixed resources. |
| libaio / KAIO | direct-I/O requests | submit and reap syscalls | Legacy database O_DIRECT workloads; poor fit for buffered I/O. |
4. Minimal C Demo
These demos isolate the interview-critical behavior: shared offsets, atomic temporary creation, short-write loops, sparse files, shell-style redirection, mmap persistence, notification FDs, stdio buffering, ptys, epoll drain rules, and io_uring ring mechanics.
5. Kernel Source Pointers
| Topic | Files and Functions | What to Read For |
|---|---|---|
| FD table and open | fs/open.c, do_sys_openat2(), do_filp_open() | How path lookup creates a struct file and installs it into the descriptor table. |
| Descriptor allocation | fs/file.c, alloc_fd(), fd_install(), do_close_on_exec() | FD bitmaps, close-on-exec state, and fork/exec descriptor handling. |
| read/write syscalls | fs/read_write.c, ksys_read(), ksys_write(), vfs_read(), vfs_write() | Short count behavior, position updates, and vector I/O entry points. |
| dup family | fs/file.c, do_dup2(), replace_fd() | How a descriptor slot is made to reference an existing open file description. |
| Sparse files | fs/read_write.c, vfs_llseek(); filesystem llseek methods | Where SEEK_DATA and SEEK_HOLE are delegated to filesystem code. |
| Pipes and FIFOs | fs/pipe.c, do_pipe2(), pipe_read(), pipe_write() | Pipe ring accounting, EOF and broken-pipe rules, and pipe capacity changes. |
| Zero-copy plumbing | fs/read_write.c, do_sendfile(); fs/splice.c, do_splice(), do_tee() | How page references move between files, pipes, and sockets without user-space buffers. |
| fcntl and locks | fs/fcntl.c, do_fcntl(); fs/locks.c, fcntl_setlk(), flock_lock_inode() | Command dispatch, FD versus file-description scope, and lock ownership semantics. |
| mmap | mm/mmap.c, do_mmap(); mm/filemap.c, filemap_fault() | How VMAs are installed and how file-backed page faults populate page cache pages. |
| Directory reads and path lookup | fs/readdir.c, iterate_dir(); fs/namei.c, path_openat() | Directory iteration, getdents64, and relative lookup through openat. |
| Metadata and permissions | fs/stat.c, vfs_statx(); fs/attr.c, notify_change() | How stat/statx fields are collected and how chmod/chown update inode attributes. |
| Links and rename | fs/namei.c, vfs_link(), vfs_symlink(), vfs_rename() | Hard link counts, symlink creation, and same-filesystem atomic rename behavior. |
| Filesystem notification | fs/notify/, fsnotify(); fs/notify/inotify/; fs/notify/fanotify/ | How VFS events become queued records and how fanotify permission events can block opens. |
| Special event FDs | fs/eventfd.c, fs/timerfd.c, fs/signalfd.c, mm/memfd.c | Counter wakeups, timer expiration reads, signal records, and memfd sealing rules. |
| Root switching and path containment | fs/open.c, ksys_chroot(), openat2(); fs/namespace.c, pivot_root(); fs/namei.c | Why chroot changes lookup state, how pivot_root moves mounts, and how openat2 resolver flags reject escapes. |
| Extended attributes | fs/xattr.c, vfs_setxattr(), vfs_getxattr(), listxattr() | Namespace permission checks, filesystem callbacks, and ACL/capability storage. |
| stdio wrappers | glibc/libio/, _IO_file_xsputn(), _IO_new_file_write() | How libc buffering batches user writes before calling the kernel write path. |
| tty and pty | drivers/tty/tty_io.c, drivers/tty/pty.c, drivers/tty/n_tty.c | TTY allocation, pseudo-terminal master/slave plumbing, and line discipline behavior. |
| /proc and /sys | fs/proc/, proc_pid_make_inode(); fs/sysfs/, sysfs_create_file_ns() | How virtual process files and kobject attributes are generated on demand. |
| epoll | fs/eventpoll.c, do_epoll_create(), ep_insert(), ep_poll() | Interest tree management, ready-list wakeups, edge-triggered delivery, and exclusive waits. |
| io_uring | io_uring/io_uring.c, io_uring_setup(), io_submit_sqes(), io_cqring_ev_posted() | Shared SQ/CQ ring setup, SQE consumption, CQE publication, SQPOLL, and registered resources. |
6. Interview Prep
| Question | Concise Answer |
|---|---|
| What is shared by dup()? | The new descriptor points to the same open file description, so file offset and status flags are shared. |
| Why is pread() safer than lseek() plus read()? | It performs I/O at an explicit offset without changing the shared file offset, avoiding races between threads or forked children. |
| What does O_APPEND guarantee? | For regular files, the kernel moves each write to EOF atomically with the write operation, so concurrent writers do not overwrite each other. |
| Why use O_CLOEXEC instead of fcntl(F_SETFD) after open? | It closes the race where another thread forks and execs between open and the later fcntl call. |
| What is a sparse file? | A file whose logical size includes holes that read as zeros but have no allocated disk blocks until real data is written. |
| What happens when the last pipe writer closes? | After buffered bytes are drained, readers get a zero-length read, which is EOF. |
| What does PIPE_BUF guarantee? | Concurrent writes of at most PIPE_BUF bytes to a pipe are atomic; larger writes may interleave. |
| Why does sendfile() help static file servers? | It avoids copying file bytes into a user-space buffer before sending them to the socket. |
| Which fcntl() flags are FD-local versus description-local? | FD_CLOEXEC is FD-local; status flags like O_NONBLOCK and O_APPEND live on the open file description. |
| Why are POSIX record locks dangerous in multithreaded code? | They are process-owned, so closing any descriptor for the same file in that process can release all its record locks. |
| How does MAP_SHARED differ from MAP_PRIVATE? | MAP_SHARED stores modify shared file-backed pages; MAP_PRIVATE writes trigger copy-on-write and do not update the file. |
| Why use openat() with a directory FD? | It anchors lookup to an already-open directory, avoiding races where the parent path is renamed or replaced. |
| What updates atime, mtime, and ctime? | Reads update atime, content changes update mtime, and inode metadata or size changes update ctime. |
| Hard link versus symlink? | A hard link is another name for the same inode on the same filesystem; a symlink is a separate inode containing a path string and can cross filesystems. |
| Why does atomic file update fsync both file and directory? | The file fsync persists contents; the directory fsync persists the renamed directory entry after publication. |
| inotify versus fanotify? | inotify watches paths and reports events after they happen; fanotify can observe broader filesystem activity and can issue permission events that allow or deny opens. |
| Why are eventfd, signalfd, and timerfd useful? | They convert wakeups, signals, and timers into readable FDs, so one epoll loop can handle them uniformly with sockets and pipes. |
| What does a sealed memfd buy you? | It gives processes shared anonymous file-backed memory that can be made immutable before the FD is handed to another process. |
| How can chroot() be escaped? | If a process keeps a directory FD outside the jail, it can fchdir() back and walk out; mount namespaces, pivot_root cleanup, and constrained openat2 lookups close that class of bug. |
| What belongs in security.* xattrs? | Security labels and enforcement metadata such as SELinux labels, IMA state, and file capabilities. |
| Why can mixing printf() and write() reorder output? | printf() writes into a user-space FILE* buffer, while write() enters the kernel immediately unless the stdio buffer is flushed first. |
| What is the fork double-flush pitfall? | Unflushed stdio buffers are copied into the child, so both parent and child can flush the same pending bytes during exit(). |
| What is a pty master versus slave? | The slave behaves like a terminal for the child program; the master is held by the controller that feeds input and reads terminal output. |
| How does /proc/self/fd help debug FD leaks? | It exposes each live descriptor as a symlink to its target, making leaked files, sockets, pipes, and deleted-but-open files visible. |
| epoll LT versus ET? | Level-triggered epoll keeps reporting an FD while it remains ready; edge-triggered epoll reports readiness transitions, so nonblocking handlers must drain to EAGAIN. |
| What does EPOLLONESHOT solve? | It disables the FD after one event so a worker can process it exclusively, then re-arm it with EPOLL_CTL_MOD. |
| What does EPOLLEXCLUSIVE solve? | It prevents a thundering herd by waking only one epoll waiter for a shared ready source such as a listening socket. |
| Why did io_uring replace libaio for many workloads? | io_uring supports a broader operation set, buffered I/O, sockets, batching, linked operations, fixed buffers/files, and shared-ring completions instead of the narrow KAIO O_DIRECT focus. |
| Why is SQPOLL called a zero-syscall hot path? | The application advances the shared SQ tail and a kernel polling thread consumes submissions without requiring io_uring_enter() for each batch. |