Part XXV — I/O Multiplexing

§ 37 — fd → file → socket → wait_queue Chain

Process fd table & fget() O(1) RCU lookup (§37.1) · struct file: f_op vtable, private_data, f_count (§37.2) · struct socket vs struct sock: two-layer split, sk_wq (§37.3) · vfs_poll() & tcp_poll(): select per-call vs epoll permanent registration (§37.4) · f_count lifecycle, dup() / fork() sharing, EPOLLRDHUP (§37.5)

§ 37.1 — The Process fd Table

Every process carries an open file table rooted at task_struct.files. Translating a raw integer fd into a kernel object is a three-pointer dereference ending in an O(1) array lookup — the entire operation runs under a single RCU read lock with no contention in the common case.

task_struct → fdtable Chain

The fdtable is inline for processes with few open files and heap-allocated when the fd count grows past the initial capacity. fd[5] points directly to the struct file for a TCP socket.

fget(5) — Lock-Free fd Lookup

fget() is the kernel's primary fd-to-file translation. It reads the fd table under RCU (Read-Copy-Update) instead of taking a spinlock, so the common case is lock-free. Before returning, it increments f_count so the file object stays alive while the caller holds the reference; the caller later drops that reference with fput().

fget() Source (fs/file.c)

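struct file *fget(unsigned int fd) {
    struct file *f = NULL;   /* stays NULL when fd is out of range */
    struct fdtable *fdt;

    rcu_read_lock();
    fdt = rcu_dereference_raw(current->files->fdt);
    if (fd < fdt->max_fds) {
        f = rcu_dereference_raw(fdt->fd[fd]); /* O(1) array lookup */
        if (f)
            get_file(f);  /* f_count++ (mainline increments only if the count is still nonzero) */
    }
    rcu_read_unlock();
    return f;             /* NULL if the slot is empty or out of range */
}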

| Struct | Key field | Purpose |
|---|---|---|
| task_struct | files: struct files_struct * | Per-process open file table root |
| files_struct | fdt: struct fdtable * | Current fd table (inline for small counts, heap for large) |
| fdtable | fd[]: struct file *[] | O(1) fd → file pointer mapping |
| fdtable | open_fds: bitmap | Which fd slots are currently open |
| fdtable | max_fds: unsigned int | Current array capacity (grows dynamically) |
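
The lookup described in this section can be modeled in userspace. The following is a minimal sketch, not kernel code: toy_file, toy_fdtable, toy_fget(), and toy_fput() are hypothetical stand-ins, and the RCU protection and atomic counting of the real path are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's definitions. */
struct toy_file {
    long f_count;          /* reference count */
    void *private_data;    /* concrete object, e.g. a socket */
};

struct toy_fdtable {
    unsigned int max_fds;  /* capacity of fd[] */
    struct toy_file **fd;  /* fd number -> file pointer */
};

/* Bounds-check, index, and take a reference.
 * Returns NULL if the fd is out of range or the slot is empty. */
struct toy_file *toy_fget(struct toy_fdtable *fdt, unsigned int fd)
{
    struct toy_file *f = NULL;

    if (fd < fdt->max_fds) {
        f = fdt->fd[fd];       /* O(1) array lookup */
        if (f)
            f->f_count++;      /* caller now holds a reference */
    }
    return f;
}

/* Drop a reference taken by toy_fget(). */
void toy_fput(struct toy_file *f)
{
    assert(f->f_count > 0);
    f->f_count--;
}
```

With a table of capacity 8 and a file installed in slot 5, toy_fget(&fdt, 5) returns that file and raises f_count from 1 to 2, while slot 6 and fd 99 both yield NULL.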

§ 37.2 — struct file: The VFS Layer

struct file is the VFS abstraction that sits between the raw fd integer and any concrete resource: a socket, a regular file, a pipe, a device node. Every open file descriptor refers to exactly one struct file, although several descriptors can share the same one (§37.5). The two fields that matter most for networking are f_op (the vtable) and private_data (the concrete object pointer).

struct file Field Layout

f_op Vtable — Polymorphic Dispatch

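vfs_poll(file, pt) calls file->f_op->poll(file, pt). The same call site reaches sock_poll() for a socket, pipe_poll() for a pipe, and eventfd_poll() for an eventfd. A file type that provides no poll method, such as a regular file, never reaches a driver: vfs_poll() returns DEFAULT_POLLMASK (always readable and writable), and epoll_ctl() refuses to add such a file with EPERM. This is how epoll monitors heterogeneous fd types through a single interface.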

| Field | Type | Value for a socket |
|---|---|---|
| f_op | const struct file_operations * | &socket_file_ops |
| f_inode | struct inode * | Socket inode (type S_IFSOCK) |
| private_data | void * | struct socket *, the actual socket object |
| f_flags | unsigned int | O_NONBLOCK etc., set by fcntl() |
| f_count | atomic_long_t | Reference count; the file is freed when fput() drops it to 0 |
| f_pos | loff_t | File offset; meaningless for sockets, always 0 |
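
The vtable dispatch can be illustrated with a small userspace model. This is a sketch under stated assumptions: every toy_ name and TOY_ constant is hypothetical, the event bits are not the kernel's values, and the poll_table argument is omitted. It shows one call site reaching a type-specific poll function, with a default mask when the type provides none.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event bits and default mask; the kernel's values differ. */
#define TOY_POLLIN   0x1u
#define TOY_POLLOUT  0x4u
#define TOY_DEFAULT_POLLMASK (TOY_POLLIN | TOY_POLLOUT)

struct toy_file;

struct toy_file_operations {
    unsigned int (*poll)(struct toy_file *file);  /* NULL if unsupported */
};

struct toy_file {
    const struct toy_file_operations *f_op;  /* vtable */
    void *private_data;                      /* concrete object */
};

/* "Socket" backend: private_data points to a count of queued bytes. */
unsigned int toy_sock_poll(struct toy_file *file)
{
    const int *queued = file->private_data;
    unsigned int mask = TOY_POLLOUT;         /* always writable in this toy */

    assert(queued != NULL);
    if (*queued > 0)
        mask |= TOY_POLLIN;
    return mask;
}

const struct toy_file_operations toy_socket_file_ops = { .poll = toy_sock_poll };
const struct toy_file_operations toy_nopoll_file_ops = { .poll = NULL };

/* Single call site: dispatch through the vtable, or fall back to the default. */
unsigned int toy_vfs_poll(struct toy_file *file)
{
    if (!file->f_op->poll)
        return TOY_DEFAULT_POLLMASK;
    return file->f_op->poll(file);
}
```

The caller of toy_vfs_poll() never inspects the file type; adding a new backend means supplying another toy_file_operations instance, which is the property that lets one poll loop serve sockets, pipes, and other descriptors.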

sock_poll Entry Point

/* net/socket.c: VFS poll entry for all sockets (busy-poll handling omitted) */
static __poll_t sock_poll(struct file *file, poll_table *wait)
{
    struct socket *sock = file->private_data;  /* plain pointer load, no lookup */
    const struct proto_ops *ops = READ_ONCE(sock->ops);

    if (!ops->poll)
        return 0;  /* protocol has no poll method: report no events */

    return ops->poll(file, sock, wait);  /* e.g. tcp_poll() */
}

/* How epoll/select reaches tcp_poll():
 *   vfs_poll(file, pt)
 *     → file->f_op->poll()   = sock_poll()
 *       → sock->ops->poll()  = tcp_poll()  (for TCP sockets)
 *         → poll_wait() + readiness bitmask
 */

§ 37.3 — struct socket & struct sock

A socket is represented by two kernel structs. struct socket is the POSIX/BSD-facing layer — it holds the state and vtable that userspace syscalls touch. struct sock is the protocol-internal layer — it owns the receive and send queues, TCP state machine, and the wait queue where blocking callers sleep. The two are linked by socket.sk.

Two-Level Socket Structure

Full fd → wait_queue Chain

Every hop in this chain is a single pointer dereference. The final destination — wait_queue_head_t — is where both select's poll_table_entry and epoll's eppoll_entry are installed. Knowing this chain answers the interview question: “how does an arriving packet eventually wake a sleeping process?”

Struct Definitions (abbreviated)

/* include/linux/net.h */
struct socket {
    socket_state        state;       /* SS_CONNECTED, SS_UNCONNECTED ... */
    short               type;        /* SOCK_STREAM, SOCK_DGRAM ... */
    const struct proto_ops *ops;     /* vtable: bind/connect/accept/poll */
    struct file        *file;        /* back-pointer to struct file */
    struct sock        *sk;          /* → protocol layer */
    struct socket_wq   *wq;          /* wait queue + fasync; newer kernels embed it by value */
};

/* include/net/sock.h */
struct sock {
    struct sk_buff_head  sk_receive_queue; /* received skbs */
    struct sk_buff_head  sk_write_queue;   /* outbound skbs */
    struct socket_wq    *sk_wq;           /* same wq as socket.wq */
    void (*sk_data_ready)(struct sock *); /* = sock_def_readable */
    unsigned char        sk_state;        /* TCP_ESTABLISHED etc. */
    int                  sk_sndbuf;
    int                  sk_rcvbuf;
};

/* include/linux/net.h */
struct socket_wq {
    wait_queue_head_t  wait;        /* where waiters sleep */
    struct fasync_struct *fasync_list; /* SIGIO async notification */
};

| Field | Struct | Purpose |
|---|---|---|
| socket.ops | struct socket | proto_ops vtable: POSIX interface (bind/connect/accept/poll) |
| socket.sk | struct socket | Pointer to protocol-layer sock |
| socket.wq | struct socket | Wait queue + async notification (shared with sock.sk_wq) |
| sock.sk_receive_queue | struct sock | sk_buff linked list: received data awaiting read() |
| sock.sk_wq | struct sock | Wait queue + async notification (same object as socket.wq) |
| sock.sk_data_ready | struct sock | Callback on data arrival; default: sock_def_readable() |
| socket_wq.wait | struct socket_wq | wait_queue_head_t: where eppoll_entry / poll_table_entry attach |
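
To make the chain concrete, here is a minimal userspace model of the hops from a file to its wait queue and of the wake-up that follows a packet arrival. It is a sketch, not the kernel's data structures: all toy_ names are hypothetical, the receive queue is reduced to a counter, and "waking" a waiter just sets a flag.

```c
#include <assert.h>
#include <stddef.h>

struct toy_waiter {
    int woken;                   /* set when the wait queue is woken */
    struct toy_waiter *next;
};

struct toy_socket_wq { struct toy_waiter *head; };

struct toy_sock {
    int rx_queued;               /* stands in for sk_receive_queue length */
    struct toy_socket_wq *sk_wq;
    void (*sk_data_ready)(struct toy_sock *sk);
};

struct toy_socket { struct toy_sock *sk; };
struct toy_file   { void *private_data; };

/* file -> socket -> sock -> wait queue, one dereference per hop. */
struct toy_socket_wq *toy_file_to_wq(struct toy_file *file)
{
    struct toy_socket *sock = file->private_data;
    return sock->sk->sk_wq;
}

/* Install a waiter (the role played by poll_table_entry or eppoll_entry). */
void toy_add_waiter(struct toy_socket_wq *wq, struct toy_waiter *w)
{
    w->woken = 0;
    w->next = wq->head;
    wq->head = w;
}

/* Default data-ready callback: mark every waiter on the queue as woken. */
void toy_def_readable(struct toy_sock *sk)
{
    for (struct toy_waiter *w = sk->sk_wq->head; w; w = w->next)
        w->woken = 1;
}

/* Receive path: queue the data, then invoke the callback. */
void toy_packet_arrives(struct toy_sock *sk)
{
    assert(sk->sk_data_ready != NULL);
    sk->rx_queued++;
    sk->sk_data_ready(sk);
}
```

Two waiters installed through toy_file_to_wq() are both flagged after a single toy_packet_arrives() call, which mirrors how one wait queue serves select and epoll at the same time.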

§ 37.4 — vfs_poll() & tcp_poll()

vfs_poll() is the single call site that bridges the VFS layer and any underlying resource. For a TCP socket it dispatches to sock_poll(), which calls tcp_poll(); tcp_poll() both registers a wait entry and returns a readiness bitmask. The crucial difference between select and epoll is when that registration happens: once per call for select, once per epoll_ctl(ADD) for epoll.

do_select → vfs_poll Call Chain

On each pass over its fd sets, do_select() calls vfs_poll() for every fd. Inside tcp_poll(), poll_wait() adds a poll_table_entry to the socket's wait queue on the first pass only; later passes just re-check readiness. All entries are removed when select() returns.
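
From userspace, the per-call nature of select() is visible in how it must be invoked. The sketch below is illustrative only: wait_readable() and select_demo() are hypothetical helper names, and an AF_UNIX socketpair stands in for a TCP connection so the example needs no network setup.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    assert(timeout_ms >= 0);
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;

    /* select() overwrites the set (and, on Linux, the timeout),
     * so both are rebuilt on every call. */
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
        return -1;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}

/* Hypothetical demo: returns 0 when both calls behave as expected. */
int select_demo(void)
{
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int before = wait_readable(sv[0], 20);    /* nothing queued: times out */
    int wrote = (write(sv[1], "x", 1) == 1);
    int after = wait_readable(sv[0], 20);     /* one byte queued: readable */

    close(sv[0]);
    close(sv[1]);
    return (before == 0 && wrote && after == 1) ? 0 : 1;
}
```

Every call to wait_readable() hands the kernel the full interest set again; nothing about the previous call is remembered, which is the userspace counterpart of the add-then-remove cycle on the wait queue.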

epoll_ctl ADD Call Chain

epoll_ctl(ADD) also calls vfs_poll(), but the poll_table's _qproc is set to ep_ptable_queue_proc — which installs an eppoll_entry that stays on the wait queue. No add/remove on each epoll_wait().
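
The userspace view of this one-time registration can be sketched as follows. The function name epoll_register_once_demo() is hypothetical, and an AF_UNIX socketpair stands in for a TCP socket; the example is Linux-specific because epoll is.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Register one end of a socketpair once, then wait twice.
 * Returns how many epoll_wait() calls reported EPOLLIN for that fd,
 * or -1 on setup failure.
 */
int epoll_register_once_demo(void)
{
    int sv[2], epfd, hits = 0;
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event got;
    char byte;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) < 0) {
        hits = -1;
        goto cleanup;
    }

    for (int round = 0; round < 2; round++) {
        if (write(sv[1], "x", 1) != 1)
            break;
        /* No epoll_ctl() here: the single ADD above is still in effect. */
        if (epoll_wait(epfd, &got, 1, 1000) == 1 &&
            got.data.fd == sv[0] && (got.events & EPOLLIN))
            hits++;
        if (read(sv[0], &byte, 1) != 1)   /* drain before the next round */
            break;
    }

cleanup:
    close(epfd);
    close(sv[0]);
    close(sv[1]);
    return hits;
}
```

A caller can check the behavior with assert(epoll_register_once_demo() == 2): both waits report the fd even though epoll_ctl() ran only once.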

Demo 3518 — epoll_ctl ADD Step-by-Step

An interactive walkthrough of the seven kernel actions that run inside a single epoll_ctl(EPOLL_CTL_ADD) call, from the syscall entry to the final RB-tree insert. It shows when the eppoll_entry lands on sk_wq and when the epitem appears in the tree.

Step 0: Initial state

epoll fd=3 exists. The RB-tree is empty. Socket fd=5 is not yet monitored.

/* epoll fd=3, tcp socket fd=5 */
int epfd = epoll_create1(EPOLL_CLOEXEC); /* fd=3 */

eppoll_entry on sk_wq: not yet installed
epitem in RB-tree: not yet inserted

Key difference from select

select, via poll_wait(): called at sleep time, on every call. add_wait_queue() and remove_wait_queue() are repeated N×M times (N fds, M calls/sec).

epoll, via ep_ptable_queue_proc(): called once, at epoll_ctl(ADD). The eppoll_entry stays on sk_wq until the registration is deleted or the file is released, so epoll_wait() pays no per-call registration cost.

§ 37.5 — Reference Counting & fd Sharing

struct file is reference-counted via f_count (an atomic_long_t). The file object is freed only when the count reaches zero — not when any particular fd is closed. This has a non-obvious consequence for epoll: the epitem key is {file*, fd}, and the file can stay alive long after the original fd is closed if other fds share it via dup() or fork().

f_count Lifecycle

The dup() + epoll Bug

After dup(5), both fd=5 and fd=6 point to the same struct file. close(5) decrements f_count to 1, so the file object survives via fd=6. The epoll epitem, keyed by {file*, fd}, remains valid and continues firing events even though fd=5 is no longer usable. Once fd=5 is closed, epoll_ctl(epfd, EPOLL_CTL_DEL, 5, NULL) fails with EBADF, so the only safe fix is to issue that EPOLL_CTL_DEL before close(5).
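
The behavior can be reproduced from userspace. The following sketch uses a hypothetical function name, dup_close_demo(), and an AF_UNIX socketpair in place of a TCP socket; it registers an fd, duplicates it, closes the original, and then checks whether epoll still reports the stale fd number.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns 1 if epoll still reports an event for a registration whose
 * original fd was closed while a dup()'d fd kept the file alive,
 * 0 if no such event arrived, -1 on setup failure.
 */
int dup_close_demo(void)
{
    int sv[2], epfd = -1, copy = -1, result = -1;
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event got;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    int old_fd = sv[0];

    epfd = epoll_create1(EPOLL_CLOEXEC);
    copy = dup(sv[0]);                  /* second fd, same struct file */
    if (epfd < 0 || copy < 0)
        goto cleanup;

    ev.data.fd = old_fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, old_fd, &ev) < 0)
        goto cleanup;

    close(sv[0]);                       /* file survives through `copy` */
    sv[0] = -1;

    if (write(sv[1], "x", 1) != 1)
        goto cleanup;

    /* The registration is still live and carries the stale fd number. */
    result = (epoll_wait(epfd, &got, 1, 1000) == 1 &&
              got.data.fd == old_fd) ? 1 : 0;

cleanup:
    if (epfd >= 0)  close(epfd);
    if (copy >= 0)  close(copy);
    if (sv[0] >= 0) close(sv[0]);
    close(sv[1]);
    return result;
}
```

Moving an epoll_ctl(epfd, EPOLL_CTL_DEL, old_fd, NULL) call in front of close(sv[0]) removes the registration, after which the same epoll_wait() would time out instead.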

EPOLLRDHUP — Detecting Remote Half-Close

/* Detect remote half-close (shutdown(fd, SHUT_WR) on the other end) */
struct epoll_event ev = {
    .events = EPOLLIN | EPOLLRDHUP,
    .data.fd = fd,
};
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

/*
 * Without EPOLLRDHUP:
 *   epoll_wait returns EPOLLIN; you call read(); read() returns 0 → detect close
 *
 * With EPOLLRDHUP:
 *   epoll_wait returns EPOLLIN | EPOLLRDHUP immediately
 *   No extra read() needed to detect the half-close
 *   Catches TCP FIN from the remote peer (shutdown(SHUT_WR))
 */
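
A runnable version of the same idea is sketched below. It assumes an AF_UNIX stream socketpair as a stand-in for a TCP connection (so no FIN is actually sent; the local shutdown plays that role), and rdhup_demo() is a hypothetical name.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns 1 if EPOLLRDHUP is reported after the peer shuts down its
 * write side, 0 if the event lacks that bit, -1 on failure.
 */
int rdhup_demo(void)
{
    int sv[2], epfd, result = -1;
    struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP };
    struct epoll_event got;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
        goto cleanup;

    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) < 0)
        goto cleanup;

    /* Peer half-closes: it will send no more data. */
    if (shutdown(sv[1], SHUT_WR) < 0)
        goto cleanup;

    if (epoll_wait(epfd, &got, 1, 1000) == 1)
        result = (got.events & EPOLLRDHUP) ? 1 : 0;

cleanup:
    if (epfd >= 0)
        close(epfd);
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

The half-close is detected from the event mask alone; no read() returning 0 is needed to learn that the peer has finished sending.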

| Operation | f_count change | epoll effect |
|---|---|---|
| open() / socket() | → 1 | |
| dup(fd) | +1 (two fds share same struct file) | Two fds share same epitem; both trigger it |
| fork() | +1 per open fd | Child's inherited fds also monitored by parent's epoll |
| close(fd) with f_count > 1 | −1, file stays alive | epoll continues monitoring; epitem not removed |
| close(fd) with f_count → 0 | → 0, file freed | Kernel auto-removes epitem from RB-tree |
| epoll_ctl(DEL, fd) | no change | epitem explicitly removed; safe even if f_count > 1 |

fput() — Decrement and Free

/* fs/file_table.c (simplified; mainline defers the final cleanup to __fput()) */
void fput(struct file *file)
{
    if (atomic_long_dec_and_test(&file->f_count)) {
        /* f_count reached 0: final cleanup, in this order */
        /* 1. eventpoll_release(file): drop any epitems still registered */
        /* 2. call the release hook */
        if (file->f_op->release)
            file->f_op->release(file->f_inode, file); /* sock_close() for sockets */
        /* 3. dentry put, inode release, memory freed */
    }
}

/* include/linux/fs.h: get_file() is the inverse, used by fget() and dup() */
static inline struct file *get_file(struct file *f)
{
    atomic_long_inc(&f->f_count);
    return f;
}
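
The get/put pairing can be exercised in userspace with C11 atomics. This is a minimal sketch with hypothetical names (toy_file, toy_file_init(), toy_get_file(), toy_fput()); the release hook is reduced to setting a flag, and nothing is actually freed.

```c
#include <assert.h>
#include <stdatomic.h>

struct toy_file {
    atomic_long f_count;   /* reference count */
    int released;          /* set once the last reference is dropped */
};

/* A freshly opened file starts with one reference (the fd itself). */
void toy_file_init(struct toy_file *f)
{
    atomic_init(&f->f_count, 1);
    f->released = 0;
}

/* Take an extra reference, as dup() or fork() would. */
struct toy_file *toy_get_file(struct toy_file *f)
{
    atomic_fetch_add(&f->f_count, 1);
    return f;
}

/* Drop a reference. Returns 1 if this call dropped the last one. */
int toy_fput(struct toy_file *f)
{
    /* atomic_fetch_sub returns the previous value; 1 means we hit zero. */
    if (atomic_fetch_sub(&f->f_count, 1) == 1) {
        f->released = 1;   /* stands in for release hook + free */
        return 1;
    }
    return 0;
}
```

Modeling dup() followed by two close() calls: after toy_file_init() and one toy_get_file(), the first toy_fput() returns 0 and leaves released at 0; only the second returns 1 and runs the cleanup.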

Interview Prep — Must-Know Questions

| Question | Key answer |
|---|---|
| How does the kernel go from fd=5 to struct file *? | current->files->fdt->fd[5]: an O(1) array lookup under the RCU read lock. fget() increments f_count so the file object stays alive during the operation. |
| What does private_data in a socket's struct file point to? | struct socket *, the POSIX/BSD-facing socket object. It is the bridge from the VFS layer to the protocol stack. |
| Difference between struct socket and struct sock? | struct socket: POSIX/BSD API layer (ops vtable, state, wq, back-pointer to file). struct sock: protocol internal state (sk_receive_queue, sk_write_queue, TCP state machine, sk_data_ready callback). socket.sk points to sock. |
| What does tcp_poll() do when called from do_select()? | Calls poll_wait() to register a poll_table_entry on sk_wq→wait, then checks current readiness: data waiting in the receive queue → EPOLLIN; send buffer has space → EPOLLOUT. Returns the bitmask to do_select(). |
| How does poll_wait() differ from ep_ptable_queue_proc()? | poll_wait() is the common entry point: it calls the poll_table's _qproc. Under select() that is __pollwait(), which runs on every select() call that has to sleep; the entry is added and then removed on return. Under epoll it is ep_ptable_queue_proc(), which runs once at epoll_ctl(ADD); the eppoll_entry stays installed across all subsequent epoll_wait() calls. |
| dup(fd) then close(fd): is it still in epoll? | Yes. The epitem key is {file*, fd}. dup() increments f_count to 2; close(fd) decrements it to 1, so the file stays alive via the dup'd fd. epoll continues monitoring and ep_poll_callback() can still fire. |
| What is sk_data_ready and when is it called? | A function pointer on struct sock, default = sock_def_readable(). Called by the TCP receive path after an skb is added to sk_receive_queue. sock_def_readable() calls wake_up_interruptible_sync_poll() on sk_wq→wait, triggering ep_poll_callback() or pollwake(). |