§ 37 — fd → file → socket → wait_queue Chain
Process fd table & fget() O(1) RCU lookup (§37.1) · struct file: f_op vtable, private_data, f_count (§37.2) · struct socket vs struct sock: two-layer split, sk_wq (§37.3) · vfs_poll() & tcp_poll(): select per-call vs epoll permanent registration (§37.4) · f_count lifecycle, dup() / fork() sharing, EPOLLRDHUP (§37.5)
§ 37.1 — The Process fd Table
Every process carries an open file table rooted at task_struct.files. Translating a raw integer fd into a kernel object is a three-pointer dereference ending in an O(1) array lookup — the entire operation runs under a single RCU read lock with no contention in the common case.
task_struct → fdtable Chain
The fdtable is inline for processes with few open files and heap-allocated when the fd count grows past the initial capacity. fd[5] points directly to the struct file for a TCP socket.
fget(5) — Lock-Free fd Lookup
fget() is the kernel's primary fd-to-file translation. It uses RCU (Read-Copy-Update) to read the fd table without taking a spin-lock — the common case is fully lock-free. A reference count increment via get_file() ensures the file object stays alive while the caller holds the reference.
fget() Source (fs/file.c)
struct file *fget(unsigned int fd) {
    struct file *f = NULL;      /* stays NULL if fd is out of range */
    struct fdtable *fdt;

    rcu_read_lock();
    fdt = rcu_dereference_raw(current->files->fdt);
    if (fd < fdt->max_fds) {
        f = rcu_dereference_raw(fdt->fd[fd]);   /* O(1) array lookup */
        if (f)
            get_file(f);        /* f_count++ (simplified: mainline uses an increment-if-not-zero variant) */
    }
    rcu_read_unlock();
    return f;
}

| Struct | Key field | Purpose |
|---|---|---|
| task_struct | .files: *files_struct | Per-process open file table root |
| files_struct | .fdt: *fdtable | Current fd table (inline for small counts, heap for large) |
| fdtable | .fd[]: struct file*[] | O(1) fd → file pointer mapping |
| fdtable | .open_fds: bitmap | Which fd slots are currently open |
| fdtable | .max_fds: unsigned int | Current array capacity (grows dynamically) |
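The lookup and reference counting described above can be modeled in user space. The following is a minimal sketch, not kernel code: `toy_file`, `toy_fdtable`, `toy_fget`, and `toy_fput` are hypothetical names, and a plain `long` stands in for the kernel's atomic counter, so it is single-threaded only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for struct file and struct fdtable. */
struct toy_file {
    long f_count;               /* reference count (atomic in the kernel) */
};

struct toy_fdtable {
    unsigned int max_fds;       /* current capacity of fd[] */
    struct toy_file **fd;       /* fd number -> file pointer */
};

/* Look up fd and take a reference; NULL for an out-of-range or empty slot. */
static struct toy_file *toy_fget(const struct toy_fdtable *fdt, unsigned int fd)
{
    struct toy_file *f = NULL;  /* must start as NULL: the bounds check may fail */

    if (fd < fdt->max_fds) {
        f = fdt->fd[fd];        /* O(1) array index */
        if (f)
            f->f_count++;       /* the caller now owns one reference */
    }
    return f;
}

/* Drop a reference; returns 1 when the last reference is gone. */
static int toy_fput(struct toy_file *f)
{
    return --f->f_count == 0;
}
```

With a table of eight slots and only slot 5 populated, `toy_fget(&fdt, 5)` returns the file and bumps its count, while slot 6 and any fd at or beyond `max_fds` yield `NULL`.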
§ 37.2 — struct file: The VFS Layer
struct file is the VFS abstraction that sits between the raw fd integer and any concrete resource — a socket, a regular file, a pipe, a device node. Every open file descriptor in the kernel is represented by exactly one struct file. The two fields that matter most for networking are f_op (the vtable) and private_data (the concrete object pointer).
struct file Field Layout
f_op Vtable — Polymorphic Dispatch
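vfs_poll(file, pt) calls file->f_op->poll(file, pt): the same call site reaches sock_poll for a socket and pipe_poll for a pipe. When f_op has no poll method, as for regular files, vfs_poll() returns DEFAULT_POLLMASK (always readable and writable) without dispatching anywhere, and epoll_ctl() rejects such files with EPERM. This is how epoll monitors heterogeneous fd types through a single interface.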
| Field | Type | Value for a socket |
|---|---|---|
| f_op | *file_operations | &socket_file_ops |
| f_inode | *inode | socket inode (type S_IFSOCK) |
| private_data | void * | struct socket * (the actual socket object) |
| f_flags | unsigned int | O_NONBLOCK etc. set by fcntl() |
| f_count | atomic_long_t | reference count; freed when 0 via fput() |
| f_pos | loff_t | file offset; meaningless for sockets, always 0 |
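The f_op dispatch described above is an ordinary function-pointer table. The following sketch illustrates the pattern in user space; every `toy_` name and the bit values are hypothetical and chosen for illustration, not taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Toy readiness bits; the values are illustrative, not the kernel's. */
#define TOY_POLLIN       0x1u
#define TOY_POLLOUT      0x4u
#define TOY_DEFAULT_MASK (TOY_POLLIN | TOY_POLLOUT)

struct toy_file;

struct toy_file_operations {
    unsigned int (*poll)(struct toy_file *file);    /* may be NULL */
};

struct toy_file {
    const struct toy_file_operations *f_op;         /* the vtable */
    void *private_data;                             /* the concrete object */
};

struct toy_socket {
    int rx_bytes;                                   /* bytes waiting to be read */
};

/* Socket-specific poll: always writable, readable when data is queued. */
static unsigned int toy_sock_poll(struct toy_file *file)
{
    struct toy_socket *sock = file->private_data;
    unsigned int mask = TOY_POLLOUT;

    if (sock->rx_bytes > 0)
        mask |= TOY_POLLIN;
    return mask;
}

static const struct toy_file_operations toy_socket_file_ops  = { .poll = toy_sock_poll };
static const struct toy_file_operations toy_regular_file_ops = { .poll = NULL };

/* One call site for every file type; falls back when poll is absent. */
static unsigned int toy_vfs_poll(struct toy_file *file)
{
    if (!file->f_op->poll)
        return TOY_DEFAULT_MASK;
    return file->f_op->poll(file);
}
```

The caller never inspects the file type: swapping `f_op` is enough to change what `toy_vfs_poll()` reaches.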
sock_poll Entry Point
/* net/socket.c: VFS poll entry for all sockets (abbreviated) */
static __poll_t sock_poll(struct file *file, poll_table *wait)
{
    struct socket *sock = file->private_data;       /* O(1) cast */
    const struct proto_ops *ops = READ_ONCE(sock->ops);

    if (!ops->poll)
        return 0;                                   /* protocol has no poll method */
    return ops->poll(file, sock, wait);             /* e.g. tcp_poll() */
}
/* How epoll/select reaches tcp_poll():
* vfs_poll(file, pt)
* → file->f_op->poll() = sock_poll()
* → sock->ops->poll() = tcp_poll() (for TCP sockets)
* → poll_wait() + readiness bitmask
*/
§ 37.3 — struct socket & struct sock
A socket is represented by two kernel structs. struct socket is the POSIX/BSD-facing layer — it holds the state and vtable that userspace syscalls touch. struct sock is the protocol-internal layer — it owns the receive and send queues, TCP state machine, and the wait queue where blocking callers sleep. The two are linked by socket.sk.
Two-Level Socket Structure
Full fd → wait_queue Chain
Every hop in this chain is a single pointer dereference. The final destination — wait_queue_head_t — is where both select's poll_table_entry and epoll's eppoll_entry are installed. Knowing this chain answers the interview question: “how does an arriving packet eventually wake a sleeping process?”
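The chain can be written out as plain pointer hops. This is a user-space sketch under stated assumptions: the `toy_` structs are hypothetical, keep only the linking fields, and follow the pointer-based layout used in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down mirrors of the kernel structs: links only. */
struct toy_wait_queue_head { int nr_waiters; };
struct toy_socket_wq { struct toy_wait_queue_head wait; };
struct toy_sock      { struct toy_socket_wq *sk_wq; };
struct toy_socket    { struct toy_sock *sk; struct toy_socket_wq *wq; };
struct toy_file      { void *private_data; };

/* fd -> file -> socket -> sock -> socket_wq -> wait_queue_head */
static struct toy_wait_queue_head *
toy_fd_to_waitq(struct toy_file **fd_array, unsigned int max_fds, unsigned int fd)
{
    if (fd >= max_fds || !fd_array[fd])
        return NULL;                                /* closed or out-of-range fd */

    struct toy_file   *file = fd_array[fd];         /* hop 1: fd table slot */
    struct toy_socket *sock = file->private_data;   /* hop 2: VFS -> socket */
    struct toy_sock   *sk   = sock->sk;             /* hop 3: socket -> sock */

    return &sk->sk_wq->wait;                        /* hop 4: sock -> wait queue */
}
```

Because `socket.wq` and `sock.sk_wq` refer to the same object, starting from either struct ends at the same wait queue head.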
Struct Definitions (abbreviated)
/* include/linux/net.h */
struct socket {
    socket_state            state;              /* SS_CONNECTED, SS_UNCONNECTED ... */
    short                   type;               /* SOCK_STREAM, SOCK_DGRAM ... */
    const struct proto_ops *ops;                /* vtable: bind/connect/accept/poll */
    struct file            *file;               /* back-pointer to struct file */
    struct sock            *sk;                 /* → protocol layer */
    struct socket_wq       *wq;                 /* wait queue + fasync */
};
/* include/net/sock.h */
struct sock {
    struct sk_buff_head     sk_receive_queue;   /* received skbs */
    struct sk_buff_head     sk_write_queue;     /* outbound skbs */
    struct socket_wq       *sk_wq;              /* same wq as socket.wq */
    void                  (*sk_data_ready)(struct sock *);  /* = sock_def_readable */
    unsigned char           sk_state;           /* TCP_ESTABLISHED etc. */
    int                     sk_sndbuf;
    int                     sk_rcvbuf;
};
/* include/linux/net.h */
struct socket_wq {
    wait_queue_head_t       wait;               /* where waiters sleep */
    struct fasync_struct   *fasync_list;        /* SIGIO async notification */
};

| Field | Struct | Purpose |
|---|---|---|
| socket.ops | struct socket | proto_ops vtable: POSIX interface (bind/connect/accept/poll) |
| socket.sk | struct socket | Pointer to protocol-layer sock |
| socket.wq | struct socket | Wait queue + async notification (shared with sock.sk_wq) |
| sock.sk_receive_queue | struct sock | sk_buff linked list: received data awaiting read() |
| sock.sk_wq | struct sock | Wait queue + async notification (same object as socket.wq) |
| sock.sk_data_ready | struct sock | Callback on data arrival; default: sock_def_readable() |
| socket_wq.wait | struct socket_wq | wait_queue_head_t: where eppoll_entry / poll_table_entry attach |
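To make the roles of sk_data_ready and the wait queue concrete, here is a toy model of the wake-up path. It is a sketch under simplifying assumptions: all `toy_` names are hypothetical, the queue is a fixed array, and "waking" just bumps a counter instead of scheduling a task.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_WAITERS 4

/* One waiter on the queue; func plays the role of the wake callback. */
struct toy_wait_entry {
    void (*func)(struct toy_wait_entry *entry);
    int woken;                                  /* times this waiter was woken */
};

struct toy_wait_queue_head {
    struct toy_wait_entry *entries[TOY_MAX_WAITERS];
    int count;
};

struct toy_sock {
    int rx_queue_len;                           /* stands in for sk_receive_queue */
    struct toy_wait_queue_head *wq;             /* stands in for sk_wq->wait */
    void (*sk_data_ready)(struct toy_sock *sk); /* data-arrival callback */
};

static void toy_mark_woken(struct toy_wait_entry *entry)
{
    entry->woken++;
}

static int toy_add_waiter(struct toy_wait_queue_head *wq, struct toy_wait_entry *e)
{
    if (wq->count >= TOY_MAX_WAITERS)
        return -1;
    wq->entries[wq->count++] = e;
    return 0;
}

/* Default data-ready handler: run every waiter's callback. */
static void toy_def_readable(struct toy_sock *sk)
{
    for (int i = 0; i < sk->wq->count; i++)
        sk->wq->entries[i]->func(sk->wq->entries[i]);
}

/* Receive path: queue the data first, then notify through the callback. */
static void toy_receive(struct toy_sock *sk)
{
    sk->rx_queue_len++;
    sk->sk_data_ready(sk);
}
```

The receive path never knows who is waiting; it only calls the function pointer, and each waiter's own callback decides what "wake" means.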
§ 37.4 — vfs_poll() & tcp_poll()
vfs_poll() is the single call site that bridges the VFS layer and any underlying resource. For a TCP socket it dispatches to sock_poll → tcp_poll, which both registers a wait entry and returns a readiness bitmask. The crucial difference between select and epoll is when that registration happens: on every call for select, once at epoll_ctl(ADD) time for epoll.
do_select → vfs_poll Call Chain
For each fd, do_select() calls vfs_poll() on every iteration. Inside tcp_poll(), poll_wait() invokes the poll_table's _qproc, which for select is __pollwait(); it adds a poll_table_entry to the socket's wait queue. poll_freewait() removes that entry before select() returns.
epoll_ctl ADD Call Chain
epoll_ctl(ADD) also calls vfs_poll(), but the poll_table's _qproc is set to ep_ptable_queue_proc — which installs an eppoll_entry that stays on the wait queue. No add/remove on each epoll_wait().
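The per-call versus permanent registration can be counted in a toy model. This is a sketch, not kernel code: all `toy_` names are hypothetical, and the wait queue is reduced to two counters so the add/remove traffic is visible.

```c
#include <assert.h>
#include <stddef.h>

struct toy_wait_queue_head {
    int  nr_entries;    /* entries currently on the queue */
    long total_adds;    /* how many times an entry was ever added */
};

struct toy_poll_table;
typedef void (*toy_queue_proc)(struct toy_wait_queue_head *wq,
                               struct toy_poll_table *pt);

struct toy_poll_table {
    toy_queue_proc qproc;   /* NULL means "report readiness only" */
};

/* Shared registration step: put one entry on the queue. */
static void toy_add_entry(struct toy_wait_queue_head *wq, struct toy_poll_table *pt)
{
    (void)pt;
    wq->nr_entries++;
    wq->total_adds++;
}

/* Like poll_wait(): register only when the table carries a callback. */
static void toy_poll_wait(struct toy_wait_queue_head *wq, struct toy_poll_table *pt)
{
    if (pt && pt->qproc)
        pt->qproc(wq, pt);
}

static unsigned int toy_tcp_poll(struct toy_wait_queue_head *wq, struct toy_poll_table *pt)
{
    toy_poll_wait(wq, pt);
    return 0;               /* readiness mask; nothing is ready in this sketch */
}

/* select-style: register, (sleep), then tear down on every call. */
static void toy_select(struct toy_wait_queue_head *wq)
{
    struct toy_poll_table pt = { .qproc = toy_add_entry };

    toy_tcp_poll(wq, &pt);
    wq->nr_entries--;       /* teardown before returning */
}

/* epoll-style: register once at ADD time... */
static void toy_epoll_ctl_add(struct toy_wait_queue_head *wq)
{
    struct toy_poll_table pt = { .qproc = toy_add_entry };

    toy_tcp_poll(wq, &pt);
}

/* ...and only query readiness afterwards (no callback, so no registration). */
static void toy_epoll_wait(struct toy_wait_queue_head *wq)
{
    struct toy_poll_table pt = { .qproc = NULL };

    toy_tcp_poll(wq, &pt);
}
```

Three `toy_select()` calls cost three adds and three removals on the queue, while one `toy_epoll_ctl_add()` followed by any number of `toy_epoll_wait()` calls costs a single add.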
Demo 3517 — fd Chain Explorer
Walk the kernel pointer chain from a raw fd integer to the wait_queue_head_t step by step. Enter any fd, click Trace, then step through each dereference.
Demo 3518 — epoll_ctl ADD Step-by-Step
Step through the seven kernel actions that run inside a single epoll_ctl(EPOLL_CTL_ADD) call — from the syscall entry to the final RB-tree insert. Watch when the eppoll_entry lands on sk_wq and when the epitem appears in the tree.
§ 37.5 — Reference Counting & fd Sharing
struct file is reference-counted via f_count (an atomic_long_t). The file object is freed only when the count reaches zero — not when any particular fd is closed. This has a non-obvious consequence for epoll: the epitem key is {file*, fd}, and the file can stay alive long after the original fd is closed if other fds share it via dup() or fork().
f_count Lifecycle
The dup() + epoll Bug
After dup(5), both fd=5 and fd=6 point to the same struct file. close(5) decrements f_count to 1, so the file object survives via fd=6. The epitem, keyed by {file*, fd}, is only removed when the struct file itself is released, so it keeps firing events that report fd=5 even though fd=5 is no longer usable, and epoll_ctl(epfd, EPOLL_CTL_DEL, 5, NULL) now fails with EBADF. The only safe fix is epoll_ctl(DEL, 5) before close(5).
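The behavior can be reproduced from user space on Linux. The sketch below is an assumption-laden demonstration rather than a test of TCP itself: it uses an AF_UNIX socketpair as a stand-in for the TCP socket, and the function name `stale_event_after_dup_close` is hypothetical.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns 1 if epoll still reports the closed fd number after dup() kept
 * its struct file alive, 0 if it does not, and -1 on a setup failure.
 */
static int stale_event_after_dup_close(void)
{
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int result = -1;
    int watched = sv[0];            /* plays the role of fd=5 */
    int keeper = -1;                /* plays the role of fd=6 */
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = watched };

    if (epfd >= 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, watched, &ev) == 0 &&
        (keeper = dup(watched)) >= 0) {
        close(watched);             /* f_count 2 -> 1: the struct file survives */
        sv[0] = -1;                 /* remember that it is already closed */

        struct epoll_event out;
        if (write(sv[1], "x", 1) == 1) {
            /* Data arrives on the shared file; the epitem is still registered. */
            int n = epoll_wait(epfd, &out, 1, 1000);
            result = (n == 1 && out.data.fd == watched);
        }
    }

    if (sv[0] >= 0)
        close(sv[0]);
    if (keeper >= 0)
        close(keeper);
    close(sv[1]);
    if (epfd >= 0)
        close(epfd);
    return result;
}
```

A return value of 1 means the event loop was handed an fd number that the process has already closed, which is exactly the hazard described above; deleting the registration before `close()` avoids it.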
EPOLLRDHUP — Detecting Remote Half-Close
/* Detect remote half-close (shutdown(WR) on the other end) */
struct epoll_event ev = {
.events = EPOLLIN | EPOLLRDHUP,
.data.fd = fd,
};
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
/*
* Without EPOLLRDHUP:
* epoll_wait returns EPOLLIN; you call read(); read() returns 0 → detect close
*
* With EPOLLRDHUP:
* epoll_wait returns EPOLLIN | EPOLLRDHUP immediately
* No extra read() needed to detect the half-close
* Catches TCP FIN from the remote peer (shutdown(SHUT_WR))
*/

| Operation | f_count change | epoll effect |
|---|---|---|
| open() / socket() | → 1 | none |
| dup(fd) | +1 (two fds share same struct file) | Both fds reach the same struct file; an epitem registered under either fd fires for activity on it |
| fork() | +1 per open fd | Child's inherited fds also monitored by parent's epoll |
| close(fd) with f_count > 1 | −1, file stays alive | epoll continues monitoring; epitem not removed |
| close(fd) with f_count → 0 | → 0, file freed | Kernel auto-removes epitem from RB-tree |
| epoll_ctl(DEL, fd) | no change | epitem explicitly removed; safe even if f_count > 1 |
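That dup() shares one struct file rather than copying it is observable without epoll: file status flags live in the shared object. The following sketch uses a pipe for convenience; the function name `dup_shares_file_flags` is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if O_NONBLOCK set through a dup()'d fd is visible through the
 * original fd, 0 if it is not, and -1 on a setup failure.
 */
static int dup_shares_file_flags(void)
{
    int p[2];

    if (pipe(p) < 0)
        return -1;

    int result = -1;
    int copy = dup(p[0]);           /* second fd, same open file description */

    if (copy >= 0) {
        int before = fcntl(p[0], F_GETFL);

        /* Set the flag through the copy only. */
        if (before >= 0 && !(before & O_NONBLOCK) &&
            fcntl(copy, F_SETFL, before | O_NONBLOCK) == 0) {
            int after = fcntl(p[0], F_GETFL);   /* read it back via the original */

            if (after >= 0)
                result = (after & O_NONBLOCK) != 0;
        }
        close(copy);
    }

    close(p[0]);
    close(p[1]);
    return result;
}
```

The same sharing is why closing one of the two fds leaves the file, and any epoll registration on it, fully alive.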
fput() — Decrement and Free
/* fs/file_table.c (simplified: mainline fput() defers this work to __fput()) */
void fput(struct file *file)
{
    if (atomic_long_dec_and_test(&file->f_count)) {
        /* f_count reached 0: final cleanup, in the order __fput() performs it */
        /* 1. eventpoll_release(): drop any epitems still registered for this file */
        if (file->f_op->release)
            file->f_op->release(file->f_inode, file);   /* 2. sock_close() for sockets */
        /* 3. dentry put, inode release, memory freed */
    }
}
/* get_file() is the inverse; used by fget() and dup */
static inline struct file *get_file(struct file *f)
{
    atomic_long_inc(&f->f_count);
    return f;
}

Interview Prep — Must-Know Questions
| Question | Key answer |
|---|---|
| How does the kernel go from fd=5 to struct file *? | current->files->fdt->fd[5] — O(1) array lookup under RCU read lock. fget() increments f_count so the file object stays alive during the operation. |
| What does private_data in a socket's struct file point to? | struct socket * — the POSIX/BSD-facing socket object. It is the bridge from the VFS layer to the protocol stack. |
| Difference between struct socket and struct sock? | struct socket: POSIX/BSD API layer (ops vtable, state, wq, back-pointer to file). struct sock: protocol internal state (sk_receive_queue, sk_write_queue, TCP state machine, sk_data_ready callback). socket.sk points to sock. |
| What does tcp_poll() do when called from do_select()? | Calls poll_wait() to register a poll_table_entry on sk_wq→wait, then checks current readiness: sk_receive_queue non-empty → EPOLLIN; send buffer has space → EPOLLOUT. Returns the bitmask to do_select. |
| How does select's __pollwait() differ from epoll's ep_ptable_queue_proc()? | Both are _qproc callbacks invoked through poll_wait(). __pollwait() runs on every select() call: it adds a poll_table_entry that poll_freewait() removes before select() returns. ep_ptable_queue_proc() runs once at epoll_ctl(ADD); the eppoll_entry it installs is permanent and survives all subsequent epoll_wait() calls. |
| dup(fd) then close(fd) — is it still in epoll? | Yes. The epitem key is {file*, fd}. dup() increments f_count to 2; close(fd) decrements to 1 — the file stays alive via the dup'd fd. epoll continues monitoring and ep_poll_callback() can still fire. |
| What is sk_data_ready and when is it called? | A function pointer on struct sock, default = sock_def_readable(). Called by the TCP receive path after an skb is added to sk_receive_queue. sock_def_readable() calls wake_up_interruptible_sync_poll() on sk_wq→wait, which runs each waiter's wake function: ep_poll_callback() for epoll, pollwake() for select/poll. |