Part XXV — I/O Multiplexing

§ 36 — Kernel Wait Queue Internals

wait_queue_head_t & wait_queue_entry_t: lock, head, flags, private, func (§36.1) · How a process enters sleep: select O(n) vs epoll O(1) registration (§36.2) · __wake_up_common: pollwake O(N) vs ep_poll_callback O(1) (§36.3) · WQ_FLAG_EXCLUSIVE & EPOLLEXCLUSIVE: defeating thundering herd (§36.4) · Cleanup: poll_freewait vs epoll register-once (§36.5)

§ 36.1 — wait_queue_head_t & wait_queue_entry_t

Every blocking object in the Linux kernel — a socket, a pipe, a timer — owns a wait_queue_head_t. Processes that need to sleep until the object becomes ready add a wait_queue_entry_t to that list, then call schedule(). When data arrives the object calls wake_up(), which iterates the list and invokes each entry's func callback.

wait_queue_head Linked List

A socket's sk_wq→wait is a typical head. At any moment it may hold entries from multiple select processes and an epoll instance simultaneously — each entry has a different func that controls what happens at wakeup.

wait_queue_entry_t Field Annotations

The func field is the kernel's primary extensibility hook: select installs pollwake, epoll installs ep_poll_callback. The same wakeup machinery handles both, yet the per-entry callback determines whether waking means “rescan all fds” (select) or “add one epitem to rdllist” (epoll).

Struct Definitions

struct wait_queue_head {
    spinlock_t        lock;
    struct list_head  head;
};

struct wait_queue_entry {
    unsigned int        flags;   /* WQ_FLAG_EXCLUSIVE etc. */
    void               *private; /* sleeping task_struct * */
    wait_queue_func_t   func;    /* callback on wakeup */
    struct list_head    entry;   /* link in wait_queue_head.head */
};
FieldTypePurpose
wait_queue_head_t.lockspinlock_tProtects the list; interrupt-safe — can be taken from IRQ context
wait_queue_head_t.headlist_headHead of the doubly-linked entry list
wait_queue_entry_t.flagsunsigned intWQ_FLAG_EXCLUSIVE and friends — controls how many waiters are woken
wait_queue_entry_t.privatevoid *Sleeping process's task_struct *; for epoll it is eppoll_entry *
wait_queue_entry_t.funcwait_queue_func_tWakeup callback: select → pollwake(), epoll → ep_poll_callback()
wait_queue_entry_t.entrylist_headLink node joining this entry into wait_queue_head.head

§ 36.2 — How a Process Enters Sleep

Both select and epoll must register on socket wait queues to be notified when data arrives — but when and how many entries they insert differ fundamentally. select inserts one entry per fd at every call; epoll inserts entries at epoll_ctl(ADD) time and only one entry on its own wait queue at epoll_wait.

select() Sleep Path

epoll_wait() Sleep Path

Side-by-Side: Registration Cost

Interactive: Wait Queue Registration Visualizer

Switch between select and epoll mode, adjust the number of monitored fds, and call the syscall to see how many wait queue entries are inserted — and how much memory is consumed per call.

Demo 3515 — Wait Queue Registration Visualizer
Fds monitored: 10
Press the button to simulate wait queue registration
Kernel log
Choose a mode and click the button.
select
N entries per call
O(n)
epoll
1 entry per call
O(1)

§ 36.3 — How a Process is Woken Up

When data arrives on a socket, sk_data_ready() calls wake_up(), which runs __wake_up_common() — a single loop over the socket's wait queue. Every entry's func callback fires. For select that means re-scanning all N fds; for epoll it means appending one epitem to the ready list in O(1).

__wake_up_common Traversal

pollwake vs ep_poll_callback

Interactive: Wakeup Callback Comparison

Set N (monitored fds) and trigger a data-arrives event. Watch select scan all N fds via vfs_poll() while epoll adds a single epitem to rdllist in O(1).

Demo 3516 — Wakeup Callback Comparison
Fds monitored (N): 100
101000
sk_data_ready → wake_up_interruptible_sync_poll → __wake_up_common [shared path]
select
pollwake()
→ try_to_wake_up(task)
→ vfs_poll() × N fds
vfs_poll() scan0 / 100
vfs_poll calls: 0
complexity: O(N)
epoll
ep_poll_callback()
→ list_add_tail(epitem, rdllist)
→ wake_up(&ep→wq)
ep→rdllist: [ empty ]
rdllist ops: 0
complexity: O(1)
Kernel log
Adjust fds and click 'Data arrives on fd=42'.

§ 36.4 — WQ_FLAG_EXCLUSIVE & Thundering Herd

When multiple processes wait on the same event — for example, worker threads all blocked on accept() for the same listen socket — a single new connection wakes all of them. Only one can succeed; the rest return EAGAIN and sleep again. This thundering herd wastes N−1 context switches per event. WQ_FLAG_EXCLUSIVE fixes it by telling __wake_up_common to stop after waking the first exclusive waiter.

Without WQ_FLAG_EXCLUSIVE — Thundering Herd

With WQ_FLAG_EXCLUSIVE

MechanismWithout EXCLUSIVEWith WQ_FLAG_EXCLUSIVE
Event arrivesAll N processes wakeOnly 1 process wakes
accept() result1 success, N−1 EAGAIN1 success, others stay sleeping
Context switchesN1
epoll interfaceEPOLLEXCLUSIVE flag
Kernel impladd_wait_queue() — inserts at headadd_wait_queue_exclusive() — inserts at tail after non-exclusive entries

EPOLLEXCLUSIVE Example

/* EPOLLEXCLUSIVE example */
struct epoll_event ev = {
    .events = EPOLLIN | EPOLLEXCLUSIVE,
    .data.fd = listenfd,
};
epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);
/* kernel: eppoll_entry.wait.flags |= WQ_FLAG_EXCLUSIVE
   __wake_up_common stops after 1 exclusive entry is woken */

add_wait_queue_exclusive() inserts the entry at the tail of the wait queue, after all non-exclusive entries. This ensures non-exclusive waiters (e.g. poll() monitors) are still woken first, while exclusive waiters compete only one-at-a-time.

§ 36.5 — Cleanup: Removing Wait Queue Entries

The asymmetry between select and epoll is most visible in cleanup cost. Every select() call must add N entries before sleeping and remove all N entries on return — even if no fd was ready. At high call rates this dominates CPU time. epoll socket entries are installed once at epoll_ctl(ADD) and survive across all subsequent epoll_wait() calls; each wait only touches a single entry on ep->wq.

select / poll — Per-Call Overhead

epoll — Register Once, Wait Many Times

Operationselect (N=1000, 10K calls/sec)epoll
add_wait_queue / sec10,000,0000 — registered once at epoll_ctl time
remove_wait_queue / sec10,000,0000
Memory alloc / sec10M × poll_table_entry0 — entries persist between calls
Entry lifetimeOne select() callEntire fd registration period

select Cleanup — poll_freewait()

/* select cleanup on return (net/core/poll.c) */
void poll_freewait(struct poll_wqueues *pwq) {
    struct poll_table_page *p = pwq->table;
    while (p) {
        struct poll_table_entry *e = p->entries;
        remove_wait_queue(e->wait_address, &e->wait); /* N times */
        fput(e->filp);
        e++;
        /* free page */
    }
}
/* epoll_wait return: no such cleanup */

Interview Prep — Must-Know Questions

QuestionKey answer
Two fields of wait_queue_head_t?spinlock_t lock (interrupt-safe protection — can be taken from IRQ context) and list_head head (the doubly-linked list of wait_queue_entry_t nodes).
wait_queue_entry_t.func: select vs epoll?select installs pollwake() — wakes the process which then calls vfs_poll() on all N fds (O(N) scan). epoll installs ep_poll_callback() — appends one epitem to ep->rdllist in O(1), then wakes epoll_wait.
Why does select wakeup cost O(N) even after wake_up() returns?pollwake() calls try_to_wake_up(task), putting the process back on the run queue. When it runs, do_select() calls vfs_poll() on every one of its N fds to find which are ready — there is no ready list.
What is WQ_FLAG_EXCLUSIVE? How does EPOLLEXCLUSIVE use it?WQ_FLAG_EXCLUSIVE tells __wake_up_common to decrement nr_exclusive after waking this entry and stop iterating when it reaches zero. epoll sets eppoll_entry.wait.flags |= WQ_FLAG_EXCLUSIVE so only one epoll instance is woken per listen-socket event.
How many add_wait_queue calls does epoll make per epoll_wait?Exactly 1 — onto ep->wq. Socket eppoll_entries are registered once at epoll_ctl(ADD) time and persist across all subsequent epoll_wait calls.
Walk through __wake_up_common. When does it stop?It holds wqh->lock, then list_for_each_entry_safe over wqh->head. For each entry it calls curr->func(). If the entry has WQ_FLAG_EXCLUSIVE, it decrements nr_exclusive; when nr_exclusive reaches 0 it breaks. Without EXCLUSIVE entries it iterates the entire list.
N=10000 fds, 1000 calls/sec: select vs epoll memory overhead?select: 10,000 × 1,000 = 10M poll_table_entry alloc/free per second (32 B each = 320 MB/sec of churn). epoll: 0 — eppoll_entries are allocated once at epoll_ctl(ADD) time and freed only at epoll_ctl(DEL) or close().