§ 36 — Kernel Wait Queue Internals
wait_queue_head_t & wait_queue_entry_t: lock, head, flags, private, func (§36.1) · How a process enters sleep: select O(n) vs epoll O(1) registration (§36.2) · __wake_up_common: pollwake O(N) vs ep_poll_callback O(1) (§36.3) · WQ_FLAG_EXCLUSIVE & EPOLLEXCLUSIVE: defeating thundering herd (§36.4) · Cleanup: poll_freewait vs epoll register-once (§36.5)
§ 36.1 — wait_queue_head_t & wait_queue_entry_t
Every blocking object in the Linux kernel — a socket, a pipe, a timer — owns a wait_queue_head_t. Processes that need to sleep until the object becomes ready add a wait_queue_entry_t to that list, then call schedule(). When data arrives the object calls wake_up(), which iterates the list and invokes each entry's func callback.
wait_queue_head Linked List
A socket's sk_wq→wait is a typical head. At any moment it may hold entries from multiple select processes and an epoll instance simultaneously — each entry has a different func that controls what happens at wakeup.
wait_queue_entry_t Field Annotations
The func field is the kernel's primary extensibility hook: select installs pollwake, epoll installs ep_poll_callback. The same wakeup machinery handles both, yet the per-entry callback determines whether waking means “rescan all fds” (select) or “add one epitem to rdllist” (epoll).
Struct Definitions
struct wait_queue_head {
spinlock_t lock;
struct list_head head;
};
struct wait_queue_entry {
unsigned int flags; /* WQ_FLAG_EXCLUSIVE etc. */
void *private; /* sleeping task_struct * */
wait_queue_func_t func; /* callback on wakeup */
struct list_head entry; /* link in wait_queue_head.head */
};| Field | Type | Purpose |
|---|---|---|
wait_queue_head_t.lock | spinlock_t | Protects the list; interrupt-safe — can be taken from IRQ context |
wait_queue_head_t.head | list_head | Head of the doubly-linked entry list |
wait_queue_entry_t.flags | unsigned int | WQ_FLAG_EXCLUSIVE and friends — controls how many waiters are woken |
wait_queue_entry_t.private | void * | Sleeping process's task_struct *; for epoll it is eppoll_entry * |
wait_queue_entry_t.func | wait_queue_func_t | Wakeup callback: select → pollwake(), epoll → ep_poll_callback() |
wait_queue_entry_t.entry | list_head | Link node joining this entry into wait_queue_head.head |
§ 36.2 — How a Process Enters Sleep
Both select and epoll must register on socket wait queues to be notified when data arrives — but when and how many entries they insert differ fundamentally. select inserts one entry per fd at every call; epoll inserts entries at epoll_ctl(ADD) time and only one entry on its own wait queue at epoll_wait.
select() Sleep Path
epoll_wait() Sleep Path
Side-by-Side: Registration Cost
Interactive: Wait Queue Registration Visualizer
Switch between select and epoll mode, adjust the number of monitored fds, and call the syscall to see how many wait queue entries are inserted — and how much memory is consumed per call.
§ 36.3 — How a Process is Woken Up
When data arrives on a socket, sk_data_ready() calls wake_up(), which runs __wake_up_common() — a single loop over the socket's wait queue. Every entry's func callback fires. For select that means re-scanning all N fds; for epoll it means appending one epitem to the ready list in O(1).
__wake_up_common Traversal
pollwake vs ep_poll_callback
Interactive: Wakeup Callback Comparison
Set N (monitored fds) and trigger a data-arrives event. Watch select scan all N fds via vfs_poll() while epoll adds a single epitem to rdllist in O(1).
§ 36.4 — WQ_FLAG_EXCLUSIVE & Thundering Herd
When multiple processes wait on the same event — for example, worker threads all blocked on accept() for the same listen socket — a single new connection wakes all of them. Only one can succeed; the rest return EAGAIN and sleep again. This thundering herd wastes N−1 context switches per event. WQ_FLAG_EXCLUSIVE fixes it by telling __wake_up_common to stop after waking the first exclusive waiter.
Without WQ_FLAG_EXCLUSIVE — Thundering Herd
With WQ_FLAG_EXCLUSIVE
| Mechanism | Without EXCLUSIVE | With WQ_FLAG_EXCLUSIVE |
|---|---|---|
| Event arrives | All N processes wake | Only 1 process wakes |
| accept() result | 1 success, N−1 EAGAIN | 1 success, others stay sleeping |
| Context switches | N | 1 |
| epoll interface | — | EPOLLEXCLUSIVE flag |
| Kernel impl | add_wait_queue() — inserts at head | add_wait_queue_exclusive() — inserts at tail after non-exclusive entries |
EPOLLEXCLUSIVE Example
/* EPOLLEXCLUSIVE example */
struct epoll_event ev = {
.events = EPOLLIN | EPOLLEXCLUSIVE,
.data.fd = listenfd,
};
epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);
/* kernel: eppoll_entry.wait.flags |= WQ_FLAG_EXCLUSIVE
__wake_up_common stops after 1 exclusive entry is woken */add_wait_queue_exclusive() inserts the entry at the tail of the wait queue, after all non-exclusive entries. This ensures non-exclusive waiters (e.g. poll() monitors) are still woken first, while exclusive waiters compete only one-at-a-time.
§ 36.5 — Cleanup: Removing Wait Queue Entries
The asymmetry between select and epoll is most visible in cleanup cost. Every select() call must add N entries before sleeping and remove all N entries on return — even if no fd was ready. At high call rates this dominates CPU time. epoll socket entries are installed once at epoll_ctl(ADD) and survive across all subsequent epoll_wait() calls; each wait only touches a single entry on ep->wq.
select / poll — Per-Call Overhead
epoll — Register Once, Wait Many Times
| Operation | select (N=1000, 10K calls/sec) | epoll |
|---|---|---|
| add_wait_queue / sec | 10,000,000 | 0 — registered once at epoll_ctl time |
| remove_wait_queue / sec | 10,000,000 | 0 |
| Memory alloc / sec | 10M × poll_table_entry | 0 — entries persist between calls |
| Entry lifetime | One select() call | Entire fd registration period |
select Cleanup — poll_freewait()
/* select cleanup on return (net/core/poll.c) */
void poll_freewait(struct poll_wqueues *pwq) {
struct poll_table_page *p = pwq->table;
while (p) {
struct poll_table_entry *e = p->entries;
remove_wait_queue(e->wait_address, &e->wait); /* N times */
fput(e->filp);
e++;
/* free page */
}
}
/* epoll_wait return: no such cleanup */Interview Prep — Must-Know Questions
| Question | Key answer |
|---|---|
| Two fields of wait_queue_head_t? | spinlock_t lock (interrupt-safe protection — can be taken from IRQ context) and list_head head (the doubly-linked list of wait_queue_entry_t nodes). |
| wait_queue_entry_t.func: select vs epoll? | select installs pollwake() — wakes the process which then calls vfs_poll() on all N fds (O(N) scan). epoll installs ep_poll_callback() — appends one epitem to ep->rdllist in O(1), then wakes epoll_wait. |
| Why does select wakeup cost O(N) even after wake_up() returns? | pollwake() calls try_to_wake_up(task), putting the process back on the run queue. When it runs, do_select() calls vfs_poll() on every one of its N fds to find which are ready — there is no ready list. |
| What is WQ_FLAG_EXCLUSIVE? How does EPOLLEXCLUSIVE use it? | WQ_FLAG_EXCLUSIVE tells __wake_up_common to decrement nr_exclusive after waking this entry and stop iterating when it reaches zero. epoll sets eppoll_entry.wait.flags |= WQ_FLAG_EXCLUSIVE so only one epoll instance is woken per listen-socket event. |
| How many add_wait_queue calls does epoll make per epoll_wait? | Exactly 1 — onto ep->wq. Socket eppoll_entries are registered once at epoll_ctl(ADD) time and persist across all subsequent epoll_wait calls. |
| Walk through __wake_up_common. When does it stop? | It holds wqh->lock, then list_for_each_entry_safe over wqh->head. For each entry it calls curr->func(). If the entry has WQ_FLAG_EXCLUSIVE, it decrements nr_exclusive; when nr_exclusive reaches 0 it breaks. Without EXCLUSIVE entries it iterates the entire list. |
| N=10000 fds, 1000 calls/sec: select vs epoll memory overhead? | select: 10,000 × 1,000 = 10M poll_table_entry alloc/free per second (32 B each = 320 MB/sec of churn). epoll: 0 — eppoll_entries are allocated once at epoll_ctl(ADD) time and freed only at epoll_ctl(DEL) or close(). |