§ 34 — epoll Deep Dive: Data Structures · ep_poll_callback · LT/ET · EPOLLONESHOT · io_uring · Reactor/Proactor
eventpoll → rbr (RB-tree) → epitem → eppoll_entry → sk_wq chain (§34.1) · ep_poll_callback: why epoll_wait is O(k) not O(n) (§34.2) · Level Trigger vs Edge Trigger with drain loop (§34.3) · EPOLLONESHOT and EPOLLEXCLUSIVE thundering-herd fix (§34.4) · ET pitfalls: write readiness, accept loop, self-pipe (§34.5) · epoll vs kqueue vs IOCP vs io_uring syscall comparison (§34.6) · Reactor and Proactor patterns (§34.7) · Blocking → poll → epoll LT → epoll ET → io_uring echo server evolution (§34.8)
1. § 34.1 — epoll Data Structures
Three kernel structs form the backbone of every epoll instance. Understanding them is the prerequisite for every epoll interview question.
struct eventpoll — the top-level container
Created by epoll_create() and backed by an anonymous inode. The returned epfd is a regular file descriptor pointing to this struct.
| Field | Type | Purpose |
|---|---|---|
| rbr | rb_root_cached | Root of the RB-tree holding all registered epitem structs, keyed by {file *, fd}. O(log n) insert/lookup/delete via epoll_ctl(). |
| rdllist | list_head | Doubly-linked list of ready epitems (those with pending I/O events). ep_poll_callback appends here; epoll_wait drains here. |
| lock | spinlock_t | Protects rdllist from concurrent ep_poll_callback calls (softirq context) and epoll_wait draining. |
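| mtx | struct mutex | Protects the RB-tree during epoll_ctl() ADD/MOD/DEL operations and is also held while events are collected. (Very old 2.6 kernels used an rw_semaphore named sem here.) |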
| wq | wait_queue_head_t | Wait queue for processes sleeping in epoll_wait(). ep_poll_callback calls wake_up(&ep→wq) to wake them. |
| poll_wait | wait_queue_head_t | Nested epoll: when an epoll fd is itself registered in another epoll. Used by ep_eventpoll_poll(). |
struct epitem — one per registered fd
Created by epoll_ctl(EPOLL_CTL_ADD). Lives in the RB-tree and optionally in the rdllist when its fd is ready.
| Field | Purpose |
|---|---|
| rbn | rb_node, embeds epitem into eventpoll.rbr. Key = {file *, fd} so the same file* reached through different fd numbers (dup'd fds) can coexist as separate entries. |
| rdllink | list_head — node for inclusion in eventpoll.rdllist when this fd is ready. Added by ep_poll_callback(); removed by ep_send_events(). |
| ep | Back-pointer to parent eventpoll. Used by ep_poll_callback to find rdllist and wq. |
| ffd | {file *, fd} pair identifying the monitored fd. Used as RB-tree key. |
| event | struct epoll_event {__u32 events; __u64 data}. events = EPOLLIN|EPOLLOUT|… mask; data = user-supplied cookie returned on wakeup. |
| pwqlist | List of eppoll_entry structs. One entry per wait queue head the fd has (sockets usually have one: sk_wq). |
struct eppoll_entry — the wait queue hook
The bridge between the monitored socket and the epoll instance. Its wait.func = ep_poll_callback is what fires when the socket becomes ready.
| Field | Purpose |
|---|---|
| whead | wait_queue_head_t * — pointer to the socket's sk_wq.wait head where this entry is registered. Needed to remove the entry on epoll_ctl(DEL). |
| base | epitem * — back-pointer to the owning epitem. ep_poll_callback uses this to reach the eventpoll and rdllist. |
| wait | wait_queue_entry_t with .func = ep_poll_callback. This is the actual entry inserted into the socket's sk_wq. When sk_data_ready fires wake_up(), this func is called. |
Full Data Structure Chain
The complete ownership chain runs from eventpoll down to the socket wait queue. This is what lets epoll scale: the callback is installed at registration time, not at wait time, so each readiness event costs O(1) and epoll_wait costs O(k) in the number of ready fds.
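The chain can be sketched as a small userspace model. This is an illustration only: the `model_*` names are hypothetical, the fields loosely follow the tables above, and the real kernel structs add locking, the RB-tree and real wait queues.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: NOT the kernel's definitions. */
struct model_epitem;

struct model_eventpoll {
    struct model_epitem *rdl_head, *rdl_tail;  /* stands in for rdllist */
    int nready;
};

struct model_epitem {
    int fd;                          /* stands in for ffd */
    unsigned events;                 /* interest mask (event.events) */
    struct model_eventpoll *ep;      /* back-pointer to the owner */
    struct model_epitem *rdl_next;   /* stands in for rdllink */
    int on_rdllist;
};

struct model_eppoll_entry {
    struct model_epitem *base;       /* owning epitem */
    void (*func)(struct model_eppoll_entry *, unsigned key);  /* wait.func */
};

/* Plays the role of ep_poll_callback: O(1) append of one epitem. */
static void model_poll_callback(struct model_eppoll_entry *pwq, unsigned key)
{
    struct model_epitem *epi = pwq->base;
    struct model_eventpoll *ep = epi->ep;

    if (!(key & epi->events) || epi->on_rdllist)
        return;                      /* not interested, or already queued */
    epi->rdl_next = NULL;
    if (ep->rdl_tail)
        ep->rdl_tail->rdl_next = epi;
    else
        ep->rdl_head = epi;
    ep->rdl_tail = epi;
    epi->on_rdllist = 1;
    ep->nready++;
}

/* Plays the role of the socket's wake_up(): call the hooked function. */
static void model_wake(struct model_eppoll_entry *pwq, unsigned key)
{
    pwq->func(pwq, key);
}
```

Waking an entry twice queues its epitem once, and a key outside the interest mask queues nothing, which mirrors the mask check and the single ready-list append described in the tables.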
Interactive: epoll Data Structure Explorer
Add and remove fds, trigger I/O events, and call epoll_wait(). Watch the RB-tree rebalance on insert and epitems move into the rdllist on I/O events.
2. § 34.2 — ep_poll_callback: The Key to O(1)
The fundamental difference between epoll and select/poll is when the kernel identifies ready fds. select/poll finds them at wakeup by scanning everything. epoll uses a callback installed at registration time — ready fds are delivered directly, without a scan.
Registration: epoll_ctl(ADD) installs the callback
| Step | What happens | Data structure |
|---|---|---|
| alloc epitem | Allocate an epitem from the epi_cache slab cache (kmem_cache_alloc). Fill ffd={file*,fd}, event={events,data}. | struct epitem on kernel heap |
| RB-tree insert | ep_rbtree_insert(): find position by {file*,fd} key, rb_link_node(), rb_insert_color(). | eventpoll.rbr — O(log n) |
| alloc eppoll_entry | Allocate an eppoll_entry from the pwq_cache slab cache. Set base=epitem, wait.func=ep_poll_callback. | struct eppoll_entry on heap |
| install in sk_wq | add_wait_queue(sock→sk_wq, &ee→wait). Now sk_data_ready() will call ep_poll_callback. | socket sk_wq.wait list |
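From userspace, all four kernel steps are triggered by a single epoll_ctl(EPOLL_CTL_ADD) call. The sketch below is a minimal illustration; `watch_readable` and `registration_demo` are hypothetical helper names, and a pipe stands in for a socket.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd for read readiness; cookie comes back in epoll_event.data. */
static int watch_readable(int epfd, int fd, void *cookie)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.ptr = cookie };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Returns how many of three ADD attempts succeed. */
static int registration_demo(void)
{
    int p[2], added = 0;
    int rc = pipe(p);
    assert(rc == 0);
    int epfd = epoll_create1(0);
    assert(epfd >= 0);

    if (watch_readable(epfd, p[0], "first") == 0)
        added++;
    /* Same {file*, fd} key again: the lookup finds it, so ADD fails. */
    if (watch_readable(epfd, p[0], "again") == 0)
        added++;
    else
        assert(errno == EEXIST);
    /* dup() yields a new fd number for the same open file: a distinct key. */
    int d = dup(p[0]);
    assert(d >= 0);
    if (watch_readable(epfd, d, "dup") == 0)
        added++;

    close(d);
    close(p[0]);
    close(p[1]);
    close(epfd);
    return added;
}
```

The duplicate ADD is rejected with EEXIST while the dup'd descriptor registers fine, which is the {file*, fd} key at work.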
ep_poll_callback() — called from softirq context
For network sockets this function runs in softirq/interrupt context, not in any process context. It must be fast and must never sleep.
| # | Action | Why |
|---|---|---|
| 1 | key & epi→event.events → match? | Check if the event type (EPOLLIN/EPOLLOUT/…) matches what this epitem is watching. If not, return immediately. |
| 2 | spin_lock_irqsave(&ep→lock) | Protect rdllist from concurrent callbacks (multiple sockets can fire simultaneously on different CPUs). |
| 3 | list_add_tail(&epi→rdllink, &ep→rdllist) | O(1) append. This is the only work done per ready fd. |
| 4 | spin_unlock_irqrestore(&ep→lock) | Release rdllist lock. |
| 5 | wake_up(&ep→wq) | Wake the process sleeping in epoll_wait(). Only epoll's own wait queue — not a thundering herd. |
epoll_wait() flow
| Condition | Action |
|---|---|
| rdllist non-empty on entry | Skip sleep. Call ep_send_events(): copy up to maxevents ready events to userspace epoll_event array. Return count. |
| rdllist empty | Add current process to ep→wq (TASK_INTERRUPTIBLE). Call schedule_hrtimeout() with timeout. Sleep. |
| Woken by ep_poll_callback | Return from schedule. Re-check rdllist. Call ep_send_events(). |
| LT re-queue | After ep_send_events(), if epitem is still ready (buffer still has data), re-add to rdllist so next epoll_wait sees it. |
Full flow: NIC interrupt → epoll_wait returns
Why O(1)? — epoll vs select wakeup comparison
The critical difference: select wakes all waiting processes and each re-scans all n fds. epoll fires the callback only for the ready fd, appends to rdllist in O(1), and wakes only epoll_wait — which then drains only the k ready entries.
- ep_poll_callback touches only 1 epitem per ready fd — O(1) per event
- epoll_wait drains only rdllist — O(k) where k = number of ready fds
- The RB-tree of n fds is never touched during wakeup
- If k << n (typical: 10 ready out of 10K), cost is effectively constant
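The effect is easy to observe from userspace. The following sketch registers many fds, makes a few of them ready, and checks what one epoll_wait() call reports. It assumes Linux with eventfd available; `ready_count` and `NFDS` are illustrative names, and the fd count is kept small to stay under common RLIMIT_NOFILE defaults.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define NFDS 200

/* Register NFDS eventfds, make k of them readable, and return what a
   single epoll_wait() call reports. */
static int ready_count(int k)
{
    int fds[NFDS];
    struct epoll_event out[NFDS];
    int epfd = epoll_create1(0);
    assert(epfd >= 0);
    assert(k >= 0 && k <= NFDS / 7);

    for (int i = 0; i < NFDS; i++) {
        fds[i] = eventfd(0, EFD_NONBLOCK);
        assert(fds[i] >= 0);
        struct epoll_event ev = { .events = EPOLLIN, .data.u32 = (uint32_t)i };
        int rc = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);
        assert(rc == 0);
    }
    for (int i = 0; i < k; i++) {
        uint64_t one = 1;
        ssize_t w = write(fds[i * 7], &one, sizeof one);  /* k scattered fds */
        assert(w == (ssize_t)sizeof one);
    }
    /* Timeout 0: report whatever is already on the ready list. */
    int n = epoll_wait(epfd, out, NFDS, 0);

    for (int i = 0; i < NFDS; i++)
        close(fds[i]);
    close(epfd);
    return n;
}
```

Calling `ready_count(3)` returns 3 regardless of how large NFDS is: only the fds whose callback fired appear in the output array.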
Interactive: ep_poll_callback Trace — 1000 fds, 3 ready
Step through the full path: NIC interrupt → softirq → ep_poll_callback fires on exactly 3 fds → rdllist gets 3 entries → epoll_wait returns 3. 997 fds are never touched.
3. § 34.3 — Level Trigger (LT) vs Edge Trigger (ET)
LT and ET describe when epoll reports a fd as ready. The difference is consequential: ET requires a mandatory drain loop, and getting it wrong silently loses data.
Level Trigger (LT) — default
| Property | LT behavior |
|---|---|
| Trigger condition | fd is reported as long as its buffer has data — regardless of when data arrived |
| Partial read safe? | YES. Read 50 of 100 bytes; next epoll_wait still returns the fd (50B remaining) |
| Kernel mechanism | After ep_send_events(), if epitem still ready (vfs_poll returns events), re-add to rdllist |
| Use case | Any read/write pattern. Compatible with blocking code that reads all available data once |
| Drawback | Extra epoll_wait wakeups if app consistently partial-reads. Extra rdllist re-insertion overhead |
Edge Trigger (ET) — EPOLLET flag
| Property | ET behavior |
|---|---|
| Trigger condition | fd is reported exactly once when a state transition occurs: buffer goes from empty → non-empty (new data arrives) |
| Partial read safe? | NO. Read 50 of 100 bytes → epoll will NOT notify again. 50B stuck until more data arrives. |
| Kernel mechanism | epitem removed from rdllist after reporting. Only re-added when ep_poll_callback fires again (new event) |
| Mandatory pattern | Loop read() until EAGAIN/EWOULDBLOCK. Never assume one read() drains all data. |
| Requirement | fd MUST be O_NONBLOCK. With ET and a blocking fd, the final read() of the drain loop blocks on the empty buffer and the thread hangs |
| Use case | High-performance servers (nginx). Fewer epoll_wait calls per fd when you always drain. |
ET drain loop — mandatory pattern
| Scenario | LT result | ET result |
|---|---|---|
| 100B arrives, you read 100B (drain) | Next wait: fd NOT reported (buffer empty) ✓ | Next wait: fd NOT reported ✓ — same |
| 100B arrives, you read 50B | Next wait: fd reported again (50B left) ✓ | Next wait: fd NOT reported ✗. The 50B stay in the buffer, unseen until new data arrives |
| 100B arrives, then 50B more before next wait | Reports ready once (150B total available) | Reports ready once per data arrival (may coalesce) |
| EPOLLOUT on write: send buffer not full | Reported every epoll_wait (wasteful!) — disable when not needed | Reported only when buffer transitions full→not-full |
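The partial-read row can be reproduced with a pipe. This is a minimal sketch, assuming Linux; `second_wait_after_partial_read` is a hypothetical name, and a pipe stands in for a socket.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Write 100 bytes to a pipe, read only 50, and return what the SECOND
   epoll_wait() reports. flags is 0 for LT or EPOLLET for ET. */
static int second_wait_after_partial_read(uint32_t flags)
{
    int p[2];
    char buf[100];
    struct epoll_event ev = { .events = EPOLLIN | flags }, out;

    int rc = pipe(p);
    assert(rc == 0);
    rc = fcntl(p[0], F_SETFL, O_NONBLOCK);
    assert(rc == 0);
    int epfd = epoll_create1(0);
    assert(epfd >= 0);
    ev.data.fd = p[0];
    rc = epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);
    assert(rc == 0);

    memset(buf, 'x', sizeof buf);
    ssize_t n = write(p[1], buf, sizeof buf);   /* 100 bytes arrive */
    assert(n == 100);
    int first = epoll_wait(epfd, &out, 1, 0);   /* both modes report this */
    assert(first == 1);
    n = read(p[0], buf, 50);                    /* partial read */
    assert(n == 50);
    int second = epoll_wait(epfd, &out, 1, 0);  /* LT re-reports, ET does not */

    close(p[0]);
    close(p[1]);
    close(epfd);
    return second;
}
```

With `flags = 0` the second wait reports the fd again; with `flags = EPOLLET` it reports nothing even though 50 bytes are still buffered.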
Interactive: LT vs ET Simulator
Switch between LT and ET. Send 100 bytes, read a partial amount with the slider, then call epoll_wait again — see exactly what each mode reports and what gets lost.
4. § 34.4 — EPOLLONESHOT and EPOLLEXCLUSIVE
Two flags that solve multi-threaded epoll problems: EPOLLONESHOT prevents duplicate fd delivery across threads; EPOLLEXCLUSIVE prevents thundering herd on a shared listen socket.
EPOLLONESHOT
After epoll_wait() returns an fd registered with EPOLLONESHOT, the kernel automatically disables that fd: it stays registered in the interest list (unlike EPOLL_CTL_DEL) but reports no further events. No other thread can get the same fd on a subsequent epoll_wait() until you explicitly re-arm it with EPOLL_CTL_MOD.
| Property | Detail |
|---|---|
| Problem solved | Multi-threaded server: two threads both wake from epoll_wait() for the same fd. Both call read() — race condition, corrupted state. |
| Mechanism | After ep_send_events() returns the fd, kernel clears EPOLLIN|EPOLLOUT from epi→event.events. fd effectively disabled. |
| Re-arm | epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) with ev.events = EPOLLONESHOT | EPOLLIN after processing. Adds cost of one extra syscall per event. |
| Alternative | Per-thread epoll fds + EPOLLEXCLUSIVE (avoids re-arm overhead for accept loops). |
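The disable and re-arm cycle can be checked with an eventfd. A minimal sketch, assuming Linux; `oneshot_demo` is a hypothetical name and everything runs on a single thread for clarity.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* r[0]: first wait, r[1]: wait after a new event while disabled,
   r[2]: wait after re-arming with EPOLL_CTL_MOD. */
static void oneshot_demo(int r[3])
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT }, out;
    uint64_t one = 1;
    int epfd = epoll_create1(0);
    int efd = eventfd(0, EFD_NONBLOCK);
    assert(epfd >= 0 && efd >= 0);
    ev.data.fd = efd;
    int rc = epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
    assert(rc == 0);

    ssize_t w = write(efd, &one, sizeof one);
    assert(w == (ssize_t)sizeof one);
    r[0] = epoll_wait(epfd, &out, 1, 0);  /* delivered once */

    w = write(efd, &one, sizeof one);     /* new event while disabled */
    assert(w == (ssize_t)sizeof one);
    r[1] = epoll_wait(epfd, &out, 1, 0);  /* nothing: the fd is disarmed */

    /* MOD works because the fd is still registered, only masked. */
    rc = epoll_ctl(epfd, EPOLL_CTL_MOD, efd, &ev);
    assert(rc == 0);
    r[2] = epoll_wait(epfd, &out, 1, 0);  /* pending readiness is reported */

    close(efd);
    close(epfd);
}
```

The MOD call succeeding is itself the evidence that the fd was disabled rather than deleted; after a real DEL it would fail with ENOENT.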
EPOLLEXCLUSIVE (Linux 4.5+)
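When multiple threads each have their own epoll fd, all watching the same listen socket, a new connection wakes all of them: the classic thundering herd. EPOLLEXCLUSIVE sets WQ_FLAG_EXCLUSIVE on the wait queue entry so that the wakeup stops after the first exclusive waiter instead of visiting every epoll fd. The man page only promises that one or more of the epoll fds receive the event, so treat a single wakeup as the common case rather than a guarantee.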
| Property | Detail |
|---|---|
| Problem solved | 4 threads × 4 epoll fds, all watching listen socket fd=3. New connection: all 4 epoll_wait() return. 3 threads get EAGAIN from accept(). |
| Mechanism | add_wait_queue_exclusive(): sets WQ_FLAG_EXCLUSIVE on the eppoll_entry. wake_up() stops after waking the first exclusive waiter. |
| Which thread wakes? | Unspecified (first in the wait queue list). Kernel may rotate over time but no guarantee of fairness. |
| vs SO_REUSEPORT | SO_REUSEPORT: each thread has its own listen socket; kernel load-balances at accept() level. EPOLLEXCLUSIVE: one shared listen socket, one epoll fd woken. SO_REUSEPORT is generally preferred for better load distribution. |
| Restriction | Cannot combine EPOLLEXCLUSIVE with EPOLLONESHOT. Cannot use with epoll_ctl(MOD). Only valid at ADD time. |
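The restrictions in the last row show up as EINVAL from epoll_ctl(). The sketch below assumes Linux 4.5+ and headers that define EPOLLEXCLUSIVE; `exclusive_rules` is a hypothetical name, and eventfds stand in for the shared listen socket.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* r[0]: return value of a plain EPOLLEXCLUSIVE ADD,
   r[1]: errno from EPOLL_CTL_MOD on that fd,
   r[2]: errno from ADD with EPOLLEXCLUSIVE | EPOLLONESHOT. */
static void exclusive_rules(int r[3])
{
    int epfd = epoll_create1(0);
    int a = eventfd(0, EFD_NONBLOCK);  /* stand-in for a listen socket */
    int b = eventfd(0, EFD_NONBLOCK);
    assert(epfd >= 0 && a >= 0 && b >= 0);

    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE, .data.fd = a };
    r[0] = epoll_ctl(epfd, EPOLL_CTL_ADD, a, &ev);  /* allowed */

    errno = 0;
    epoll_ctl(epfd, EPOLL_CTL_MOD, a, &ev);         /* MOD is rejected */
    r[1] = errno;

    ev.events = EPOLLIN | EPOLLEXCLUSIVE | EPOLLONESHOT;  /* bad combination */
    ev.data.fd = b;
    errno = 0;
    epoll_ctl(epfd, EPOLL_CTL_ADD, b, &ev);
    r[2] = errno;

    close(a);
    close(b);
    close(epfd);
}
```

Because MOD is rejected, changing the interest set of an exclusive registration means EPOLL_CTL_DEL followed by a fresh EPOLL_CTL_ADD.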
Comparison: EPOLLONESHOT vs EPOLLEXCLUSIVE vs SO_REUSEPORT
| Mechanism | Problem | Granularity | Extra cost |
|---|---|---|---|
| EPOLLONESHOT | One thread processes one fd event | Per-fd event | epoll_ctl(MOD) re-arm per event |
| EPOLLEXCLUSIVE | One epoll_wait woken per listen event | Per-epoll-fd | None after setup |
| SO_REUSEPORT | Kernel distributes connections across sockets | Per-connection | Extra listen socket per thread |
Interactive: Thundering Herd Demonstration
Toggle EPOLLEXCLUSIVE and send connections to see the difference. Without it, all 4 threads wake and 3 waste a context switch on EAGAIN.
5. § 34.5 — ET Drain Rule and Common Pitfalls
ET mode is faster than LT in steady state, but it has sharp edges. All of these pitfalls cause silent data loss or hangs — no error is returned.
The Drain Rule
read() until EAGAIN / EWOULDBLOCK. Every call. No exceptions. If you return early, data sits in the buffer silently: epoll will not notify you again until new data arrives (a new state transition).
Pitfall 1 — Read pitfall (most common)
| Wrong | Correct |
|---|---|
| // ET: reads only once n = read(fd, buf, sz); process(buf, n); // done. WRONG: may have more data | // ET: drain until EAGAIN for (;;) { n = read(fd, buf, sz); if (n > 0) { process(buf, n); continue; } if (n == 0) { close(fd); break; } /* EOF */ if (errno == EINTR) continue; if (errno != EAGAIN) close(fd); break; } |
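A slightly fuller version of the drain loop separates the three ways it can end: buffer drained, peer closed, or a real error. This is a sketch under assumptions: `drain_fd`, `set_nonblocking` and the `drain_status` enum are hypothetical names, and the byte counter stands in for whatever processing the application does.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

enum drain_status { DRAIN_AGAIN, DRAIN_EOF, DRAIN_ERROR };

/* ET requires O_NONBLOCK; preserve the other file status flags. */
static int set_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL, 0);
    return fl < 0 ? -1 : fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

/* Read until the kernel buffer is empty. Adds the bytes consumed to *total. */
static enum drain_status drain_fd(int fd, size_t *total)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            *total += (size_t)n;       /* process(buf, n) would go here */
            continue;
        }
        if (n == 0)
            return DRAIN_EOF;          /* peer closed: errno is NOT set */
        if (errno == EINTR)
            continue;                  /* interrupted: just retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return DRAIN_AGAIN;        /* drained: safe to epoll_wait again */
        return DRAIN_ERROR;
    }
}
```

Checking `n == 0` before looking at errno matters: read() does not set errno at end of file, so a stale value could otherwise be misread.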
Pitfall 2 — EPOLLOUT write readiness
EPOLLOUT in ET mode fires when the send buffer transitions from full → not full (state change), plus once at registration if the socket is already writable. In LT mode, if you leave EPOLLOUT registered while the buffer has room, it fires on every epoll_wait(), which turns the event loop into a busy loop. Pattern: enable EPOLLOUT only when a write returns EAGAIN (buffer full); disable it once writes succeed.
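The enable-on-EAGAIN pattern might look like the sketch below. It assumes Linux and a non-blocking AF_UNIX socketpair as the connection; `out_conn`, `set_interest`, `try_send` and `epollout_demo` are hypothetical names, and a real server would also queue the unsent bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

struct out_conn {        /* hypothetical per-connection state */
    int fd;
    int want_out;        /* 1 while EPOLLOUT is armed */
};

static int set_interest(int epfd, struct out_conn *c, uint32_t events)
{
    struct epoll_event ev = { .events = events, .data.ptr = c };
    return epoll_ctl(epfd, EPOLL_CTL_MOD, c->fd, &ev);
}

/* Try to write now. Arm EPOLLOUT only when the kernel buffer is full and
   disarm it as soon as a write goes through. Returns bytes written,
   0 if the caller must wait for EPOLLOUT, -1 on a real error. */
static ssize_t try_send(int epfd, struct out_conn *c, const void *buf, size_t len)
{
    ssize_t n = write(c->fd, buf, len);
    if (n >= 0) {
        if (c->want_out && set_interest(epfd, c, EPOLLIN) == 0)
            c->want_out = 0;
        return n;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        if (!c->want_out && set_interest(epfd, c, EPOLLIN | EPOLLOUT) == 0)
            c->want_out = 1;
        return 0;
    }
    return -1;
}

/* r[0]: events while the buffer is full, r[1]: after the peer drained,
   r[2]: after the next successful send disarmed EPOLLOUT. */
static void epollout_demo(int r[3])
{
    int sv[2];
    char chunk[4096] = {0}, sink[4096];
    struct epoll_event out;
    int rc = socketpair(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0, sv);
    assert(rc == 0);
    int epfd = epoll_create1(0);
    assert(epfd >= 0);
    struct out_conn c = { .fd = sv[0], .want_out = 0 };
    struct epoll_event ev = { .events = EPOLLIN, .data.ptr = &c };
    rc = epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);
    assert(rc == 0);

    while (try_send(epfd, &c, chunk, sizeof chunk) > 0)
        ;                                  /* stops at EAGAIN, arming EPOLLOUT */
    r[0] = epoll_wait(epfd, &out, 1, 0);   /* full: nothing to report */
    while (read(sv[1], sink, sizeof sink) > 0)
        ;                                  /* peer drains everything */
    r[1] = epoll_wait(epfd, &out, 1, 0);   /* writable again: EPOLLOUT */
    (void)try_send(epfd, &c, chunk, sizeof chunk);  /* succeeds, disarms */
    r[2] = epoll_wait(epfd, &out, 1, 0);   /* quiet again */

    close(sv[0]);
    close(sv[1]);
    close(epfd);
}
```

EPOLLOUT is only in the interest set between the EAGAIN and the next successful write, so an idle connection with a roomy send buffer never wakes the loop.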
Pitfall 3 — Accept loop on ET listen socket
| Scenario | Result |
|---|---|
| ET listen socket, 5 connections queued, accept() once | 4 connections remain in the backlog. No further EPOLLIN arrives until another connection comes in, so those 4 are never accepted unless a new client happens to connect. |
| ET listen socket, drain loop: while(accept()!=-1||errno==EINTR) | All 5 connections accepted. EAGAIN breaks the loop. Correct. |
| LT listen socket, accept() once | Safe: next epoll_wait still reports listen fd as long as backlog non-empty. |
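An accept drain loop for an ET listen socket can be sketched as follows. Assumptions: Linux, and an autobound AF_UNIX listener in the demo so that no TCP port is needed; `accept_all`, `close_conn` and `accept_demo` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Accept until the backlog is empty. lfd must be non-blocking.
   Returns the number of connections accepted, or -1 on a hard error. */
static int accept_all(int lfd, void (*on_conn)(int cfd))
{
    int count = 0;
    for (;;) {
        int cfd = accept(lfd, NULL, NULL);
        if (cfd >= 0) {
            on_conn(cfd);
            count++;
            continue;
        }
        if (errno == EINTR || errno == ECONNABORTED)
            continue;                       /* transient: keep draining */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return count;                   /* backlog empty */
        return -1;
    }
}

static void close_conn(int cfd) { close(cfd); }

/* Queue nclients connections, then drain them in one pass. */
static int accept_demo(int nclients)
{
    struct sockaddr_un addr;
    socklen_t len = sizeof addr;
    int clients[16];
    assert(nclients >= 0 && nclients <= 16);

    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    int lfd = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
    assert(lfd >= 0);
    /* Linux autobind: the kernel picks an abstract address. */
    int rc = bind(lfd, (struct sockaddr *)&addr, sizeof(sa_family_t));
    assert(rc == 0);
    rc = listen(lfd, 16);
    assert(rc == 0);
    rc = getsockname(lfd, (struct sockaddr *)&addr, &len);
    assert(rc == 0);

    for (int i = 0; i < nclients; i++) {
        clients[i] = socket(AF_UNIX, SOCK_STREAM, 0);
        assert(clients[i] >= 0);
        rc = connect(clients[i], (struct sockaddr *)&addr, len);
        assert(rc == 0);
    }
    int n = accept_all(lfd, close_conn);

    for (int i = 0; i < nclients; i++)
        close(clients[i]);
    close(lfd);
    return n;
}
```

In a server, `accept_all` would run each time epoll reports EPOLLIN on the listen fd, and `on_conn` would set O_NONBLOCK and register the new fd.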
Pitfall 4 — ET + blocking fd = hang
ET requires O_NONBLOCK. If you use a blocking fd with ET, the drain loop calls read() when the buffer is empty — the call blocks indefinitely instead of returning EAGAIN. The thread hangs. Always set fcntl(fd, F_SETFL, O_NONBLOCK) before registering with EPOLLET.
Pitfall 5 — Self-pipe trick (pre-eventfd)
Before eventfd() and timerfd(), waking epoll_wait() from another thread required a pipe: register pipefd[0] with EPOLLIN, write one byte to pipefd[1] to trigger the wakeup. Now use eventfd(0, EFD_NONBLOCK) instead — lighter and correct with both LT and ET.
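A minimal eventfd wakeup might look like this. It is a sketch assuming Linux; `loop_wake`, `loop_ack` and `wake_demo` are hypothetical names, and the wake calls are made from the same thread purely to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Called from any thread to wake the event loop. */
static int loop_wake(int efd)
{
    uint64_t one = 1;
    return write(efd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}

/* Called by the loop: one read() returns the count and resets it to 0. */
static uint64_t loop_ack(int efd)
{
    uint64_t n = 0;
    if (read(efd, &n, sizeof n) != (ssize_t)sizeof n)
        return 0;
    return n;
}

/* Returns the number of wakeups coalesced into one event. */
static uint64_t wake_demo(void)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET }, out;
    int epfd = epoll_create1(0);
    int efd = eventfd(0, EFD_NONBLOCK);
    assert(epfd >= 0 && efd >= 0);
    ev.data.fd = efd;
    int rc = epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
    assert(rc == 0);

    rc = loop_wake(efd);                    /* normally another thread */
    assert(rc == 0);
    rc = loop_wake(efd);
    assert(rc == 0);
    int n = epoll_wait(epfd, &out, 1, -1);  /* returns at once */
    assert(n == 1 && out.data.fd == efd);
    uint64_t wakes = loop_ack(efd);         /* a single read drains it */
    n = epoll_wait(epfd, &out, 1, 0);
    assert(n == 0);

    close(efd);
    close(epfd);
    return wakes;
}
```

Because a single 8-byte read resets the counter, the eventfd needs no drain loop, which is why it behaves the same under LT and ET.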
Interactive: ET Drain Pitfall — 3 data bursts
Step through 3 data bursts in WRONG mode (read once per event) vs CORRECT mode (drain loop). Watch the buffer state and see exactly how much data is lost.
6. § 34.6 — epoll vs kqueue vs IOCP vs io_uring
epoll is Linux-only and readiness-based (reactor). Other platforms and newer Linux APIs offer different trade-offs — from the general-purpose kqueue to the truly async io_uring.
epoll vs kqueue
| Feature | epoll | kqueue |
|---|---|---|
| Platform | Linux 2.6+ | BSD, macOS, iOS |
| Change + wait | Separate: epoll_ctl() then epoll_wait() | Single kevent() call for both |
| Event types | fds only (sockets, pipes, eventfd, timerfd, signalfd) | fds + processes + signals + timers + user events |
| Edge trigger | EPOLLET flag | EV_CLEAR flag (EV_EOF separate) |
| Batch changes | No (one epoll_ctl per fd) | Yes: changelist[] in kevent() |
| Process monitoring | No (needs waitpid) | EVFILT_PROC: note process exit/fork |
| Timer events | timerfd_create() workaround | EVFILT_TIMER built-in |
IOCP (Windows) — Proactor
IOCP is fundamentally different: the kernel completes I/O into your buffer, then posts a completion packet. You never call read() — the data is already there when your handler runs. This is the proactor pattern.
| Property | epoll (reactor) | IOCP (proactor) |
|---|---|---|
| Notification | 'fd is ready to read' | 'I/O is complete, data in your buffer' |
| Who reads data | Application calls read() | Kernel reads directly into app buffer |
| Phases per request | 2: wait + read | 1: completion callback |
| Buffer ownership | App allocates after notification | App allocates before submission |
| Concurrency model | Reactor: handle ready fds | Thread pool drains completion port |
io_uring (Linux 5.1+) — Async Ring
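io_uring uses two shared memory rings between kernel and userspace. The application fills submission entries and reaps completions by reading and writing the rings directly, with no syscall per operation; by default one io_uring_enter() call submits a whole batch. With SQPOLL, even submission needs no syscall while the kernel polling thread is awake, because that thread polls the SQ ring continuously.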
| Feature | Detail |
|---|---|
| Submission Queue (SQ) | mmap'd ring. App writes SQE (io_uring_sqe) at tail. Kernel reads from head. |
| Completion Queue (CQ) | mmap'd ring. Kernel writes CQE (io_uring_cqe) at tail. App reads from head. |
| Zero-syscall submit | io_uring_prep_recv() writes SQE directly — no syscall. Batch with io_uring_submit() (1 syscall per batch). |
| SQPOLL | io_uring_setup(IORING_SETUP_SQPOLL): kernel thread polls SQ. io_uring_enter() not needed for submit — 0 syscalls. |
| Fixed buffers | io_uring_register(IORING_REGISTER_BUFFERS): pins and maps user pages once, up front. Avoids the per-I/O cost of pinning and mapping buffer pages. |
| Registered fds | io_uring_register(IORING_REGISTER_FILES): kernel holds file references. Faster than fd lookup per SQE. |
| io_uring vs epoll | epoll: 3 syscalls/request (epoll_wait+read+write). io_uring SQPOLL: ~0 syscalls/request. |
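The head/tail discipline of the two rings can be illustrated with a toy single-producer, single-consumer ring. This is a model only, not the io_uring ABI: `toy_ring`, `ring_push` and `ring_pop` are hypothetical names, the slots hold plain integers instead of SQEs or CQEs, and the memory barriers a real shared ring needs are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define RING_ENTRIES 8u                  /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

struct toy_ring {
    uint32_t head;                       /* advanced by the consumer */
    uint32_t tail;                       /* advanced by the producer */
    uint64_t slots[RING_ENTRIES];        /* stands in for SQEs or CQEs */
};

/* Producer side (the app for the SQ, the kernel for the CQ). */
static int ring_push(struct toy_ring *r, uint64_t v)
{
    if (r->tail - r->head == RING_ENTRIES)
        return -1;                       /* full */
    r->slots[r->tail & RING_MASK] = v;
    r->tail++;   /* a real ring publishes tail with a release store */
    return 0;
}

/* Consumer side (the kernel for the SQ, the app for the CQ). */
static int ring_pop(struct toy_ring *r, uint64_t *v)
{
    if (r->head == r->tail)
        return -1;                       /* empty */
    *v = r->slots[r->head & RING_MASK];
    r->head++;   /* a real ring publishes head with a release store */
    return 0;
}
```

The indices grow without bound and are masked only on access, so `tail - head` always gives the fill level, even across 32-bit wraparound.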
Quick API Comparison
| API | OS | Model | Syscalls/req (steady) | Data in buf at callback? |
|---|---|---|---|---|
| select/poll | POSIX | Reactor | 3+ (select+read+write) | No — you call read() |
| epoll | Linux | Reactor | 3 (wait+read+write) | No — you call read() |
| kqueue | BSD/macOS | Reactor | 3 (kevent+read+write); one kevent() call can also carry changes | No — you call read() |
| IOCP | Windows | Proactor | 1 (GetQueuedCompletionStatus) | YES — kernel filled it |
| io_uring (normal) | Linux 5.1+ | Proactor | ~1-2 per batch | YES — kernel filled it |
| io_uring (SQPOLL) | Linux 5.1+ | Proactor | ~0 | YES — kernel filled it |
Demo 3509 — Syscall Count: epoll vs io_uring
Run the simulation to see exact syscall counts at various connection scales. The SQPOLL column drops to near-zero — that is the fundamental advantage of the shared ring model.
7. § 34.7 — Reactor vs Proactor Pattern
epoll implements the reactor pattern: the kernel tells you a fd is ready, then your code calls read(). io_uring and IOCP implement the proactor pattern: the kernel completes I/O into your buffer and notifies you when it's done — no read() needed.
Reactor Pattern (epoll)
Proactor Pattern (io_uring)
Side-by-side Loop Comparison
| Property | Reactor (epoll) | Proactor (io_uring / IOCP) |
|---|---|---|
| Notification type | fd is ready to read/write | I/O is complete, data in your buffer |
| Who calls read() | Application (2nd phase) | Kernel (performs the I/O asynchronously after submission) |
| Phases per request | 2: epoll_wait → read() | 1: wait for CQE |
| Buffer timing | Allocated after event fires | Pre-allocated before submission |
| Syscalls per request | 3 (epoll_wait + read + write) | 0–2 (SQPOLL: ~0) |
| Cancellation | epoll_ctl(DEL) | io_uring_prep_cancel(sqe) |
| Kernel support | Linux 2.6+ (universal) | Linux 5.1+ (5.19+ for stability) |
Reactor Event Loop — Step by Step
| Step | Call | What happens |
|---|---|---|
| 1 | epoll_ctl(ADD, fd, EPOLLIN) | Kernel inserts epitem into RB-tree; registers ep_poll_callback on fd's wait queue |
| 2 | epoll_wait(epfd, events, N, -1) | Thread sleeps in epoll's wait queue until rdllist non-empty |
| 3 | ep_poll_callback fires (softirq) | Moves epitem to rdllist; wakes epoll_wait thread |
| 4 | epoll_wait returns k events | Copies up to N epoll_events to userspace |
| 5 | Handler: read(fd, buf, len) | App drains socket buffer (ET: drain loop until EAGAIN) |
| 6 | Handler: write(fd, resp, n) | App sends response |
| 7a (LT) | Loop back to epoll_wait | fd stays armed — epoll re-reports if data remains |
| 7b (ONESHOT) | epoll_ctl(MOD, EPOLLIN|EPOLLONESHOT) | Re-arm: fd was auto-disabled after first event delivery |
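Steps 1 to 6 can be condensed into a small level-triggered echo iteration. This is a sketch, not a production loop: `reactor_once` and `echo_demo` are hypothetical names, a socketpair plays both client and server, and the write path assumes the send buffer has room (a real server would fall back to arming EPOLLOUT).

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* One reactor iteration: wait, then echo what each ready fd has.
   LT mode, so a single read per event is safe. Returns events handled. */
static int reactor_once(int epfd, int timeout_ms)
{
    struct epoll_event evs[16];
    int n = epoll_wait(epfd, evs, 16, timeout_ms);     /* steps 2 to 4 */
    for (int i = 0; i < n; i++) {
        int fd = evs[i].data.fd;
        char buf[4096];
        ssize_t got = read(fd, buf, sizeof buf);       /* step 5 */
        if (got > 0) {
            ssize_t off = 0;
            while (off < got) {                        /* step 6 */
                ssize_t w = write(fd, buf + off, (size_t)(got - off));
                if (w < 0) {
                    if (errno == EINTR)
                        continue;
                    break;           /* simplification: drop on error */
                }
                off += w;
            }
        } else if (got == 0 || (errno != EAGAIN && errno != EINTR)) {
            epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);  /* EOF or error */
            close(fd);
        }
    }
    return n;
}

/* Returns 1 if a request sent on one end comes back unchanged. */
static int echo_demo(void)
{
    int sv[2];
    char back[8] = {0};
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    int epfd = epoll_create1(0);
    assert(epfd >= 0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sv[0] };
    rc = epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);   /* step 1 */
    assert(rc == 0);

    ssize_t w = write(sv[1], "ping", 4);   /* the client sends a request */
    assert(w == 4);
    int handled = reactor_once(epfd, 1000);
    ssize_t got = read(sv[1], back, sizeof back);

    close(sv[0]);
    close(sv[1]);
    close(epfd);
    return handled == 1 && got == 4 && memcmp(back, "ping", 4) == 0;
}
```

A server would call `reactor_once` in a loop with timeout -1; step 7a is simply the next iteration, since the fd stays armed in LT mode.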
Demo 3510 — Reactor Event Loop Simulator
Add connections, ARM them into epoll, then run epoll_wait to watch the dispatch cycle. Toggle EPOLLONESHOT to see the re-arm step appear after each write.
8. § 34.8 — Echo Server Evolution
The echo server is the canonical benchmark for I/O multiplexing. Each stage removes a bottleneck from the previous one — from blocking single-client all the way to io_uring's zero-syscall proactor.
| Stage | Model | Max conns | Syscalls/req | fd scan | Key limitation |
|---|---|---|---|---|---|
| blocking | sequential | 1 client | 3 (accept+read+write) | none (one fd at a time) | Second client hangs until first disconnects |
| poll | readiness scan, O(n) | ~hundreds | 4+ (poll+accept/read+write) | YES — O(n) | poll() scans all fds even if 1 is ready — degrades at scale |
| epoll-lt | reactor O(k) | ~10k–100k | 3 (epoll_wait+read+write) | no — O(k) | LT: fd stays in rdllist if data left — safe but extra wakeups |
| epoll-et | reactor O(k) | ~100k+ | 3 + drain loop | no — O(k) | Must drain fully to EAGAIN; missing data if loop exits early |
| io-uring | proactor | 1M+ | ~0 (SQPOLL) | no — O(k) | Linux 5.1+ required; complex setup; SQPOLL needs root or CAP_SYS_NICE on older kernels |
Stage Explorer
```c
/* Stage 1: blocking single-client echo server */
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Helper assumed by this listing: socket() + bind() + listen(),
   returning the listening fd. Its definition is not shown here. */
int tcp_listen(int port);

int main(void) {
    int lfd = tcp_listen(8080);
    for (;;) {
        /* blocks here: only ONE client at a time */
        int cfd = accept(lfd, NULL, NULL);
        if (cfd < 0)
            continue;
        char buf[4096];
        ssize_t n;
        /* blocks in read(): server frozen for all others */
        while ((n = read(cfd, buf, sizeof buf)) > 0)
            if (write(cfd, buf, (size_t)n) < 0)
                break;
        close(cfd);
        /* next accept(): previous client must be gone */
    }
}
/* syscalls per request: accept(1) + read(1) + write(1) = 3
   concurrency: 1 (second client waits in kernel accept queue) */
```
| Stage | Key design decision | What it unlocks |
|---|---|---|
| blocking | accept() + read() in sequence | Simple; works for 1 client |
| poll() | Single call multiplexes N fds | Multiple clients; O(n) cost |
| epoll LT | Kernel tracks readiness in RB-tree + rdllist | O(k) wakeup; scales to 100k |
| epoll ET | Only notify on state change; drain loop | Fewer wakeups; lower CPU at high load |
| io_uring SQPOLL | Kernel completes I/O; shared mmap rings | ~0 syscalls/req; 1M+ connections |
Demo 3511 — Echo Server Evolution: Syscall & Capacity Comparison
Run to see the quantitative breakdown: syscalls per second at C10K scale and the O(n) cost of poll() vs O(k) of epoll at 10,000 connections.
Interview Prep — Must-Know Questions
| Question | Key answer |
|---|---|
| What is ep_poll_callback? When is it called? | Installed as wait_queue_entry_t.func in the socket's sk_wq. Called from softirq/interrupt context when sk_data_ready() fires. Adds epitem to eventpoll.rdllist (O(1)) and wakes epoll_wait. |
| Why is epoll_wait O(k) not O(n)? | ep_poll_callback adds only ready epitems to rdllist. epoll_wait drains rdllist — touches exactly k ready fds, never the other n−k. The RB-tree is never scanned on wakeup. |
| LT vs ET: what happens if you read only 50 of 100 bytes? | LT: fd is re-added to rdllist — epoll_wait reports it again next call. ET: no re-add — 50 bytes sit silently in buffer until new data arrives. Mandatory: drain loop until EAGAIN. |
| ET drain loop: why mandatory? | ET fires once per state change. If you don't read to EAGAIN, remaining data never triggers another event. Loop: while((n=read(fd,buf,sz))>0){…} then if(n==0){peer closed} else if(errno!=EAGAIN){error} |
| What problem does EPOLLONESHOT solve? | Prevents two threads from processing the same fd simultaneously. After epoll_wait returns the fd, it's auto-disabled. Re-arm with epoll_ctl(MOD, fd, EPOLLONESHOT|…) after processing. |
| EPOLLEXCLUSIVE vs SO_REUSEPORT? | EPOLLEXCLUSIVE: one epoll_fd woken per event on shared fd (all threads share one listen socket). SO_REUSEPORT: each thread has its own listen socket; kernel load-balances accept() at the socket level. |
| epoll (reactor) vs io_uring (proactor): syscalls per request? | epoll: 3 syscalls (epoll_wait + read + write). io_uring normal: ~2 per batch. io_uring SQPOLL: 0 syscalls — kernel thread polls SQ ring. Data is in user buffer on completion, no second read needed. |
| Draw eventpoll → epitem → eppoll_entry → sk_wq chain. | eventpoll.rbr → epitem.rbn (RB-tree key={file*,fd}) + epitem.pwqlist → eppoll_entry.wait (func=ep_poll_callback) → installed in socket sk_wq. eventpoll.rdllist → epitem.rdllink (ready list). eventpoll.wq → sleeping epoll_wait callers. |
| When would you use kqueue over epoll? | On BSD/macOS; or when you need EVFILT_PROC (process exit), EVFILT_SIGNAL, EVFILT_TIMER, EVFILT_USER in a single multiplexed API. epoll only handles file descriptors. |