Part XXV — I/O Multiplexing

§ 34 — epoll Deep Dive: Data Structures · ep_poll_callback · LT/ET · EPOLLONESHOT · io_uring · Reactor/Proactor

eventpoll → rbr (RB-tree) → epitem → eppoll_entry → sk_wq chain (§34.1) · ep_poll_callback: why epoll_wait is O(k) not O(n) (§34.2) · Level Trigger vs Edge Trigger with drain loop (§34.3) · EPOLLONESHOT and EPOLLEXCLUSIVE thundering-herd fix (§34.4) · ET pitfalls: write readiness, accept loop, self-pipe (§34.5) · epoll vs kqueue vs IOCP vs io_uring syscall comparison (§34.6) · Reactor and Proactor patterns (§34.7) · Blocking → poll → epoll LT → epoll ET → io_uring echo server evolution (§34.8)

1. § 34.1 — epoll Data Structures

Three kernel structs form the backbone of every epoll instance. Understanding them is the prerequisite for every epoll interview question.

struct eventpoll — the top-level container

Created by epoll_create() and backed by an anonymous inode. The returned epfd is a regular file descriptor pointing to this struct.

Field	Type	Purpose
rbr	rb_root_cached	Root of the RB-tree holding all registered epitem structs, keyed by {file *, fd}. O(log n) insert/lookup/delete via epoll_ctl().
rdllist	list_head	Doubly-linked list of ready epitems (those with pending I/O events). ep_poll_callback appends here; epoll_wait drains here.
lock	spinlock_t	Protects rdllist from concurrent ep_poll_callback calls (softirq context) and epoll_wait draining.
sem	rw_semaphore	Protects the RB-tree during epoll_ctl() ADD/MOD/DEL operations.
wq	wait_queue_head_t	Wait queue for processes sleeping in epoll_wait(). ep_poll_callback calls wake_up(&ep→wq) to wake them.
poll_wait	wait_queue_head_t	Nested epoll: when an epoll fd is itself registered in another epoll. Used by ep_eventpoll_poll().

struct epitem — one per registered fd

Created by epoll_ctl(EPOLL_CTL_ADD). Lives in the RB-tree and optionally in the rdllist when its fd is ready.

Field	Purpose
rbn	rb_node — embeds epitem into eventpoll.rbr. Key = {file , fd} so the same fd number but different file (dup'd fds) can coexist.
rdllink	list_head — node for inclusion in eventpoll.rdllist when this fd is ready. Added by ep_poll_callback(); removed by ep_send_events().
ep	Back-pointer to parent eventpoll. Used by ep_poll_callback to find rdllist and wq.
ffd	{file *, fd} pair identifying the monitored fd. Used as RB-tree key.
event	struct epoll_event {__u32 events; __u64 data}. events = EPOLLIN\|EPOLLOUT\|… mask; data = user-supplied cookie returned on wakeup.
pwqlist	List of eppoll_entry structs. One entry per wait queue head the fd has (sockets usually have one: sk_wq).

struct eppoll_entry — the wait queue hook

The bridge between the monitored socket and the epoll instance. Its wait.func = ep_poll_callback is what fires when the socket becomes ready.

Field	Purpose
whead	wait_queue_head_t * — pointer to the socket's sk_wq.wait head where this entry is registered. Needed to remove the entry on epoll_ctl(DEL).
base	epitem * — back-pointer to the owning epitem. ep_poll_callback uses this to reach the eventpoll and rdllist.
wait	wait_queue_entry_t with .func = ep_poll_callback. This is the actual entry inserted into the socket's sk_wq. When sk_data_ready fires wake_up(), this func is called.

Full Data Structure Chain

The complete ownership chain from eventpoll down to the socket wait queue. This is what makes epoll O(1) — the callback is installed at registration time, not at wait time.

Interactive: epoll Data Structure Explorer

Add and remove fds, trigger I/O events, and call epoll_wait(). Watch the RB-tree rebalance on insert and epitems move into the rdllist on I/O events.

Demo 3504 — epoll Data Structure Explorer

eventpoll.rbr — RB-Tree (sorted by fd)

fd=7

BLACK · depth=0

fd=3

RED · depth=1

fd=12

RED · depth=1

rdllist (ready)

empty

Actions

Trigger I/O event:

DEL:

Kernel log

epoll_create1(EPOLL_CLOEXEC) → epfd=4

epoll_ctl(ADD, fd=3, EPOLLIN) → insert into rbr

epoll_ctl(ADD, fd=7, EPOLLIN) → insert into rbr

epoll_ctl(ADD, fd=12, EPOLLIN) → insert into rbr

2. § 34.2 — ep_poll_callback: The Key to O(1)

The fundamental difference between epoll and select/poll is when the kernel identifies ready fds. select/poll finds them at wakeup by scanning everything. epoll uses a callback installed at registration time — ready fds are delivered directly, without a scan.

Registration: epoll_ctl(ADD) installs the callback

Step	What happens	Data structure
alloc epitem	kmalloc(sizeof(epitem)). Fill ffd={file*,fd}, event={events,data}.	struct epitem on kernel heap
RB-tree insert	ep_rbtree_insert(): find position by {file*,fd} key, rb_link_node(), rb_insert_color().	eventpoll.rbr — O(log n)
alloc eppoll_entry	kmalloc(sizeof(eppoll_entry)). Set base=epitem, wait.func=ep_poll_callback.	struct eppoll_entry on heap
install in sk_wq	add_wait_queue(sock→sk_wq, &ee→wait). Now sk_data_ready() will call ep_poll_callback.	socket sk_wq.wait list

ep_poll_callback() — called from softirq context

This function runs in softirq/interrupt context — not in any process context. It must be fast and non-blocking.

#	Action	Why
1	key & epi→event.events → match?	Check if the event type (EPOLLIN/EPOLLOUT/…) matches what this epitem is watching. If not, return immediately.
2	spin_lock_irqsave(&ep→lock)	Protect rdllist from concurrent callbacks (multiple sockets can fire simultaneously on different CPUs).
3	list_add_tail(&epi→rdllink, &ep→rdllist)	O(1) append. This is the only work done per ready fd.
4	spin_unlock_irqrestore(&ep→lock)	Release rdllist lock.
5	wake_up(&ep→wq)	Wake the process sleeping in epoll_wait(). Only epoll's own wait queue — not a thundering herd.

epoll_wait() flow

Condition	Action
rdllist non-empty on entry	Skip sleep. Call ep_send_events(): copy up to maxevents ready events to userspace epoll_event array. Return count.
rdllist empty	Add current process to ep→wq (TASK_INTERRUPTIBLE). Call schedule_hrtimeout() with timeout. Sleep.
Woken by ep_poll_callback	Return from schedule. Re-check rdllist. Call ep_send_events().
LT re-queue	After ep_send_events(), if epitem is still ready (buffer still has data), re-add to rdllist so next epoll_wait sees it.

Full flow: NIC interrupt → epoll_wait returns

Why O(1)? — epoll vs select wakeup comparison

The critical difference: select wakes all waiting processes and each re-scans all n fds. epoll fires the callback only for the ready fd, appends to rdllist in O(1), and wakes only epoll_wait — which then drains only the k ready entries.

Why O(k) amortized:

ep_poll_callback touches only 1 epitem per ready fd — O(1) per event
epoll_wait drains only rdllist — O(k) where k = number of ready fds
The RB-tree of n fds is never touched during wakeup
If k << n (typical: 10 ready out of 10K), cost is effectively constant

Interactive: ep_poll_callback Trace — 1000 fds, 3 ready

Step through the full path: NIC interrupt → softirq → ep_poll_callback fires on exactly 3 fds → rdllist gets 3 entries → epoll_wait returns 3. 997 fds are never touched.

① Idle

② NIC interrupt

③ SoftIRQ

④ ep_poll_callback

⑤ rdllist has 3

⑥ epoll_wait returns

① Idle

1000 fds registered. epoll_wait() sleeping on ep→wq. All fds quiet.

1000 registered fds (fd=42 fd=317 fd=891 are ready)

idle NIC interrupt softirq ep_poll_callback fired

rdllist:empty

3. § 34.3 — Level Trigger (LT) vs Edge Trigger (ET)

LT and ET describe when epoll reports a fd as ready. The difference is consequential: ET requires a mandatory drain loop, and getting it wrong silently loses data.

Level Trigger (LT) — default

Property	LT behavior
Trigger condition	fd is reported as long as its buffer has data — regardless of when data arrived
Partial read safe?	YES. Read 50 of 100 bytes; next epoll_wait still returns the fd (50B remaining)
Kernel mechanism	After ep_send_events(), if epitem still ready (vfs_poll returns events), re-add to rdllist
Use case	Any read/write pattern. Compatible with blocking code that reads all available data once
Drawback	Extra epoll_wait wakeups if app consistently partial-reads. Extra rdllist re-insertion overhead

Edge Trigger (ET) — EPOLLET flag

Property	ET behavior
Trigger condition	fd is reported exactly once when a state transition occurs: buffer goes from empty → non-empty (new data arrives)
Partial read safe?	NO. Read 50 of 100 bytes → epoll will NOT notify again. 50B stuck until more data arrives.
Kernel mechanism	epitem removed from rdllist after reporting. Only re-added when ep_poll_callback fires again (new event)
Mandatory pattern	Loop read() until EAGAIN/EWOULDBLOCK. Never assume one read() drains all data.
Requirement	fd MUST be O_NONBLOCK. ET + blocking fd = deadlock when buffer is partially read
Use case	High-performance servers (nginx, Redis). Fewer epoll_wait calls per fd when you always drain.

ET drain loop — mandatory pattern

/* ET mode: must drain until EAGAIN */
while (1) {
ssize_t n = read(fd, buf, sizeof(buf));
if (n > 0) { process(buf, n); continue; }
if (n == 0) { close(fd); break; }
if (errno == EAGAIN || errno == EWOULDBLOCK) break;
/* real error */ break;
}

Scenario	LT result	ET result
100B arrives, you read 100B (drain)	Next wait: fd NOT reported (buffer empty) ✓	Next wait: fd NOT reported ✓ — same
100B arrives, you read 50B	Next wait: fd reported again (50B left) ✓	Next wait: fd NOT reported ✗ — 50B LOST
100B arrives, then 50B more before next wait	Reports ready once (150B total available)	Reports ready once per data arrival (may coalesce)
EPOLLOUT on write: send buffer not full	Reported every epoll_wait (wasteful!) — disable when not needed	Reported only when buffer transitions full→not-full

Interactive: LT vs ET Simulator

Switch between LT and ET. Send 100 bytes, read a partial amount with the slider, then call epoll_wait again — see exactly what each mode reports and what gets lost.

Socket Receive Buffer

0B unread / 0B total

rdllist: empty

Read:50B

Kernel log

Ready. Choose LT or ET, then send data.

LT guarantee:

Safe to partial-read. If data remains, fd will be reported again on next epoll_wait(). Works with any read pattern.

4. § 34.4 — EPOLLONESHOT and EPOLLEXCLUSIVE

Two flags that solve multi-threaded epoll problems: EPOLLONESHOT prevents duplicate fd delivery across threads; EPOLLEXCLUSIVE prevents thundering herd on a shared listen socket.

EPOLLONESHOT

After epoll_wait() returns an fd registered with EPOLLONESHOT, the kernel automatically disables that fd — as if EPOLL_CTL_DEL was called. No other thread can get the same fd on a subsequent epoll_wait() until you explicitly re-arm it.

Property	Detail
Problem solved	Multi-threaded server: two threads both wake from epoll_wait() for the same fd. Both call read() — race condition, corrupted state.
Mechanism	After ep_send_events() returns the fd, kernel clears EPOLLIN\|EPOLLOUT from epi→event.events. fd effectively disabled.
Re-arm	epoll_ctl(EPOLL_CTL_MOD, fd, EPOLLONESHOT \| EPOLLIN) after processing. Adds cost of one extra syscall per event.
Alternative	Per-thread epoll fds + EPOLLEXCLUSIVE (avoids re-arm overhead for accept loops).

EPOLLEXCLUSIVE (Linux 4.5+)

When multiple threads each have their own epoll fd, all watching the same listen socket, a new connection wakes all of them — the classic thundering herd. EPOLLEXCLUSIVE uses WQ_FLAG_EXCLUSIVE in the wait queue entry so that only one epoll fd's epoll_wait() is woken per event.

Property	Detail
Problem solved	4 threads × 4 epoll fds, all watching listen socket fd=3. New connection: all 4 epoll_wait() return. 3 threads get EAGAIN from accept().
Mechanism	add_wait_queue_exclusive(): sets WQ_FLAG_EXCLUSIVE on the eppoll_entry. wake_up() stops after waking the first exclusive waiter.
Which thread wakes?	Unspecified (first in the wait queue list). Kernel may rotate over time but no guarantee of fairness.
vs SO_REUSEPORT	SO_REUSEPORT: each thread has its own listen socket; kernel load-balances at accept() level. EPOLLEXCLUSIVE: one shared listen socket, one epoll fd woken. SO_REUSEPORT is generally preferred for better load distribution.
Restriction	Cannot combine EPOLLEXCLUSIVE with EPOLLONESHOT. Cannot use with epoll_ctl(MOD). Only valid at ADD time.

Comparison: EPOLLONESHOT vs EPOLLEXCLUSIVE vs SO_REUSEPORT

Mechanism	Problem	Granularity	Extra cost
EPOLLONESHOT	One thread processes one fd event	Per-fd event	epoll_ctl(MOD) re-arm per event
EPOLLEXCLUSIVE	One epoll_wait woken per listen event	Per-epoll-fd	None after setup
SO_REUSEPORT	Kernel distributes connections across sockets	Per-connection	Extra listen socket per thread

Interactive: Thundering Herd Demonstration

Toggle EPOLLEXCLUSIVE and send connections to see the difference. Without it, all 4 threads wake and 3 waste a context switch on EAGAIN.

Worker threads — all blocking on same listen socket fd=3

Thread 1

epoll_wait() — sleeping

0 syscalls

Thread 2

epoll_wait() — sleeping

0 syscalls

Thread 3

epoll_wait() — sleeping

0 syscalls

Thread 4

epoll_wait() — sleeping

0 syscalls

Stats

✓ accepted: 0

✗ EAGAIN (wasted): 0

log

4 threads each blocking in epoll_wait() on the listen socket.

5. § 34.5 — ET Drain Rule and Common Pitfalls

ET mode is faster than LT in steady state, but it has sharp edges. All of these pitfalls cause silent data loss or hangs — no error is returned.

The Drain Rule

Rule:With ET, always loop read() until EAGAIN / EWOULDBLOCK. Every call. No exceptions. If you return early, data sits in the buffer silently — epoll will not notify you again until new data arrives (a new state transition).

Pitfall 1 — Read pitfall (most common)

Wrong	Correct
// ET — reads only once n = read(fd, buf, sz); process(buf, n); // done — WRONG: may have more data	// ET — drain until EAGAIN while ((n = read(fd, buf, sz)) > 0) process(buf, n); if (errno != EAGAIN) close(fd);

Pitfall 2 — EPOLLOUT write readiness

EPOLLOUT in ET mode fires when the send buffer transitions from full → not full (state change). If you register EPOLLOUT and the buffer is never full, it fires on everyepoll_wait() — a busy loop. Pattern: enable EPOLLOUT only when a write returns EAGAIN (buffer full); disable it once writes succeed.

Pitfall 3 — Accept loop on ET listen socket

Scenario	Result
ET listen socket, 5 connections queued, accept() once	4 connections remain in backlog. No more EPOLLIN until the 6th connection arrives. accept() those 4 — never!
ET listen socket, drain loop: while(accept()!=-1\|\|errno==EINTR)	All 5 connections accepted. EAGAIN breaks the loop. Correct.
LT listen socket, accept() once	Safe: next epoll_wait still reports listen fd as long as backlog non-empty.

Pitfall 4 — ET + blocking fd = hang

ET requires O_NONBLOCK. If you use a blocking fd with ET, the drain loop calls read() when the buffer is empty — the call blocks indefinitely instead of returning EAGAIN. The thread hangs. Always set fcntl(fd, F_SETFL, O_NONBLOCK) before registering with EPOLLET.

Pitfall 5 — Self-pipe trick (pre-eventfd)

Before eventfd() and timerfd(), waking epoll_wait() from another thread required a pipe: register pipefd[0] with EPOLLIN, write one byte to pipefd[1] to trigger the wakeup. Now use eventfd(0, EFD_NONBLOCK) instead — lighter and correct with both LT and ET.

Interactive: ET Drain Pitfall — 3 data bursts

Step through 3 data bursts in WRONG mode (read once per event) vs CORRECT mode (drain loop). Watch the buffer state and see exactly how much data is lost.

Burst 1 arrives (400B)

read() once — wrong!

Burst 2 arrives (300B)

read() once — wrong!

Burst 3 arrives (500B)

read() once — wrong!

Result: 944B lost

Socket receive buffer — 3 bursts

Burst 1 (400B)400B lost

Burst 2 (300B)300B lost

Burst 3 (500B)500B lost

processed: 0B / 1200B

LOST in buffer: 1200B

log

Choose mode and step through.

6. § 34.6 — epoll vs kqueue vs IOCP vs io_uring

epoll is Linux-only and readiness-based (reactor). Other platforms and newer Linux APIs offer different trade-offs — from the general-purpose kqueue to the truly async io_uring.

epoll vs kqueue

Feature	epoll	kqueue
Platform	Linux 2.6+	BSD, macOS, iOS
Change + wait	Separate: epoll_ctl() then epoll_wait()	Single kevent() call for both
Event types	fds only (sockets, pipes, eventfd, timerfd, signalfd)	fds + processes + signals + timers + user events
Edge trigger	EPOLLET flag	EV_CLEAR flag (EV_EOF separate)
Batch changes	No (one epoll_ctl per fd)	Yes: changelist[] in kevent()
Process monitoring	No (needs waitpid)	EVFILT_PROC: note process exit/fork
Timer events	timerfd_create() workaround	EVFILT_TIMER built-in

IOCP (Windows) — Proactor

IOCP is fundamentally different: the kernel completes I/O into your buffer, then posts a completion packet. You never call read() — the data is already there when your handler runs. This is the proactor pattern.

Property	epoll (reactor)	IOCP (proactor)
Notification	'fd is ready to read'	'I/O is complete, data in your buffer'
Who reads data	Application calls read()	Kernel reads directly into app buffer
Phases per request	2: wait + read	1: completion callback
Buffer ownership	App allocates after notification	App allocates before submission
Concurrency model	Reactor: handle ready fds	Thread pool drains completion port

io_uring (Linux 5.1+) — Async Ring

io_uring uses two shared memory rings between kernel and userspace. Submit I/O without a syscall; collect completions without a syscall. With SQPOLL, even submission is zero-syscall — a kernel thread polls the SQ ring continuously.

Feature	Detail
Submission Queue (SQ)	mmap'd ring. App writes SQE (io_uring_sqe) at tail. Kernel reads from head.
Completion Queue (CQ)	mmap'd ring. Kernel writes CQE (io_uring_cqe) at tail. App reads from head.
Zero-syscall submit	io_uring_prep_recv() writes SQE directly — no syscall. Batch with io_uring_submit() (1 syscall per batch).
SQPOLL	io_uring_setup(IORING_SETUP_SQPOLL): kernel thread polls SQ. io_uring_enter() not needed for submit — 0 syscalls.
Fixed buffers	io_uring_register(IORING_REGISTER_BUFFERS): pre-pin user pages. Avoids page-fault on each I/O.
Registered fds	io_uring_register(IORING_REGISTER_FILES): kernel holds file references. Faster than fd lookup per SQE.
io_uring vs epoll	epoll: 3 syscalls/request (epoll_wait+read+write). io_uring SQPOLL: ~0 syscalls/request.

Quick API Comparison

API	OS	Model	Syscalls/req (steady)	Data in buf at callback?
select/poll	POSIX	Reactor	3+ (select+read+write)	No — you call read()
epoll	Linux	Reactor	3 (wait+read+write)	No — you call read()
kqueue	BSD/macOS	Reactor	2 (kevent+read+write) or 3	No — you call read()
IOCP	Windows	Proactor	1 (GetQueuedCS)	YES — kernel filled it
io_uring (normal)	Linux 5.1+	Proactor	~1-2 per batch	YES — kernel filled it
io_uring (SQPOLL)	Linux 5.1+	Proactor	~0	YES — kernel filled it

Demo 3509 — Syscall Count: epoll vs io_uring

Run the simulation to see exact syscall counts at various connection scales. The SQPOLL column drops to near-zero — that is the fundamental advantage of the shared ring model.

epoll vs io_uring — syscall count comparison at N=1000 conn, R=100 req — C Demo

stdin (optional)

7. § 34.7 — Reactor vs Proactor Pattern

epoll implements the reactor pattern: the kernel tells you a fd is ready, then your code calls read(). io_uring and IOCP implement the proactor pattern: the kernel completes I/O into your buffer and notifies you when it's done — no read() needed.

Reactor Pattern (epoll)

Proactor Pattern (io_uring)

Side-by-side Loop Comparison

Property	Reactor (epoll)	Proactor (io_uring / IOCP)
Notification type	fd is ready to read/write	I/O is complete, data in your buffer
Who calls read()	Application (2nd phase)	Kernel (during submission)
Phases per request	2: epoll_wait → read()	1: wait for CQE
Buffer timing	Allocated after event fires	Pre-allocated before submission
Syscalls per request	3 (epoll_wait + read + write)	0–2 (SQPOLL: ~0)
Cancellation	epoll_ctl(DEL)	io_uring_prep_cancel(sqe)
Kernel support	Linux 2.6+ (universal)	Linux 5.1+ (5.19+ for stability)

Reactor Event Loop — Step by Step

Step	Call	What happens
1	epoll_ctl(ADD, fd, EPOLLIN)	Kernel inserts epitem into RB-tree; registers ep_poll_callback on fd's wait queue
2	epoll_wait(epfd, events, N, -1)	Thread sleeps in epoll's wait queue until rdllist non-empty
3	ep_poll_callback fires (softirq)	Moves epitem to rdllist; wakes epoll_wait thread
4	epoll_wait returns k events	Copies up to N epoll_events to userspace
5	Handler: read(fd, buf, len)	App drains socket buffer (ET: drain loop until EAGAIN)
6	Handler: write(fd, resp, n)	App sends response
7a (LT)	Loop back to epoll_wait	fd stays armed — epoll re-reports if data remains
7b (ONESHOT)	epoll_ctl(MOD, EPOLLIN\|EPOLLONESHOT)	Re-arm: fd was auto-disabled after first event delivery

Demo 3510 — Reactor Event Loop Simulator

Add connections, ARM them into epoll, then run epoll_wait to watch the dispatch cycle. Toggle EPOLLONESHOT to see the re-arm step appear after each write.

EPOLLONESHOTLoop #0
fd=10
IDLE
fd=11
IDLE
fd=12
IDLE
Event loop ready. ARM connections then run epoll_wait.
IN epollread()write()re-arm

8. § 34.8 — Echo Server Evolution

The echo server is the canonical benchmark for I/O multiplexing. Each stage removes a bottleneck from the previous one — from blocking single-client all the way to io_uring's zero-syscall proactor.

Stage	Model	Max conns	Syscalls/req	fd scan	Key limitation
blocking	sequential/O(n)	1 client	3 (accept+read+write)	no — O(k)	Second client hangs until first disconnects
poll	sequential/O(n)	~hundreds	4+ (poll+accept/read+write)	YES — O(n)	poll() scans all fds even if 1 is ready — degrades at scale
epoll-lt	reactor O(k)	~10k–100k	3 (epoll_wait+read+write)	no — O(k)	LT: fd stays in rdllist if data left — safe but extra wakeups
epoll-et	reactor O(k)	~100k+	3 + drain loop	no — O(k)	Must drain fully to EAGAIN; missing data if loop exits early
io-uring	proactor	1M+	~0 (SQPOLL)	no — O(k)	Linux 5.1+ required; complex setup; SQPOLL needs root or CAP_SYS_NICE

Stage Explorer

Click a stage in the table above or use the tabs below to explore each implementation.

Stage 1: Blocking Single-Client

Capacity

1 client

Syscalls/req

3 (accept+read+write)

CPU

100% blocked in read()

/* Stage 1: blocking single-client echo server */
int main(void) {
    int lfd = tcp_listen(8080);

    for (;;) {
        /* blocks here — only ONE client at a time */
        int cfd = accept(lfd, NULL, NULL);

        char buf[4096];
        ssize_t n;
        /* blocks in read() — server frozen for all others */
        while ((n = read(cfd, buf, sizeof buf)) > 0)
            write(cfd, buf, n);

        close(cfd);
        /* next accept() — previous client must be gone */
    }
}
/* syscalls per request: accept(1) + read(1) + write(1) = 3
   concurrency: 1 (second client waits in kernel accept queue) */

Limitation:Second client hangs until first disconnects

Stage	Key design decision	What it unlocks
blocking	accept() + read() in sequence	Simple; works for 1 client
poll()	Single call multiplexes N fds	Multiple clients; O(n) cost
epoll LT	Kernel tracks readiness in RB-tree + rdllist	O(k) wakeup; scales to 100k
epoll ET	Only notify on state change; drain loop	Fewer wakeups; lower CPU at high load
io_uring SQPOLL	Kernel completes I/O; shared mmap rings	~0 syscalls/req; 1M+ connections

Demo 3511 — Echo Server Evolution: Syscall & Capacity Comparison

Run to see the quantitative breakdown: syscalls per second at C10K scale and the O(n) cost of poll() vs O(k) of epoll at 10,000 connections.

Echo server evolution — syscall count and capacity comparison — C Demo

stdin (optional)

Interview Prep — Must-Know Questions

Question	Key answer
What is ep_poll_callback? When is it called?	Installed as wait_queue_entry_t.func in the socket's sk_wq. Called from softirq/interrupt context when sk_data_ready() fires. Adds epitem to eventpoll.rdllist (O(1)) and wakes epoll_wait.
Why is epoll_wait O(k) not O(n)?	ep_poll_callback adds only ready epitems to rdllist. epoll_wait drains rdllist — touches exactly k ready fds, never the other n−k. The RB-tree is never scanned on wakeup.
LT vs ET: what happens if you read only 50 of 100 bytes?	LT: fd is re-added to rdllist — epoll_wait reports it again next call. ET: no re-add — 50 bytes sit silently in buffer until new data arrives. Mandatory: drain loop until EAGAIN.
ET drain loop — why mandatory?	ET fires once per state change. If you don't read to EAGAIN, remaining data never triggers another event. Loop: while((n=read(fd,buf,sz))>0){…} if(errno!=EAGAIN){error}
What problem does EPOLLONESHOT solve?	Prevents two threads from processing the same fd simultaneously. After epoll_wait returns the fd, it's auto-disabled. Re-arm with epoll_ctl(MOD, fd, EPOLLONESHOT\|…) after processing.
EPOLLEXCLUSIVE vs SO_REUSEPORT?	EPOLLEXCLUSIVE: one epoll_fd woken per event on shared fd (all threads share one listen socket). SO_REUSEPORT: each thread has its own listen socket; kernel load-balances accept() at the socket level.
epoll (reactor) vs io_uring (proactor): syscalls per request?	epoll: 3 syscalls (epoll_wait + read + write). io_uring normal: ~2 per batch. io_uring SQPOLL: 0 syscalls — kernel thread polls SQ ring. Data is in user buffer on completion, no second read needed.
Draw eventpoll → epitem → eppoll_entry → sk_wq chain.	eventpoll.rbr → epitem.rbn (RB-tree key={file*,fd}) + epitem.pwqlist → eppoll_entry.wait (func=ep_poll_callback) → installed in socket sk_wq. eventpoll.rdllist → epitem.rdllink (ready list). eventpoll.wq → sleeping epoll_wait callers.
When would you use kqueue over epoll?	On BSD/macOS; or when you need EVFILT_PROC (process exit), EVFILT_SIGNAL, EVFILT_TIMER, EVFILT_USER in a single multiplexed API. epoll only handles file descriptors.