Part XXV — I/O Multiplexing

§ 34 — epoll Deep Dive: Data Structures · ep_poll_callback · LT/ET · EPOLLONESHOT · io_uring · Reactor/Proactor

eventpoll → rbr (RB-tree) → epitem → eppoll_entry → sk_wq chain (§34.1) · ep_poll_callback: why epoll_wait is O(k) not O(n) (§34.2) · Level Trigger vs Edge Trigger with drain loop (§34.3) · EPOLLONESHOT and EPOLLEXCLUSIVE thundering-herd fix (§34.4) · ET pitfalls: write readiness, accept loop, self-pipe (§34.5) · epoll vs kqueue vs IOCP vs io_uring syscall comparison (§34.6) · Reactor and Proactor patterns (§34.7) · Blocking → poll → epoll LT → epoll ET → io_uring echo server evolution (§34.8)

1. § 34.1 — epoll Data Structures

Three kernel structs form the backbone of every epoll instance. Understanding them is the prerequisite for every epoll interview question.

struct eventpoll — the top-level container

Created by epoll_create() and backed by an anonymous inode. The returned epfd is a regular file descriptor pointing to this struct.

FieldTypePurpose
rbrrb_root_cachedRoot of the RB-tree holding all registered epitem structs, keyed by {file *, fd}. O(log n) insert/lookup/delete via epoll_ctl().
rdllistlist_headDoubly-linked list of ready epitems (those with pending I/O events). ep_poll_callback appends here; epoll_wait drains here.
lockspinlock_tProtects rdllist from concurrent ep_poll_callback calls (softirq context) and epoll_wait draining.
semrw_semaphoreProtects the RB-tree during epoll_ctl() ADD/MOD/DEL operations.
wqwait_queue_head_tWait queue for processes sleeping in epoll_wait(). ep_poll_callback calls wake_up(&ep→wq) to wake them.
poll_waitwait_queue_head_tNested epoll: when an epoll fd is itself registered in another epoll. Used by ep_eventpoll_poll().

struct epitem — one per registered fd

Created by epoll_ctl(EPOLL_CTL_ADD). Lives in the RB-tree and optionally in the rdllist when its fd is ready.

FieldPurpose
rbnrb_node — embeds epitem into eventpoll.rbr. Key = {file *, fd} so the same fd number but different file* (dup'd fds) can coexist.
rdllinklist_head — node for inclusion in eventpoll.rdllist when this fd is ready. Added by ep_poll_callback(); removed by ep_send_events().
epBack-pointer to parent eventpoll. Used by ep_poll_callback to find rdllist and wq.
ffd{file *, fd} pair identifying the monitored fd. Used as RB-tree key.
eventstruct epoll_event {__u32 events; __u64 data}. events = EPOLLIN|EPOLLOUT|… mask; data = user-supplied cookie returned on wakeup.
pwqlistList of eppoll_entry structs. One entry per wait queue head the fd has (sockets usually have one: sk_wq).

struct eppoll_entry — the wait queue hook

The bridge between the monitored socket and the epoll instance. Its wait.func = ep_poll_callback is what fires when the socket becomes ready.

FieldPurpose
wheadwait_queue_head_t * — pointer to the socket's sk_wq.wait head where this entry is registered. Needed to remove the entry on epoll_ctl(DEL).
baseepitem * — back-pointer to the owning epitem. ep_poll_callback uses this to reach the eventpoll and rdllist.
waitwait_queue_entry_t with .func = ep_poll_callback. This is the actual entry inserted into the socket's sk_wq. When sk_data_ready fires wake_up(), this func is called.

Full Data Structure Chain

The complete ownership chain from eventpoll down to the socket wait queue. This is what makes epoll O(1) — the callback is installed at registration time, not at wait time.

Interactive: epoll Data Structure Explorer

Add and remove fds, trigger I/O events, and call epoll_wait(). Watch the RB-tree rebalance on insert and epitems move into the rdllist on I/O events.

Demo 3504 — epoll Data Structure Explorer
eventpoll.rbr — RB-Tree (sorted by fd)
fd=7
BLACK · depth=0
fd=3
RED · depth=1
fd=12
RED · depth=1
rdllist (ready)
empty
Actions
Trigger I/O event:
DEL:
Kernel log
epoll_create1(EPOLL_CLOEXEC) → epfd=4
epoll_ctl(ADD, fd=3, EPOLLIN) → insert into rbr
epoll_ctl(ADD, fd=7, EPOLLIN) → insert into rbr
epoll_ctl(ADD, fd=12, EPOLLIN) → insert into rbr

2. § 34.2 — ep_poll_callback: The Key to O(1)

The fundamental difference between epoll and select/poll is when the kernel identifies ready fds. select/poll finds them at wakeup by scanning everything. epoll uses a callback installed at registration time — ready fds are delivered directly, without a scan.

Registration: epoll_ctl(ADD) installs the callback

StepWhat happensData structure
alloc epitemkmalloc(sizeof(epitem)). Fill ffd={file*,fd}, event={events,data}.struct epitem on kernel heap
RB-tree insertep_rbtree_insert(): find position by {file*,fd} key, rb_link_node(), rb_insert_color().eventpoll.rbr — O(log n)
alloc eppoll_entrykmalloc(sizeof(eppoll_entry)). Set base=epitem, wait.func=ep_poll_callback.struct eppoll_entry on heap
install in sk_wqadd_wait_queue(sock→sk_wq, &ee→wait). Now sk_data_ready() will call ep_poll_callback.socket sk_wq.wait list

ep_poll_callback() — called from softirq context

This function runs in softirq/interrupt context — not in any process context. It must be fast and non-blocking.

#ActionWhy
1key & epi→event.events → match?Check if the event type (EPOLLIN/EPOLLOUT/…) matches what this epitem is watching. If not, return immediately.
2spin_lock_irqsave(&ep→lock)Protect rdllist from concurrent callbacks (multiple sockets can fire simultaneously on different CPUs).
3list_add_tail(&epi→rdllink, &ep→rdllist)O(1) append. This is the only work done per ready fd.
4spin_unlock_irqrestore(&ep→lock)Release rdllist lock.
5wake_up(&ep→wq)Wake the process sleeping in epoll_wait(). Only epoll's own wait queue — not a thundering herd.

epoll_wait() flow

ConditionAction
rdllist non-empty on entrySkip sleep. Call ep_send_events(): copy up to maxevents ready events to userspace epoll_event array. Return count.
rdllist emptyAdd current process to ep→wq (TASK_INTERRUPTIBLE). Call schedule_hrtimeout() with timeout. Sleep.
Woken by ep_poll_callbackReturn from schedule. Re-check rdllist. Call ep_send_events().
LT re-queueAfter ep_send_events(), if epitem is still ready (buffer still has data), re-add to rdllist so next epoll_wait sees it.

Full flow: NIC interrupt → epoll_wait returns

Why O(1)? — epoll vs select wakeup comparison

The critical difference: select wakes all waiting processes and each re-scans all n fds. epoll fires the callback only for the ready fd, appends to rdllist in O(1), and wakes only epoll_wait — which then drains only the k ready entries.

Why O(k) amortized:
  • ep_poll_callback touches only 1 epitem per ready fd — O(1) per event
  • epoll_wait drains only rdllist — O(k) where k = number of ready fds
  • The RB-tree of n fds is never touched during wakeup
  • If k << n (typical: 10 ready out of 10K), cost is effectively constant

Interactive: ep_poll_callback Trace — 1000 fds, 3 ready

Step through the full path: NIC interrupt → softirq → ep_poll_callback fires on exactly 3 fds → rdllist gets 3 entries → epoll_wait returns 3. 997 fds are never touched.

① Idle
② NIC interrupt
③ SoftIRQ
④ ep_poll_callback
⑤ rdllist has 3
⑥ epoll_wait returns
① Idle
1000 fds registered. epoll_wait() sleeping on ep→wq. All fds quiet.
1000 registered fds (fd=42 fd=317 fd=891 are ready)
idle NIC interrupt softirq ep_poll_callback fired
rdllist:empty

3. § 34.3 — Level Trigger (LT) vs Edge Trigger (ET)

LT and ET describe when epoll reports a fd as ready. The difference is consequential: ET requires a mandatory drain loop, and getting it wrong silently loses data.

Level Trigger (LT) — default

PropertyLT behavior
Trigger conditionfd is reported as long as its buffer has data — regardless of when data arrived
Partial read safe?YES. Read 50 of 100 bytes; next epoll_wait still returns the fd (50B remaining)
Kernel mechanismAfter ep_send_events(), if epitem still ready (vfs_poll returns events), re-add to rdllist
Use caseAny read/write pattern. Compatible with blocking code that reads all available data once
DrawbackExtra epoll_wait wakeups if app consistently partial-reads. Extra rdllist re-insertion overhead

Edge Trigger (ET) — EPOLLET flag

PropertyET behavior
Trigger conditionfd is reported exactly once when a state transition occurs: buffer goes from empty → non-empty (new data arrives)
Partial read safe?NO. Read 50 of 100 bytes → epoll will NOT notify again. 50B stuck until more data arrives.
Kernel mechanismepitem removed from rdllist after reporting. Only re-added when ep_poll_callback fires again (new event)
Mandatory patternLoop read() until EAGAIN/EWOULDBLOCK. Never assume one read() drains all data.
Requirementfd MUST be O_NONBLOCK. ET + blocking fd = deadlock when buffer is partially read
Use caseHigh-performance servers (nginx, Redis). Fewer epoll_wait calls per fd when you always drain.

ET drain loop — mandatory pattern

/* ET mode: must drain until EAGAIN */
while (1) {
ssize_t n = read(fd, buf, sizeof(buf));
if (n > 0) { process(buf, n); continue; }
if (n == 0) { close(fd); break; }
if (errno == EAGAIN || errno == EWOULDBLOCK) break;
/* real error */ break;
}
ScenarioLT resultET result
100B arrives, you read 100B (drain)Next wait: fd NOT reported (buffer empty) ✓Next wait: fd NOT reported ✓ — same
100B arrives, you read 50BNext wait: fd reported again (50B left) ✓Next wait: fd NOT reported ✗ — 50B LOST
100B arrives, then 50B more before next waitReports ready once (150B total available)Reports ready once per data arrival (may coalesce)
EPOLLOUT on write: send buffer not fullReported every epoll_wait (wasteful!) — disable when not neededReported only when buffer transitions full→not-full

Interactive: LT vs ET Simulator

Switch between LT and ET. Send 100 bytes, read a partial amount with the slider, then call epoll_wait again — see exactly what each mode reports and what gets lost.

Socket Receive Buffer
0B unread / 0B total
rdllist: empty
50B
Kernel log
Ready. Choose LT or ET, then send data.
LT guarantee:
Safe to partial-read. If data remains, fd will be reported again on next epoll_wait(). Works with any read pattern.

4. § 34.4 — EPOLLONESHOT and EPOLLEXCLUSIVE

Two flags that solve multi-threaded epoll problems: EPOLLONESHOT prevents duplicate fd delivery across threads; EPOLLEXCLUSIVE prevents thundering herd on a shared listen socket.

EPOLLONESHOT

After epoll_wait() returns an fd registered with EPOLLONESHOT, the kernel automatically disables that fd — as if EPOLL_CTL_DEL was called. No other thread can get the same fd on a subsequent epoll_wait() until you explicitly re-arm it.

PropertyDetail
Problem solvedMulti-threaded server: two threads both wake from epoll_wait() for the same fd. Both call read() — race condition, corrupted state.
MechanismAfter ep_send_events() returns the fd, kernel clears EPOLLIN|EPOLLOUT from epi→event.events. fd effectively disabled.
Re-armepoll_ctl(EPOLL_CTL_MOD, fd, EPOLLONESHOT | EPOLLIN) after processing. Adds cost of one extra syscall per event.
AlternativePer-thread epoll fds + EPOLLEXCLUSIVE (avoids re-arm overhead for accept loops).

EPOLLEXCLUSIVE (Linux 4.5+)

When multiple threads each have their own epoll fd, all watching the same listen socket, a new connection wakes all of them — the classic thundering herd. EPOLLEXCLUSIVE uses WQ_FLAG_EXCLUSIVE in the wait queue entry so that only one epoll fd's epoll_wait() is woken per event.

PropertyDetail
Problem solved4 threads × 4 epoll fds, all watching listen socket fd=3. New connection: all 4 epoll_wait() return. 3 threads get EAGAIN from accept().
Mechanismadd_wait_queue_exclusive(): sets WQ_FLAG_EXCLUSIVE on the eppoll_entry. wake_up() stops after waking the first exclusive waiter.
Which thread wakes?Unspecified (first in the wait queue list). Kernel may rotate over time but no guarantee of fairness.
vs SO_REUSEPORTSO_REUSEPORT: each thread has its own listen socket; kernel load-balances at accept() level. EPOLLEXCLUSIVE: one shared listen socket, one epoll fd woken. SO_REUSEPORT is generally preferred for better load distribution.
RestrictionCannot combine EPOLLEXCLUSIVE with EPOLLONESHOT. Cannot use with epoll_ctl(MOD). Only valid at ADD time.

Comparison: EPOLLONESHOT vs EPOLLEXCLUSIVE vs SO_REUSEPORT

MechanismProblemGranularityExtra cost
EPOLLONESHOTOne thread processes one fd eventPer-fd eventepoll_ctl(MOD) re-arm per event
EPOLLEXCLUSIVEOne epoll_wait woken per listen eventPer-epoll-fdNone after setup
SO_REUSEPORTKernel distributes connections across socketsPer-connectionExtra listen socket per thread

Interactive: Thundering Herd Demonstration

Toggle EPOLLEXCLUSIVE and send connections to see the difference. Without it, all 4 threads wake and 3 waste a context switch on EAGAIN.

Worker threads — all blocking on same listen socket fd=3
Thread 1
epoll_wait() — sleeping
0 syscalls
Thread 2
epoll_wait() — sleeping
0 syscalls
Thread 3
epoll_wait() — sleeping
0 syscalls
Thread 4
epoll_wait() — sleeping
0 syscalls
Stats
✓ accepted: 0
✗ EAGAIN (wasted): 0
log
4 threads each blocking in epoll_wait() on the listen socket.

5. § 34.5 — ET Drain Rule and Common Pitfalls

ET mode is faster than LT in steady state, but it has sharp edges. All of these pitfalls cause silent data loss or hangs — no error is returned.

The Drain Rule

Rule:With ET, always loop read() until EAGAIN / EWOULDBLOCK. Every call. No exceptions. If you return early, data sits in the buffer silently — epoll will not notify you again until new data arrives (a new state transition).

Pitfall 1 — Read pitfall (most common)

WrongCorrect
// ET — reads only once n = read(fd, buf, sz); process(buf, n); // done — WRONG: may have more data// ET — drain until EAGAIN while ((n = read(fd, buf, sz)) > 0) process(buf, n); if (errno != EAGAIN) close(fd);

Pitfall 2 — EPOLLOUT write readiness

EPOLLOUT in ET mode fires when the send buffer transitions from full → not full (state change). If you register EPOLLOUT and the buffer is never full, it fires on everyepoll_wait() — a busy loop. Pattern: enable EPOLLOUT only when a write returns EAGAIN (buffer full); disable it once writes succeed.

Pitfall 3 — Accept loop on ET listen socket

ScenarioResult
ET listen socket, 5 connections queued, accept() once4 connections remain in backlog. No more EPOLLIN until the 6th connection arrives. accept() those 4 — never!
ET listen socket, drain loop: while(accept()!=-1||errno==EINTR)All 5 connections accepted. EAGAIN breaks the loop. Correct.
LT listen socket, accept() onceSafe: next epoll_wait still reports listen fd as long as backlog non-empty.

Pitfall 4 — ET + blocking fd = hang

ET requires O_NONBLOCK. If you use a blocking fd with ET, the drain loop calls read() when the buffer is empty — the call blocks indefinitely instead of returning EAGAIN. The thread hangs. Always set fcntl(fd, F_SETFL, O_NONBLOCK) before registering with EPOLLET.

Pitfall 5 — Self-pipe trick (pre-eventfd)

Before eventfd() and timerfd(), waking epoll_wait() from another thread required a pipe: register pipefd[0] with EPOLLIN, write one byte to pipefd[1] to trigger the wakeup. Now use eventfd(0, EFD_NONBLOCK) instead — lighter and correct with both LT and ET.

Interactive: ET Drain Pitfall — 3 data bursts

Step through 3 data bursts in WRONG mode (read once per event) vs CORRECT mode (drain loop). Watch the buffer state and see exactly how much data is lost.

Burst 1 arrives (400B)
read() once — wrong!
Burst 2 arrives (300B)
read() once — wrong!
Burst 3 arrives (500B)
read() once — wrong!
Result: 944B lost
Socket receive buffer — 3 bursts
Burst 1 (400B)400B lost
Burst 2 (300B)300B lost
Burst 3 (500B)500B lost
processed: 0B / 1200B
LOST in buffer: 1200B
log
Choose mode and step through.

6. § 34.6 — epoll vs kqueue vs IOCP vs io_uring

epoll is Linux-only and readiness-based (reactor). Other platforms and newer Linux APIs offer different trade-offs — from the general-purpose kqueue to the truly async io_uring.

epoll vs kqueue

Featureepollkqueue
PlatformLinux 2.6+BSD, macOS, iOS
Change + waitSeparate: epoll_ctl() then epoll_wait()Single kevent() call for both
Event typesfds only (sockets, pipes, eventfd, timerfd, signalfd)fds + processes + signals + timers + user events
Edge triggerEPOLLET flagEV_CLEAR flag (EV_EOF separate)
Batch changesNo (one epoll_ctl per fd)Yes: changelist[] in kevent()
Process monitoringNo (needs waitpid)EVFILT_PROC: note process exit/fork
Timer eventstimerfd_create() workaroundEVFILT_TIMER built-in

IOCP (Windows) — Proactor

IOCP is fundamentally different: the kernel completes I/O into your buffer, then posts a completion packet. You never call read() — the data is already there when your handler runs. This is the proactor pattern.

Propertyepoll (reactor)IOCP (proactor)
Notification'fd is ready to read''I/O is complete, data in your buffer'
Who reads dataApplication calls read()Kernel reads directly into app buffer
Phases per request2: wait + read1: completion callback
Buffer ownershipApp allocates after notificationApp allocates before submission
Concurrency modelReactor: handle ready fdsThread pool drains completion port

io_uring (Linux 5.1+) — Async Ring

io_uring uses two shared memory rings between kernel and userspace. Submit I/O without a syscall; collect completions without a syscall. With SQPOLL, even submission is zero-syscall — a kernel thread polls the SQ ring continuously.

FeatureDetail
Submission Queue (SQ)mmap'd ring. App writes SQE (io_uring_sqe) at tail. Kernel reads from head.
Completion Queue (CQ)mmap'd ring. Kernel writes CQE (io_uring_cqe) at tail. App reads from head.
Zero-syscall submitio_uring_prep_recv() writes SQE directly — no syscall. Batch with io_uring_submit() (1 syscall per batch).
SQPOLLio_uring_setup(IORING_SETUP_SQPOLL): kernel thread polls SQ. io_uring_enter() not needed for submit — 0 syscalls.
Fixed buffersio_uring_register(IORING_REGISTER_BUFFERS): pre-pin user pages. Avoids page-fault on each I/O.
Registered fdsio_uring_register(IORING_REGISTER_FILES): kernel holds file references. Faster than fd lookup per SQE.
io_uring vs epollepoll: 3 syscalls/request (epoll_wait+read+write). io_uring SQPOLL: ~0 syscalls/request.

Quick API Comparison

APIOSModelSyscalls/req (steady)Data in buf at callback?
select/pollPOSIXReactor3+ (select+read+write)No — you call read()
epollLinuxReactor3 (wait+read+write)No — you call read()
kqueueBSD/macOSReactor2 (kevent+read+write) or 3No — you call read()
IOCPWindowsProactor1 (GetQueuedCS)YES — kernel filled it
io_uring (normal)Linux 5.1+Proactor~1-2 per batchYES — kernel filled it
io_uring (SQPOLL)Linux 5.1+Proactor~0YES — kernel filled it

Demo 3509 — Syscall Count: epoll vs io_uring

Run the simulation to see exact syscall counts at various connection scales. The SQPOLL column drops to near-zero — that is the fundamental advantage of the shared ring model.

epoll vs io_uring — syscall count comparison at N=1000 conn, R=100 req — C Demo
stdin (optional)

7. § 34.7 — Reactor vs Proactor Pattern

epoll implements the reactor pattern: the kernel tells you a fd is ready, then your code calls read(). io_uring and IOCP implement the proactor pattern: the kernel completes I/O into your buffer and notifies you when it's done — no read() needed.

Reactor Pattern (epoll)

Proactor Pattern (io_uring)

Side-by-side Loop Comparison

PropertyReactor (epoll)Proactor (io_uring / IOCP)
Notification typefd is ready to read/writeI/O is complete, data in your buffer
Who calls read()Application (2nd phase)Kernel (during submission)
Phases per request2: epoll_wait → read()1: wait for CQE
Buffer timingAllocated after event firesPre-allocated before submission
Syscalls per request3 (epoll_wait + read + write)0–2 (SQPOLL: ~0)
Cancellationepoll_ctl(DEL)io_uring_prep_cancel(sqe)
Kernel supportLinux 2.6+ (universal)Linux 5.1+ (5.19+ for stability)

Reactor Event Loop — Step by Step

StepCallWhat happens
1epoll_ctl(ADD, fd, EPOLLIN)Kernel inserts epitem into RB-tree; registers ep_poll_callback on fd's wait queue
2epoll_wait(epfd, events, N, -1)Thread sleeps in epoll's wait queue until rdllist non-empty
3ep_poll_callback fires (softirq)Moves epitem to rdllist; wakes epoll_wait thread
4epoll_wait returns k eventsCopies up to N epoll_events to userspace
5Handler: read(fd, buf, len)App drains socket buffer (ET: drain loop until EAGAIN)
6Handler: write(fd, resp, n)App sends response
7a (LT)Loop back to epoll_waitfd stays armed — epoll re-reports if data remains
7b (ONESHOT)epoll_ctl(MOD, EPOLLIN|EPOLLONESHOT)Re-arm: fd was auto-disabled after first event delivery

Demo 3510 — Reactor Event Loop Simulator

Add connections, ARM them into epoll, then run epoll_wait to watch the dispatch cycle. Toggle EPOLLONESHOT to see the re-arm step appear after each write.

Loop #0
fd=10
IDLE
fd=11
IDLE
fd=12
IDLE
Event loop ready. ARM connections then run epoll_wait.
IN epollread()write()re-arm

8. § 34.8 — Echo Server Evolution

The echo server is the canonical benchmark for I/O multiplexing. Each stage removes a bottleneck from the previous one — from blocking single-client all the way to io_uring's zero-syscall proactor.

StageModelMax connsSyscalls/reqfd scanKey limitation
blockingsequential/O(n)1 client3 (accept+read+write)no — O(k)Second client hangs until first disconnects
pollsequential/O(n)~hundreds4+ (poll+accept/read+write)YES — O(n)poll() scans all fds even if 1 is ready — degrades at scale
epoll-ltreactor O(k)~10k–100k3 (epoll_wait+read+write)no — O(k)LT: fd stays in rdllist if data left — safe but extra wakeups
epoll-etreactor O(k)~100k+3 + drain loopno — O(k)Must drain fully to EAGAIN; missing data if loop exits early
io-uringproactor1M+~0 (SQPOLL)no — O(k)Linux 5.1+ required; complex setup; SQPOLL needs root or CAP_SYS_NICE

Stage Explorer

Click a stage in the table above or use the tabs below to explore each implementation.

Stage 1: Blocking Single-Client
Capacity
1 client
Syscalls/req
3 (accept+read+write)
CPU
100% blocked in read()
/* Stage 1: blocking single-client echo server */
int main(void) {
    int lfd = tcp_listen(8080);

    for (;;) {
        /* blocks here — only ONE client at a time */
        int cfd = accept(lfd, NULL, NULL);

        char buf[4096];
        ssize_t n;
        /* blocks in read() — server frozen for all others */
        while ((n = read(cfd, buf, sizeof buf)) > 0)
            write(cfd, buf, n);

        close(cfd);
        /* next accept() — previous client must be gone */
    }
}
/* syscalls per request: accept(1) + read(1) + write(1) = 3
   concurrency: 1 (second client waits in kernel accept queue) */
Limitation:Second client hangs until first disconnects
StageKey design decisionWhat it unlocks
blockingaccept() + read() in sequenceSimple; works for 1 client
poll()Single call multiplexes N fdsMultiple clients; O(n) cost
epoll LTKernel tracks readiness in RB-tree + rdllistO(k) wakeup; scales to 100k
epoll ETOnly notify on state change; drain loopFewer wakeups; lower CPU at high load
io_uring SQPOLLKernel completes I/O; shared mmap rings~0 syscalls/req; 1M+ connections

Demo 3511 — Echo Server Evolution: Syscall & Capacity Comparison

Run to see the quantitative breakdown: syscalls per second at C10K scale and the O(n) cost of poll() vs O(k) of epoll at 10,000 connections.

Echo server evolution — syscall count and capacity comparison — C Demo
stdin (optional)

Interview Prep — Must-Know Questions

QuestionKey answer
What is ep_poll_callback? When is it called?Installed as wait_queue_entry_t.func in the socket's sk_wq. Called from softirq/interrupt context when sk_data_ready() fires. Adds epitem to eventpoll.rdllist (O(1)) and wakes epoll_wait.
Why is epoll_wait O(k) not O(n)?ep_poll_callback adds only ready epitems to rdllist. epoll_wait drains rdllist — touches exactly k ready fds, never the other n−k. The RB-tree is never scanned on wakeup.
LT vs ET: what happens if you read only 50 of 100 bytes?LT: fd is re-added to rdllist — epoll_wait reports it again next call. ET: no re-add — 50 bytes sit silently in buffer until new data arrives. Mandatory: drain loop until EAGAIN.
ET drain loop — why mandatory?ET fires once per state change. If you don't read to EAGAIN, remaining data never triggers another event. Loop: while((n=read(fd,buf,sz))>0){…} if(errno!=EAGAIN){error}
What problem does EPOLLONESHOT solve?Prevents two threads from processing the same fd simultaneously. After epoll_wait returns the fd, it's auto-disabled. Re-arm with epoll_ctl(MOD, fd, EPOLLONESHOT|…) after processing.
EPOLLEXCLUSIVE vs SO_REUSEPORT?EPOLLEXCLUSIVE: one epoll_fd woken per event on shared fd (all threads share one listen socket). SO_REUSEPORT: each thread has its own listen socket; kernel load-balances accept() at the socket level.
epoll (reactor) vs io_uring (proactor): syscalls per request?epoll: 3 syscalls (epoll_wait + read + write). io_uring normal: ~2 per batch. io_uring SQPOLL: 0 syscalls — kernel thread polls SQ ring. Data is in user buffer on completion, no second read needed.
Draw eventpoll → epitem → eppoll_entry → sk_wq chain.eventpoll.rbr → epitem.rbn (RB-tree key={file*,fd}) + epitem.pwqlist → eppoll_entry.wait (func=ep_poll_callback) → installed in socket sk_wq. eventpoll.rdllist → epitem.rdllink (ready list). eventpoll.wq → sleeping epoll_wait callers.
When would you use kqueue over epoll?On BSD/macOS; or when you need EVFILT_PROC (process exit), EVFILT_SIGNAL, EVFILT_TIMER, EVFILT_USER in a single multiplexed API. epoll only handles file descriptors.