§ 35 — libev Internals: Loop Phases · Watcher Types · 4-Heap Timer · ANFD Map · Fork Safety · Threading
Why libev bridges the gap between raw epoll and a real event loop (§35.1) · ev_loop struct and ANFD fd map (§35.2) · six loop phases in ev_run() (§35.3) · ev_io lifecycle through epoll_ctl and dispatch (§35.4) · 4-heap timer algorithm and cache efficiency (§35.5) · ev_timer vs ev_periodic clock semantics (§35.6) · signal delivery via self-pipe (§35.7) · all watcher types and their syscalls (§35.8) · priority queue dispatch (§35.9) · fork safety and ev_loop_fork() (§35.10) · threading model and ev_async (§35.11) · ANFD merger reduces epoll_ctl calls (§35.12) · worked echo server O(k) dispatch (§35.14) · libev vs libevent vs libuv (§35.15)
1. § 35.1 — Why libev? From Raw epoll to an Event Library
epoll_wait() answers one question: which fds are readable or writable right now? A real server also needs timers, signal handling, child-process exit notification, cross-thread wakeup, and periodic callbacks — all wired together correctly, fork-safely, and with minimal overhead. libev provides this through a single abstraction: a watcher that attaches a callback to any event source.
Full Syscall Path: Application → libev → Kernel → Hardware
When your application calls ev_io_start(loop, w) for a socket fd, here is the exact path through the kernel before your callback fires:
ev_io_start()links the watcher intoanfds[fd].headand addsfdtofdchanges[](no syscall yet — deferred).- On the next
ev_run()iteration Phase 2:epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev)— kernel inserts anepiteminto theeventpollRB-tree. It also installsep_poll_callbackas a wait-queue entry in the socket'ssk_wq. - Phase 4:
epoll_wait(epfd, events, 64, timeout_ms)— process sleeps onep->wq. - NIC receives a packet → DMA →
sk_data_ready()→ep_poll_callback()fires in softirq context → appendsepitemtordllist→wake_up(&ep->wq). epoll_waitreturns → libev looks upanfds[fd]→ walks theev_iowatcher list → adds matching watchers topendings[priority].- Phase 6: watcher callback fires in the event loop thread.
libev vs libevent vs libuv
| Aspect | libev | libevent | libuv |
|---|---|---|---|
| Timer heap | 4-heap (cache-optimal) | Binary heap | Binary heap |
| Bytes/watcher | ~56 B (ev_io) | ~120 B (event) | ~120 B (uv_poll_t) |
| I/O buffering | None — user handles read/write | bufferevent (rx/tx queues) | Stream with alloc_cb |
| Thread pool | None | None (use threads addon) | Yes — uv_work_t for blocking ops |
| DNS | None | evdns (async) | c-ares via thread pool |
| Timer clock | double (CLOCK_MONOTONIC, ns) | timeval (µs) | uint64_t (ms) |
| Used by | nginx (historical), many daemons | Tor, tmux, Chromium | Node.js, Julia |
Interactive: Backend Selection
2. § 35.2 — The Loop: struct ev_loop Internals
Every libev application creates one ev_loop per thread (or uses the default loop). The struct is the central hub connecting all watcher types to the backend. Its key fields fall into four groups:
ev_loop Field Map
| Field | Type | Role |
|---|---|---|
backend_fd | int | epoll fd (Linux) or kqueue fd (BSD) — created by epoll_create1(EPOLL_CLOEXEC) in ev_loop_new() |
anfds / anfdmax | ANFD[] / int | Array indexed by fd number. Each ANFD holds emask (current epoll interest) and the linked list of ev_io watchers for that fd |
fdchanges[] | int[] | Pending fd updates. fds whose interest mask changed since last iteration. Flushed to epoll_ctl in Phase 2 — never called immediately on ev_io_start/stop |
timers / timercnt | ev_timer** / int | 4-ary min-heap of ev_timer structs. Root (index 0) always holds the earliest deadline |
periodics / periodiccnt | ev_periodic** / int | Separate min-heap for ev_periodic (wall-clock recurrence, self-correcting after NTP jumps) |
pendings[NUMPRI][] | WATCHER*[] | Per-priority arrays of ready-to-dispatch watchers. Indexed by priority level (EV_MINPRI=-2 to EV_MAXPRI=2) |
prepares / checks / idles | linked lists | ev_prepare (Phase 1), ev_check (post-Phase 4), ev_idle (when no other pending) watcher lists |
async_fd | int | eventfd or self-pipe read end. Internal ev_io watches this — woken by ev_async_send() from other threads |
curpid | pid_t | Cached getpid() result. At loop start: if getpid() != curpid, fork was detected → ev_loop_fork() called automatically |
now / now_floor | double | Cached CLOCK_MONOTONIC time (seconds). Updated once per iteration to minimize clock_gettime() overhead |
ANFD Array — the fd → watcher Index
anfds[fd] is libev's merge point for all watchers on the same fd. When multiple ev_io watchers monitor the same fd, only ONE epitem exists in the kernel — the union of all interest masks is passed to epoll_ctl(MOD).
Interactive: ev_loop Struct Explorer
3. § 35.3 — The Loop Iteration: ev_run() Phase-by-Phase
Each call to ev_run(loop, 0) enters a tight loop. Every iteration has exactly 6 phases in strict order. Understanding each phase tells you exactly when a syscall is made and why.
| Phase | What happens | Syscall | Data structure read/written |
|---|---|---|---|
| 1 — Prepare | All ev_prepare callbacks fire (e.g., integrate foreign event loops, flush buffers) | None | prepares linked list |
| 2 — fd flush | fdchanges[] drained: compute new interest mask per fd, call epoll_ctl ADD/MOD/DEL | epoll_ctl() | fdchanges[], anfds[].emask, kernel RB-tree |
| 3 — Timeout | Peek 4-heap root (earliest timer), compute epoll_wait ms timeout | clock_gettime() (vDSO, no context switch) | timers[0].at, ev_now() |
| 4 — Poll | Block in epoll_wait; on return, walk anfds[fd] for each ready event, enqueue pendings | epoll_wait() | rdllist → anfds → pendings[] |
| 5 — Timers | Update ev_now(); pop expired timers from 4-heap; mark pending + re-heap if repeat>0 | clock_gettime() (vDSO) | timers[] 4-heap, pendings[] |
| 6 — Dispatch | Drain pendings[EV_MAXPRI] first down to EV_MINPRI; call ev_check; call ev_idle if pendings empty | None (user callbacks) | pendings[5][] |
Why fdchanges Defers epoll_ctl
A naive implementation would call epoll_ctl() immediately inside ev_io_start(). libev instead queues the fd into fdchanges[] and flushes them all in Phase 2. This batches multiple watcher changes on the same fd into a single epoll_ctl call. If a callback calls ev_io_stop then ev_io_start on the same fd, only one epoll_ctl(MOD) is issued — not a DEL followed by an ADD.
Interactive: ev_run() Phase Stepper
4. § 35.4 — ev_io Watcher: From Registration to Callback
ev_io is the most commonly used watcher. It bridges application fd interest to the backend epoll. The complete path from ev_io_start() to callback invocation touches three data structures: the anfds[] array, fdchanges[], and pendings[].
ev_io_start() → epoll_ctl → Dispatch
| Step | Location | What happens |
|---|---|---|
| ev_io_start(loop, w) | libev (userspace) | Link w into anfds[w->fd].head; compute new interest mask; if changed: add fd to fdchanges[] |
| Phase 2 flush | libev (userspace) | epoll_ctl(epfd, ADD|MOD|DEL, fd, mask) — batch call covering all fdchanges |
| Kernel: epoll_ctl ADD | kernel (syscall) | Insert epitem into eventpoll RB-tree; install ep_poll_callback as wait_queue entry in sock->sk_wq |
| Data arrives on fd | kernel softirq | NIC DMA → sk_data_ready() → ep_poll_callback() → append epitem to rdllist → wake_up(&ep->wq) |
| epoll_wait returns | libev (userspace) | events[i] = {fd, EPOLLIN}; walk anfds[fd].head; for each w: if w->events & revents → pendings[w->priority].append(w) |
| Phase 6 dispatch | libev (userspace) | w->cb(loop, w, EV_READ) — your application callback |
Interactive: ev_io Lifecycle (Start → epoll_ctl → Event → Stop)
Interactive: ANFD Merger — Multiple Watchers, One epitem
5. § 35.5 — Timer Internals: The 4-Heap Algorithm
libev uses a quaternary min-heap (4-heap) rather than the more common binary heap. The root always holds the earliest deadline; its value is peeked in Phase 3 to compute the epoll_wait timeout.
4-Heap Tree Structure
Each node at index i has children at indices 4i+1, 4i+2, 4i+3, 4i+4. Its parent is at (i-1)/4. Compared to a binary heap:
| Property | Binary heap (libevent) | 4-heap (libev) |
|---|---|---|
| Children per node | 2 | 4 |
| Tree height for N=1M | log₂(1M) = 20 levels | log₄(1M) = 10 levels |
| Comparisons per insert (sift-up) | 20 × 1 = 20 | 10 × 3 = 30 (compare up to 3 siblings) |
| Comparisons per delete (sift-down) | 20 × 2 = 40 | 10 × 4 = 40 |
| Cache lines per sift-down level | 2 children = 16B → may span 2 cache lines | 4 children = 32B → ONE 64B cache line |
| Real-world performance | Baseline | ~20% faster due to cache locality |
Heap Operations
ev_timer_start(loop, w): computew->at = ev_now() + w->after; append totimers[timercnt++];upheap(timercnt - 1)— O(log₄ n)ev_timer_stop(loop, w): swaptimers[w->active]withtimers[--timercnt]; calldownheap()orupheap()on the moved element — O(log₄ n)- Phase 5 expiry:
while (timers[0]->at <= ev_now())→ pop root, mark pending, ifw->repeat > 0:w->at = ev_now() + w->repeat→ re-heap
Interactive: 4-Heap Visualizer
6. § 35.6 — ev_timer vs ev_periodic: Monotonic vs Wall Clock
| Property | ev_timer | ev_periodic |
|---|---|---|
| Clock used | CLOCK_MONOTONIC (never jumps backward) | CLOCK_REALTIME (wall clock, can jump) |
| Scheduling | Relative: now + after, repeat from current ev_now() | Absolute: next multiple of (offset + N × interval) |
| NTP jump behavior | Unaffected — monotonic clock continues | Self-corrects to next wall-clock slot |
| Drift | Accumulates if callback is slow (reschedules from ev_now()) | No drift — always anchors to wall-clock grid |
| Reschedule callback | None | Optional reschedule_cb(w, now) for arbitrary schedule logic |
| Use case | Timeouts, retries, keepalives | Cron-like jobs that must fire at wall-clock times |
Interactive: Timer Drift and NTP Correction
7. § 35.7 — Signal Watchers: ev_signal and the Signal Pipe
Signal handlers can interrupt any code — including malloc, epoll bookkeeping, and your own data structures. Calling libev dispatch functions from a signal handler is not async-signal-safe. libev's solution: the signal handler only writes 1 byte to a self-pipe (or increments an eventfd), and the actual ev_signal callback fires safely in the event loop thread.
Signal Delivery Path Step-by-Step
ev_signal_start(loop, w, SIGINT)→sigaction(SIGINT, &sa, NULL)installs handler → creates self-pipe withpipe2(O_NONBLOCK|O_CLOEXEC)→ internalev_ioonpipefd[0]registered with epoll.- SIGINT arrives → kernel delivers signal → handler runs:
signals[SIGINT].pending = 1;write(pipefd[1], "\0", 1)— async-signal-safe. epoll_waitreturns EPOLLIN onpipefd[0].- libev reads and drains the pipe; scans
signals[]; dispatches all pendingev_signalcallbacks.
Interactive: Signal Pipe Delivery
8. § 35.8 — All Watcher Types and Their Syscalls
| Watcher | Registration syscall | Event delivery | Typical use |
|---|---|---|---|
ev_io | epoll_ctl(ADD/MOD) | epoll_wait returns fd ready | Network I/O, any fd |
ev_timer | None (heap insert) | epoll_wait timeout expires | Connection timeout, retry |
ev_periodic | None (heap insert) | epoll_wait timeout, wall-clock aligned | Cron-like, hourly jobs |
ev_signal | sigaction(signum) + pipe2() | EPOLLIN on self-pipe | SIGINT/SIGTERM handlers |
ev_child | sigaction(SIGCHLD) | SIGCHLD → pipe → waitpid(WNOHANG) | Child process exit |
ev_stat (Linux) | inotify_add_watch() | EPOLLIN on inotify_fd → stat() | File/dir change detection |
ev_idle | None | Fires when pendings[] empty (no kernel) | Background GC, log flush |
ev_prepare | None | Phase 1 hook (before epoll_wait) | State update, foreign loop |
ev_check | None | Post-Phase 4 hook (after epoll_wait) | Foreign loop dispatch |
ev_async | eventfd() or pipe2() | EPOLLIN on eventfd from any thread | Cross-thread wakeup |
ev_embed | nested ev_loop | ev_io on nested loop backend_fd | Embed foreign event loop |
ev_fork | None | fork detection via getpid() check | Re-init after fork |
ev_child — waitpid in the Event Loop
ev_child reuses the signal pipe: SIGCHLD fires → pipe wakes epoll → libev calls waitpid(-1, &status, WNOHANG) in a loop until it returns 0 → dispatches matching ev_child watchers by PID. The watcher's rstatus field contains the exit status for inspection in the callback.
ev_async — the Only Thread-Safe libev Call
ev_async_send(loop, w) may be called from any thread or signal handler. It calls write(loop->async_fd, &one, 8) on an eventfd, which is specified as atomic. The main loop's epoll_wait returns EPOLLIN on async_fd; libev reads to reset the counter and dispatches all pending ev_async watchers.
9. § 35.10 — Fork Safety: Why epoll Breaks After fork()
fork() copies the parent's address space, including ev_loop and its backend_fd (the epoll fd). Both parent and child now hold file descriptors pointing to the same kernel eventpoll object. The child's new sockets are not registered, and calling epoll_wait in the child monitors the parent's file descriptors — a serious bug.
ev_loop_fork() — What It Does
close(old_epfd)— only closes this process's fd; the parent's epoll is unaffected.epfd = epoll_create1(EPOLL_CLOEXEC)— fresh kerneleventpollfor the child.- Iterate all active
anfds[]: for each entry with non-zeroemask, callepoll_ctl(new_epfd, ADD, fd, mask)— O(active fds). - Re-initialize signal pipe and
async_fd. curpid = getpid()— resets fork detection for the child's own future forks.
libev detects fork automatically: at the start of each ev_run()iteration it compares getpid() against loop->curpid. If they differ, it calls ev_loop_fork() before proceeding. You can also call it explicitly right after fork().
Interactive: fork() + epoll Bug and Fix
10. § 35.11 — Threading Model and ev_async
libev's threading rule is simple: one event loop per thread; never call ev_io_start/stop or ev_run from a different thread. The only cross-thread primitive is ev_async_send().
Correct Multi-Thread Pattern
- Main thread creates
ev_asyncwatcher, starts it, then callsev_run(). - Worker threads do CPU-intensive work (compute, compress, encrypt).
- Worker completion:
ev_async_send(loop, &async_w)— writes toeventfd, thread-safe. - Main thread wakes from
epoll_wait→ dispatchesev_asynccallback → safely callsev_io_start/stop, enqueues response, modifies loop state.
Interactive: Cross-Thread Wakeup with ev_async
11. § 35.14 — Worked Example: Echo Server O(k) Dispatch
Sequence: Client Connect → Data → Echo
Why O(k), Not O(N)
With 10,000 connected clients and only 3 sending data simultaneously, epoll_wait returns 3 events. libev looks up anfds[fd_A], anfds[fd_B], anfds[fd_C] — the other 9,997 clients are never touched. Compare with select/poll which scans all 10,000 fds every wakeup.
Interactive: O(k) Dispatch Demo
12. § 35.15 — libev vs libevent vs libuv
Interactive: Timer Heap and Memory Comparison
13. Interview Prep
| Question | Answer |
|---|---|
| Walk through one full ev_run() iteration for a server with 10 ev_io watchers and 2 ev_timer watchers. | Phase 1: ev_prepare callbacks (none if not registered). Phase 2: flush fdchanges[] — call epoll_ctl ADD/MOD/DEL for any changed fds. Phase 3: peek timers[0] from 4-heap root, compute epoll_wait timeout = min(deadline - ev_now(), max_block). Phase 4: epoll_wait() blocks; on return walk anfds[fd] for each ready event, enqueue matching ev_io watchers into pendings[priority]. Phase 5: clock_gettime(CLOCK_MONOTONIC), pop expired timers from 4-heap, mark pending + re-heap if repeat>0. Phase 6: drain pendings highest-priority first, call callbacks; then ev_check; then ev_idle if pendings empty. |
| What is ANFD? Why does libev need it? How many epoll_ctl calls result from 3 ev_io watchers on the same fd? | ANFD (Array of Notified File Descriptors) is anfds[fd] — one entry per fd, holding a linked list of all ev_io watchers for that fd and the current epoll interest mask (emask). It is needed because epoll allows only ONE epitem per (epfd, fd) pair, but multiple watchers (e.g., EV_READ + EV_WRITE) may exist. libev merges all watcher event masks via OR and issues ONE epoll_ctl call. With 3 ev_io watchers on the same fd: still just 1 epoll_ctl ADD (or MOD if mask changes). |
| Why does libev use a 4-heap instead of a binary heap for timers? Quantify the cache benefit. | A 4-heap has log₄(n) height vs log₂(n) for a binary heap — half the levels for the same N. During sift-down, all 4 children of a node fit in one 64-byte cache line (4 × 8 bytes of pointers). A binary heap may need 2 cache line accesses for 2 children. With N=1M timers: binary heap needs 20 levels × 2 cache accesses = 40 cache misses per op; 4-heap needs 10 levels × 1 cache access = 10 cache misses. Real-world: ~20% faster in timer-heavy benchmarks. |
| What is the difference between ev_timer and ev_periodic? Which clock does each use? What happens to each after an NTP time jump? | ev_timer uses CLOCK_MONOTONIC and schedules relative to ev_now() (current loop time). After a slow callback, it reschedules from the current time, so it can drift. NTP jumps do not affect it. ev_periodic uses CLOCK_REALTIME and anchors to absolute wall-clock slots (offset + N × interval). After an NTP jump, ev_periodic self-corrects by advancing its deadline to the next valid wall-clock slot. Use ev_timer for timeouts and retries; ev_periodic for cron-like tasks that must fire at wall-clock times. |
| Explain why fork() breaks an inherited epoll fd. What does ev_loop_fork() do? | fork() copies the parent's ev_loop including backend_fd. Both parent and child now hold fds pointing to the SAME kernel eventpoll object. The child's new sockets are not in that eventpoll; if the child calls epoll_wait it monitors the parent's fds. ev_loop_fork() fixes this: close(old_epfd), epoll_create1() for a fresh eventpoll, re-register all active anfds with epoll_ctl(ADD), re-init signal pipe, update curpid. libev also detects fork automatically via getpid() != curpid at the start of ev_run(). |
| How does ev_signal deliver signal events safely? Draw the path. | sigaction(signum) installs a handler; the handler only calls write(pipefd[1], 1 byte) — async-signal-safe. An internal ev_io monitors pipefd[0] with EPOLLIN registered in epoll. When the signal arrives: signal handler writes to pipe → epoll_wait returns EPOLLIN on pipe fd → libev reads pipe to drain, checks signals[signum].pending flag → dispatches ev_signal callback in the event loop thread. This guarantees the callback runs in a safe, non-signal context. |
| What is ev_async_send()? Why is it the only libev function safe to call from another thread? | ev_async_send(loop, w) writes to an eventfd (or pipe) using write(), which is specified as async-signal-safe and can be called from any thread. The main loop's epoll_wait returns EPOLLIN on the eventfd fd, drains it with read(), then dispatches ev_async callbacks — all in the main thread where it is safe to modify loop state. All other libev functions (ev_io_start, ev_io_stop, etc.) must run in the loop's owning thread. |
| How does fdchanges[] reduce syscall overhead? | ev_io_start/stop never call epoll_ctl immediately. Instead they add the fd to fdchanges[]. Phase 2 of each ev_run() iteration flushes all changes in a batch. If start + stop + start happen on the same fd in one iteration, only one epoll_ctl(MOD) is issued — not ADD+DEL+ADD. For a server accepting many connections in one callback, all new fds are batched into one set of epoll_ctl calls at the next Phase 2, not one syscall per accept(). |
| Compare the delivery path for ev_io EV_READ vs ev_timer — which involves a kernel interrupt? | ev_io EV_READ: NIC interrupt → softirq NAPI → sk_data_ready() → ep_poll_callback() (still in softirq) → rdllist append → wake_up() → epoll_wait returns → anfds[fd] lookup → pendings → callback. ev_timer: no kernel interrupt at all. libev peeks the 4-heap root in Phase 3 to compute epoll_wait timeout. When epoll_wait returns (either due to I/O or timeout), Phase 5 checks clock_gettime() against heap[0].at and pops expired timers. The timer fires via a timeout in epoll_wait, not via an IRQ. |
| How does the pendings[] priority array work? | pendings is an array of arrays: pendings[EV_MAXPRI - EV_MINPRI + 1] = pendings[5]. Each slot holds watchers at one priority level. Phase 6 drains from index 4 (EV_MAXPRI=2) down to index 0 (EV_MINPRI=-2). Set priority with ev_set_priority(w, EV_MAXPRI) before ev_start. Priority only affects dispatch order within one loop iteration, not which event epoll_wait returns first. |