I/O Multiplexing — Part XXXIX

End-to-End Interactive: NIC → read() Returns

Walk the complete kernel path — from NIC DMA descriptor to userspace buffer — across nine subsystems with interactive struct inspection, select vs epoll parallel comparison, and C10K scaling visualization.

§ 39.1 — The 9-Step Journey (epoll path)

A TCP packet arriving at the NIC passes through nine distinct steps in the kernel before the application's read() returns data. Assuming the process is already blocked in epoll_wait(), it sleeps in TASK_INTERRUPTIBLE through NIC DMA, the hard IRQ, the NAPI softirq, the protocol stack, and the socket wait queue traversal. It transitions to TASK_RUNNING only at step 6, when ep_poll_callback calls wake_up, which reaches try_to_wake_up and enqueues the task on the CFS run queue.

| Step | Kernel function | Process state |
|---|---|---|
| 1 NIC DMA | igb hardware | INTERRUPTIBLE |
| 2 IRQ | igb_intr | INTERRUPTIBLE |
| 3 NAPI | igb_poll | INTERRUPTIBLE |
| 4 L2–L4 | ip_rcv → tcp_v4_rcv | INTERRUPTIBLE |
| 5 sk_data_ready | sock_def_readable | INTERRUPTIBLE |
| 6 ep_poll_callback | ep_poll_callback | INTERRUPTIBLE → RUNNING |
| 7 Wakeup | try_to_wake_up | RUNNING |
| 8 epoll_wait returns | ep_send_events | RUNNING |
| 9 read() | tcp_recvmsg | RUNNING |
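
Steps 5 through 9 can be observed from userspace with a short sketch. This assumes Linux (Python's select.epoll is Linux-only) and uses an AF_UNIX socketpair in place of a TCP connection, so no NIC, IRQ, or NAPI stage is exercised. The function name epoll_roundtrip is hypothetical. Because the peer sends before the wait, the caller does not actually sleep here; the point is only the order of the calls.

```python
import select
import socket


def epoll_roundtrip(payload: bytes, timeout: float = 1.0) -> bytes:
    """Register a socket with epoll, wait for readiness, then read it."""
    # A connected AF_UNIX pair stands in for a TCP connection (assumption).
    reader, writer = socket.socketpair()
    ep = select.epoll()
    try:
        # epoll_ctl(ADD): interest is registered once and persists.
        ep.register(reader.fileno(), select.EPOLLIN)

        # The peer's send queues data on the reader's receive queue, which
        # triggers the socket's data-ready notification (step 5 analog).
        writer.sendall(payload)

        # epoll_wait(): returns only the fds that are ready (step 8).
        events = ep.poll(timeout)
        ready_fds = [fd for fd, mask in events if mask & select.EPOLLIN]
        if reader.fileno() not in ready_fds:
            raise TimeoutError("no readable event within timeout")

        # read(): copy the queued bytes into a userspace buffer (step 9).
        return reader.recv(len(payload))
    finally:
        ep.close()
        reader.close()
        writer.close()


if __name__ == "__main__":
    print(epoll_roundtrip(b"hello"))
```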

Demo 3521 — 9-Step Packet Journey

Click any step or use Prev / Next to walk through the kernel call chain. Yellow-highlighted fields are the ones modified or read at that step. The right panel shows process state, sk_buff location, and the memory delta at each stage.

Step 1 of 9: NIC DMA

Active struct: struct rx_desc (RX ring)

| Field | Value |
|---|---|
| buf_addr | 0xffff888004a00000 |
| length | 128 |
| status: DD | 0 → 1 |

The NIC DMA-writes 128 bytes into a pre-allocated kernel buffer, then sets the DD (Descriptor Done) bit in the RX ring descriptor to signal completion.

Function: none (the igb NIC hardware performs the DMA write)
Source file: hardware / DMA engine
Process state: TASK_INTERRUPTIBLE
sk_buff: in NIC DMA buffer (not yet an sk_buff)
Memory Δ: RX ring descriptor status DD=0 → DD=1
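
The descriptor handshake in step 1 can be modeled in a few lines. This is a simplified sketch, not the igb driver's actual layout: the class names RxDesc and RxRing, the 2048-byte buffer size, and the next_to_use / next_to_clean indices are illustrative assumptions. It shows only the ownership rule: hardware fills a buffer and sets DD, and the driver harvests slots whose DD bit is set.

```python
from dataclasses import dataclass
from typing import List

DD = 0x1  # Descriptor Done bit in the status field (as in the walkthrough)


@dataclass
class RxDesc:
    buf_addr: int    # DMA address of the pre-allocated buffer
    length: int = 0  # bytes written by the NIC
    status: int = 0  # DD is set by hardware when the write completes


class RxRing:
    """Toy RX ring: hardware produces, the driver consumes."""

    def __init__(self, size: int, base_addr: int, buf_size: int = 2048):
        self.descs = [RxDesc(base_addr + i * buf_size) for i in range(size)]
        self.buffers = {d.buf_addr: b"" for d in self.descs}
        self.next_to_use = 0    # hardware's write position
        self.next_to_clean = 0  # driver's read position

    def nic_dma_write(self, frame: bytes) -> None:
        """Hardware side: fill the buffer, then publish it by setting DD."""
        desc = self.descs[self.next_to_use]
        if desc.status & DD:
            raise BufferError("ring full: driver has not cleaned this slot")
        self.buffers[desc.buf_addr] = frame
        desc.length = len(frame)
        desc.status |= DD
        self.next_to_use = (self.next_to_use + 1) % len(self.descs)

    def driver_poll(self, budget: int) -> List[bytes]:
        """Driver side (NAPI-style): harvest up to `budget` completed slots."""
        frames = []
        while len(frames) < budget:
            desc = self.descs[self.next_to_clean]
            if not (desc.status & DD):
                break  # hardware has not finished this descriptor yet
            frames.append(self.buffers[desc.buf_addr][:desc.length])
            desc.status &= ~DD  # hand the slot back to hardware
            desc.length = 0
            self.next_to_clean = (self.next_to_clean + 1) % len(self.descs)
        return frames
```

Calling nic_dma_write with a 128-byte frame on a fresh ring leaves descriptor 0 with length 128 and DD set, matching the state shown for step 1; a later driver_poll returns that frame and clears DD.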

§ 39.2 — select() vs epoll() Side-by-Side

Both select() and epoll() share the same sk_data_ready → wake_up trigger at step 5. The divergence happens immediately after: select's wait queue callback wakes the process, which must then re-scan all N monitored fds, while epoll's ep_poll_callback inserts exactly one epitem into the ready list before waking the process. With k fds ready, the result is O(N) work for select versus O(k) for epoll at every wakeup.

| Step | select() path | epoll() path |
|---|---|---|
| Registration | At select() call: N × poll_wait() across N sockets | At epoll_ctl(ADD): 1 eppoll_entry per socket, persists |
| Step 5 wakeup | wake_up invokes pollwake, which wakes the sleeping process | ep_poll_callback adds to rdllist, then wakes |
| After wakeup | Process re-polls ALL N fds (vfs_poll() × N) | Process reads ready fds from rdllist only |
| Cost | O(N): scan all N fds | O(k): only k ready fds |
| Cleanup | Remove all N wait_queue_entry_t | Nothing to remove |
| Next call cost | Re-register all N wait entries | 0: eppoll_entries persist |
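
The cost difference after wakeup can be made concrete with a small counting model. This is a userspace sketch, not kernel code: SelectModel and EpollModel are hypothetical names, poll_calls stands in for vfs_poll() invocations, and rdllist_ops counts ready-list insertions.

```python
from collections import deque
from typing import Deque, Iterable, List, Set


class SelectModel:
    """Counts the per-wakeup scan that a select()-style call performs."""

    def __init__(self, fds: Iterable[int]):
        self.fds = list(fds)
        self.poll_calls = 0  # stands in for vfs_poll() invocations

    def wait(self, ready_now: Set[int]) -> List[int]:
        # After wakeup the caller does not know which fd fired,
        # so every monitored fd is polled again.
        ready = []
        for fd in self.fds:
            self.poll_calls += 1
            if fd in ready_now:
                ready.append(fd)
        return ready


class EpollModel:
    """Counts ready-list insertions made by an ep_poll_callback analog."""

    def __init__(self, fds: Iterable[int]):
        self.registered = set(fds)
        self.rdllist: Deque[int] = deque()
        self.rdllist_ops = 0

    def callback(self, fd: int) -> None:
        # Runs when one socket becomes ready; touches only that entry.
        if fd in self.registered and fd not in self.rdllist:
            self.rdllist.append(fd)
            self.rdllist_ops += 1

    def wait(self) -> List[int]:
        # The waiter drains the ready list; unready fds are never visited.
        ready = list(self.rdllist)
        self.rdllist.clear()
        return ready
```

With 100 monitored fds and one ready fd, SelectModel.wait performs 100 poll calls while EpollModel records a single rdllist insertion.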

Demo 3522 — select vs epoll Parallel Animation

Adjust N (number of monitored fds) and trigger one fd becoming ready. Watch select scan all N file descriptors while epoll completes in a single rdllist insertion.

Initial state (N = 100): after the shared sk_data_ready → wake_up trigger on the socket wait queue, the select() panel counts vfs_poll calls (0 / 100) and the epoll() panel counts rdllist operations (0, rdllist empty). Both paths wait until the simulation is started.

§ 39.3 — 10K Connection Stress Visualization

At C10K scale — 10,000 simultaneous connections — the O(N) cost of select() becomes intolerable. Every wakeup, even for a single arriving packet, forces the kernel to call vfs_poll() on all 10,000 file descriptors. epoll() performs exactly k operations — one per ready fd — regardless of how many total connections are registered.

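The fd_set copy cost is also linear. With 10,000 connections, select copies a 1,250-byte bitmap (⌈10000/8⌉ bytes) into the kernel for each monitored set on every call and copies the result bitmap back out, while epoll copies only k × 12 bytes of epoll_event structs back to userspace. The 12-byte figure is the packed x86-64 layout; other 64-bit architectures such as arm64 pad the struct to 16 bytes. Note that glibc fixes fd_set at FD_SETSIZE = 1024 descriptors, so a stock select() cannot monitor 10,000 fds at all; poll() lifts that limit but keeps the same O(N) scan.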
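
The copy-cost arithmetic above can be checked with a short calculation. The function names are hypothetical, the default of two directions (copy in plus copy out of one bitmap) follows the demo's ⌈N/8⌉ × 2 formula, and the 12-byte event size is the x86-64 figure quoted in the text.

```python
def select_copy_bytes(n_fds: int, directions: int = 2) -> int:
    """Bytes moved for one fd_set bitmap: ceil(N / 8) per direction."""
    bitmap_bytes = (n_fds + 7) // 8  # integer ceiling of n_fds / 8
    return bitmap_bytes * directions


def epoll_copy_bytes(k_ready: int, event_size: int = 12) -> int:
    """Bytes copied back to userspace: one epoll_event per ready fd."""
    return k_ready * event_size


def ops_ratio(n_fds: int, k_ready: int) -> float:
    """Per-wakeup work ratio: N polls for select versus k for epoll."""
    return n_fds / k_ready


if __name__ == "__main__":
    for n, k in [(1_000, 5), (10_000, 1)]:
        print(n, k, select_copy_bytes(n), epoll_copy_bytes(k), ops_ratio(n, k))
```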

Demo 3523 — 10K Connection Scaling

Adjust N (total connections) and K (ready fds per tick), then press Run to simulate 20 ticks. At N=10,000, K=1, the select bar fills the entire row while the epoll bar is a single pixel.

Initial state: N = 1,000 connections, K = 5 ready fds per tick.

| Metric | select() | epoll() |
|---|---|---|
| Ops per tick | 1,000 vfs_poll calls | 5 rdllist ops |
| Bytes copied | 250 bytes (⌈1000/8⌉ × 2 = 250B) | 60 bytes (5 × 12B = 60B) |

Ratio: 200× (bar width is proportional to ops per tick). The 20-tick cumulative chart is empty until the simulation is run.

§ 39.4 — Complete Data Structure Map

Every kernel object touched during the NIC-to-read() journey exists in one of six subsystems. They chain together as follows: a process fd table slot points into the VFS layer, which leads to a socket, whose wait queue head holds the eppoll_entry that bridges into the epoll instance. The NIC delivers data via DMA into an sk_buff, which travels up the protocol stack and lands in the sk_receive_queue of the same sock that owns the wait queue.

| Struct | Layer | Key fields |
|---|---|---|
| task_struct | Process | files → files_struct |
| fdtable | fd table | fd[]: O(1) array lookup by fd number |
| struct file | VFS | f_op (socket_file_ops), private_data → socket |
| struct socket | POSIX layer | ops (inet_stream_ops), sk → sock, wq |
| struct sock | Protocol layer | sk_receive_queue, sk_wq, sk_data_ready |
| socket_wq | Wait | wait (wait_queue_head_t) |
| eppoll_entry | epoll bridge | wait.func=ep_poll_callback, base → epitem |
| epitem | epoll | fd, event, rdllink (list_head into rdllist) |
| eventpoll | epoll instance | rbr (RB-tree of epitems), rdllist, wq |
| sk_buff | Packet | head/data/tail/end (linear buffer), sk = sock |

The two critical pointers: eppoll_entry.base → epitem (installed once at epoll_ctl(ADD), lives until epoll_ctl(DEL)) and sk_buff.sk → sock (set by tcp_v4_rcv after the 4-tuple hash lookup). These two pointers are what allow a raw DMA buffer to wake a specific sleeping process in O(1) without scanning any fd list.
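
The two pointers can be traced in a toy object model. This is a sketch under simplifying assumptions: the Python classes mirror the kernel struct names but keep only the fields needed for the chain, the connection table is a plain dict keyed by a 4-tuple, and deliver is a hypothetical stand-in for the lookup, queue, and notify work done on the receive path.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)
class EventPoll:
    rdllist: List["EpItem"] = field(default_factory=list)
    woken: bool = False  # stands in for waking the epoll_wait sleeper


@dataclass(eq=False)
class EpItem:
    fd: int
    ep: EventPoll


@dataclass(eq=False)
class EppollEntry:
    base: EpItem  # eppoll_entry.base -> epitem, installed at epoll_ctl(ADD)

    def func(self) -> None:
        """ep_poll_callback analog: link the epitem, then wake the waiter."""
        ep = self.base.ep
        if self.base not in ep.rdllist:
            ep.rdllist.append(self.base)
        ep.woken = True


@dataclass(eq=False)
class Sock:
    receive_queue: List["SkBuff"] = field(default_factory=list)
    wait: List[EppollEntry] = field(default_factory=list)  # sk_wq->wait

    def data_ready(self) -> None:
        """sock_def_readable analog: run each callback on this socket only."""
        for entry in self.wait:
            entry.func()


@dataclass(eq=False)
class SkBuff:
    data: bytes
    sk: Optional[Sock] = None  # sk_buff.sk -> sock, set after the lookup


FourTuple = Tuple[str, int, str, int]


def deliver(skb: SkBuff, table: Dict[FourTuple, Sock], key: FourTuple) -> None:
    """Receive-path analog: one hash lookup, then queue and notify."""
    sk = table[key]  # no fd list is scanned
    skb.sk = sk
    sk.receive_queue.append(skb)
    sk.data_ready()
```

Delivering one packet touches exactly one Sock and one EppollEntry regardless of how many sockets are in the table, which is the O(1) wakeup property described above.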

Interview Prep — Synthesis Questions

These questions require connecting multiple subsystems. A strong answer names the specific kernel function and data structure at each stage.

Q: Every major kernel function from NIC to epoll_wait returning?
A: igb_intr → napi_schedule → igb_poll → netif_receive_skb → ip_rcv → tcp_v4_rcv → tcp_rcv_established → skb_queue_tail → sock_def_readable → ep_poll_callback → wake_up → try_to_wake_up → schedule → ep_send_events → epoll_wait returns

Q: At what step is the process added to the CPU run queue? Which function?
A: Step 6: ep_poll_callback calls wake_up → try_to_wake_up → ttwu_queue

Q: Where does sk_data_ready fit? Default implementation for TCP?
A: Called after skb enters sk_receive_queue. Default: sock_def_readable() → wake_up on sk_wq->wait

Q: Why O(n) for select but O(k) for epoll at wakeup?
A: Same sk_data_ready → wake_up trigger. select's pollwake wakes process to re-scan N fds. epoll's ep_poll_callback inserts 1 epitem into rdllist, so only ready fds are touched.

Q: N=10000 epoll-registered fds, 1 packet: how many wait queue accesses?
A: 1 traversal: only the eppoll_entry for the ready socket's callback fires. The other 9,999 eppoll_entries are not touched.

Q: Wait queue entry lifetime: select vs epoll?
A: select: N entries created each call, removed on return. epoll: eppoll_entries installed once at epoll_ctl(ADD), live until epoll_ctl(DEL) or fd close.

Q: Why does ep_poll_callback run in softirq context? What constraints?
A: sk_data_ready called from tcp_rcv_established, which runs in NAPI softirq. Constraints: cannot sleep, must use spinlock not mutex, must complete quickly.
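
The wait queue entry lifetime contrast from the questions above can be modeled by counting queue mutations across repeated calls. This is a sketch with hypothetical names (WaitQueues, select_call, epoll_add_all, epoll_wait_call); each socket's wait queue is reduced to a list of callback labels.

```python
from typing import Dict, Iterable, List


class WaitQueues:
    """One wait queue (a list of callback labels) per socket fd."""

    def __init__(self, fds: Iterable[int]):
        self.queues: Dict[int, List[str]] = {fd: [] for fd in fds}
        self.mutations = 0  # total add + remove operations

    def add(self, fd: int, entry: str) -> None:
        self.queues[fd].append(entry)
        self.mutations += 1

    def remove(self, fd: int, entry: str) -> None:
        self.queues[fd].remove(entry)
        self.mutations += 1


def select_call(wq: WaitQueues) -> None:
    """select(): entries live only for the duration of one call."""
    for fd in wq.queues:
        wq.add(fd, "pollwake")
    # (sleep, wakeup, and the O(N) scan would happen here)
    for fd in wq.queues:
        wq.remove(fd, "pollwake")


def epoll_add_all(wq: WaitQueues) -> None:
    """epoll_ctl(ADD) for every fd: entries are installed once."""
    for fd in wq.queues:
        wq.add(fd, "ep_poll_callback")


def epoll_wait_call(wq: WaitQueues) -> None:
    """epoll_wait(): modifies no socket wait queue."""
    return None
```

Over 10 calls with 100 fds, the select model performs 2,000 queue mutations, while the epoll model performs 100 at registration and none afterward.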