End-to-End Interactive: NIC → read() Returns
Walk the complete kernel path — from NIC DMA descriptor to userspace buffer — across nine subsystems with interactive struct inspection, select vs epoll parallel comparison, and C10K scaling visualization.
§ 39.1 — The 9-Step Journey (epoll path)
A TCP packet arriving at the NIC travels through nine distinct kernel subsystems before the application's read() returns data. The process sleeps in TASK_INTERRUPTIBLE throughout steps 1 to 5: NIC DMA, the hard IRQ, NAPI softirq processing, the protocol stack, and the wait queue traversal triggered by sk_data_ready. It transitions to TASK_RUNNING only at step 6, when ep_poll_callback calls wake_up and try_to_wake_up enqueues it on the CFS run queue.
| Step | Kernel function | Process state |
|---|---|---|
| 1 NIC DMA | igb hardware | INTERRUPTIBLE |
| 2 IRQ | igb_intr | INTERRUPTIBLE |
| 3 NAPI | igb_poll | INTERRUPTIBLE |
| 4 L2–L4 | ip_rcv → tcp_v4_rcv | INTERRUPTIBLE |
| 5 sk_data_ready | sock_def_readable | INTERRUPTIBLE |
| 6 ep_poll_callback | ep_poll_callback | INTERRUPTIBLE → RUNNING |
| 7 Wakeup | try_to_wake_up | RUNNING |
| 8 epoll_wait returns | ep_send_events | RUNNING |
| 9 read() | tcp_recvmsg | RUNNING |
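The userspace-visible end of this journey (registration, steps 8 and 9) can be sketched in a few lines. This is a minimal, Linux-only illustration using Python's `select.epoll` wrapper. It assumes a local `socket.socketpair()` in place of a real TCP connection, so the NIC, IRQ, and NAPI steps are not exercised. The function name `wait_and_read` is hypothetical.

```python
import select
import socket


def wait_and_read(payload=b"hello", timeout=1.0):
    """Register one socket with epoll, make it readable, then read it."""
    # Assumption: a local socket pair stands in for the TCP connection.
    rx, tx = socket.socketpair()
    ep = select.epoll()                           # wraps epoll_create
    try:
        ep.register(rx.fileno(), select.EPOLLIN)  # epoll_ctl(EPOLL_CTL_ADD)
        tx.sendall(payload)                       # data reaches rx's receive queue
        events = ep.poll(timeout)                 # epoll_wait(): ready fds only
        ready = [fd for fd, mask in events if mask & select.EPOLLIN]
        data = rx.recv(4096) if rx.fileno() in ready else b""
        return ready == [rx.fileno()], data
    finally:
        ep.close()
        rx.close()
        tx.close()


if __name__ == "__main__":
    print(wait_and_read())
```

Because the data is written before `poll()` is called, the call returns immediately with the one ready fd. With a real connection, the caller would sleep in `poll()` until the kernel path above completes.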
Demo 3521 — 9-Step Packet Journey
Click any step or use Prev / Next to walk through the kernel call chain. Yellow-highlighted fields are the ones modified or read at that step. The right panel shows process state, sk_buff location, and the memory delta at each stage.
§ 39.2 — select() vs epoll() Side-by-Side
Both select() and epoll() share the same sk_data_ready → wake_up trigger at step 5. The divergence happens immediately after: select wakes the process to re-scan all N fds, while epoll's ep_poll_callback inserts exactly one epitem into the ready list before waking the process. The result is O(N) work for select versus O(k) for epoll at every wakeup.
| Step | select() path | epoll() path |
|---|---|---|
| Registration | At select() call: N × poll_wait() across N sockets | At epoll_ctl(ADD): 1 eppoll_entry per socket, persists |
| Step 5 wakeup | pollwake callback wakes the sleeping process | ep_poll_callback adds the epitem to rdllist, then wakes the process |
| After wakeup | Process re-polls ALL N fds (vfs_poll() × N) | Process reads ready fds from rdllist only |
| Cost | O(N) — scan all N fds | O(k) — only k ready fds |
| Cleanup | Remove all N wait_queue_entry_t | Nothing to remove |
| Next call cost | Re-register all N wait entries | 0 — eppoll_entries persist |
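The per-call costs in the table can be reproduced with a toy userspace model. The sketch below is not kernel code: `Sock`, `Epoll`, `select_once`, and `compare` are hypothetical names, a Python list stands in for the socket's wait queue, and the initial polling pass that a real select() also performs is left out for simplicity.

```python
class Sock:
    """Toy socket: a readable flag plus a list standing in for sk_wq->wait."""

    def __init__(self, fd):
        self.fd = fd
        self.readable = False
        self.wait_queue = []

    def data_ready(self):
        # Stand-in for sk_data_ready: mark readable, run every queued callback.
        self.readable = True
        for callback in list(self.wait_queue):
            callback(self)


def select_once(socks, deliver):
    """One select()-style call. Returns (ready fds, work counters)."""
    work = {"registered": 0, "polled": 0, "removed": 0}
    woken = []

    def pollwake(sock):
        woken.append(sock.fd)          # only records that a wakeup happened

    for s in socks:                    # registration: one wait entry per fd
        s.wait_queue.append(pollwake)
        work["registered"] += 1
    deliver()                          # data arrives while the caller "sleeps"
    ready = []
    if woken:
        for s in socks:                # re-scan every fd to find the ready ones
            work["polled"] += 1
            if s.readable:
                ready.append(s.fd)
    for s in socks:                    # cleanup: remove every wait entry
        s.wait_queue.remove(pollwake)
        work["removed"] += 1
    return ready, work


class Epoll:
    """Toy epoll instance with a persistent callback per registered socket."""

    def __init__(self):
        self.rdllist = []
        self.registered = 0

    def add(self, sock):               # epoll_ctl(ADD): done once, persists
        sock.wait_queue.append(self.callback)
        self.registered += 1

    def callback(self, sock):          # plays the role of ep_poll_callback
        if sock.fd not in self.rdllist:
            self.rdllist.append(sock.fd)

    def wait(self):
        """Return (ready fds, number of fds touched by this call)."""
        ready, self.rdllist = self.rdllist, []
        return ready, len(ready)


def compare(n, ready_fds):
    """Run both models with n fds, of which ready_fds become readable."""
    socks = [Sock(fd) for fd in range(n)]

    def deliver():
        for fd in ready_fds:
            socks[fd].data_ready()

    sel_ready, sel_work = select_once(socks, deliver)
    for s in socks:
        s.readable = False             # reset before the epoll run
    ep = Epoll()
    for s in socks:
        ep.add(s)
    deliver()
    ep_ready, ep_touched = ep.wait()
    return sel_ready, sel_work, ep_ready, ep_touched
```

Calling `compare(10000, [42])` reports 10,000 registrations, 10,000 polls, and 10,000 removals for the select model, while the epoll model touches a single fd in `wait()`. The 10,000 `add()` calls are a one-time cost that later waits do not repeat.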
Demo 3522 — select vs epoll Parallel Animation
Adjust N (number of monitored fds) and trigger one fd becoming ready. Watch select scan all N file descriptors while epoll completes in a single rdllist insertion.
§ 39.3 — 10K Connection Stress Visualization
At C10K scale — 10,000 simultaneous connections — the O(N) cost of select() becomes intolerable. Every wakeup, even for a single arriving packet, forces the kernel to call vfs_poll() on all 10,000 file descriptors. epoll() performs exactly k operations — one per ready fd — regardless of how many total connections are registered.
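The fd_set copy cost is also linear in N: with 10,000 connections, select copies a bitmap of at least 1,250 bytes (10,000 bits) per fd_set into the kernel on every call, regardless of how many fds are ready. epoll copies only k × 12 bytes of epoll_event structs back to userspace, where 12 bytes is the size of the packed struct on x86-64.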
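The copy-size arithmetic can be checked directly. The sketch below is a back-of-the-envelope calculation under stated assumptions: it counts a single fd_set, ignores any rounding the kernel applies to the bitmap length, and uses the x86-64 packed layout of `struct epoll_event` (a 32-bit event mask followed by a 64-bit data field). The helper names are hypothetical.

```python
import struct

# Packed u32 events + u64 data, as on x86-64; "<" disables alignment padding.
EPOLL_EVENT_BYTES = struct.calcsize("<IQ")


def fd_set_bytes(n_fds):
    """Bytes needed for a bitmap with one bit per fd (one fd_set only)."""
    return (n_fds + 7) // 8


def epoll_copy_bytes(k_ready):
    """Bytes of epoll_event structs copied out for k ready fds."""
    return k_ready * EPOLL_EVENT_BYTES


if __name__ == "__main__":
    print(fd_set_bytes(10_000), epoll_copy_bytes(1))
```

For N = 10,000 and k = 1 this gives 1,250 bytes copied in per fd_set for select, against 12 bytes copied out for epoll.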
Demo 3523 — 10K Connection Scaling
Adjust N (total connections) and K (ready fds per tick), then press Run to simulate 20 ticks. At N=10,000, K=1, the select bar fills the entire row while the epoll bar is a single pixel.
§ 39.4 — Complete Data Structure Map
Every kernel object touched during the NIC-to-read() journey exists in one of six subsystems. The diagram below shows how they chain together: a process fd table slot points into the VFS layer, which leads to a socket, whose wait queue head holds the eppoll_entry that bridges into the epoll instance. The NIC delivers data via DMA into an sk_buff, which travels up the protocol stack and lands in sk_receive_queue — the same sock that owns the wait queue.
| Struct | Layer | Key fields |
|---|---|---|
| task_struct | Process | files → files_struct |
| fdtable | fd table | fd[]: O(1) array lookup by fd number |
| struct file | VFS | f_op (socket_file_ops), private_data → socket |
| struct socket | POSIX layer | ops (inet_stream_ops), sk → sock, wq |
| struct sock | Protocol layer | sk_receive_queue, sk_wq, sk_data_ready |
| socket_wq | Wait | wait (wait_queue_head_t) |
| eppoll_entry | epoll bridge | wait.func=ep_poll_callback, base → epitem |
| epitem | epoll | fd, event, rdllink (list_head into rdllist) |
| eventpoll | epoll instance | rbr (RB-tree of epitems), rdllist, wq |
| sk_buff | Packet | head/data/tail/end (linear buffer), sk = sock |
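To make the chaining in the table concrete, here is a toy Python model of the epoll-side structs. It is only a sketch: the class and field names mirror the table, but the layout is greatly simplified, the `on_rdllist` flag and `fired` counter are illustrative additions, and `build` and `deliver` are hypothetical helpers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class EventPoll:
    rdllist: list = field(default_factory=list)   # ready epitems


@dataclass(eq=False)
class EpItem:
    fd: int
    ep: EventPoll
    on_rdllist: bool = False                      # models "rdllink is linked"


@dataclass(eq=False)
class EppollEntry:
    base: EpItem                                  # eppoll_entry.base -> epitem
    fired: int = 0                                # illustrative call counter

    def func(self):
        # Stand-in for wait.func = ep_poll_callback.
        self.fired += 1
        epi = self.base
        if not epi.on_rdllist:
            epi.on_rdllist = True
            epi.ep.rdllist.append(epi)


@dataclass(eq=False)
class Sock:
    receive_queue: list = field(default_factory=list)  # sk_receive_queue
    wait: list = field(default_factory=list)           # entries on sk_wq->wait

    def data_ready(self):                         # sk_data_ready stand-in
        for entry in self.wait:
            entry.func()


@dataclass(eq=False)
class SkBuff:
    payload: bytes
    sk: Optional[Sock] = None                     # sk_buff.sk -> sock


def build(n):
    """Register n toy sockets with one epoll instance (fd i -> socks[i])."""
    ep = EventPoll()
    socks, entries = [], []
    for fd in range(n):
        sock = Sock()
        entry = EppollEntry(base=EpItem(fd=fd, ep=ep))
        sock.wait.append(entry)
        socks.append(sock)
        entries.append(entry)
    return ep, socks, entries


def deliver(payload, sock):
    """Attach a packet to its sock, queue it, and signal readiness."""
    skb = SkBuff(payload)
    skb.sk = sock
    sock.receive_queue.append(skb)
    sock.data_ready()
    return skb
```

After `ep, socks, entries = build(10000)` and `deliver(b"x", socks[42])`, the ready list holds exactly one epitem (fd 42) and only one of the 10,000 callbacks has fired. No other socket's wait queue is visited.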
Two pointers tie the map together: eppoll_entry.base → epitem (installed once at epoll_ctl(ADD), lives until epoll_ctl(DEL) or fd close) and sk_buff.sk → sock (set by tcp_v4_rcv after the 4-tuple hash lookup). These two pointers are what allow a raw DMA buffer to wake a specific sleeping process in O(1) without scanning any fd list.
Interview Prep — Synthesis Questions
These questions require connecting multiple subsystems. A strong answer names the specific kernel function and data structure at each stage.
| Question | Answer |
|---|---|
| Every major kernel function from NIC to epoll_wait returning? | igb_intr → napi_schedule → igb_poll → netif_receive_skb → ip_rcv → tcp_v4_rcv → tcp_rcv_established → skb_queue_tail → sock_def_readable → ep_poll_callback → wake_up → try_to_wake_up → schedule → ep_send_events → epoll_wait returns |
| At what step is the process added to the CPU run queue? Which function? | Step 6: ep_poll_callback calls wake_up → try_to_wake_up → ttwu_queue |
| Where does sk_data_ready fit? Default implementation for TCP? | Called after skb enters sk_receive_queue. Default: sock_def_readable() → wake_up on sk_wq->wait |
| Why O(N) for select but O(k) for epoll at wakeup? | Both share the same sk_data_ready → wake_up trigger. select's pollwake wakes the process, which then re-scans all N fds. epoll's ep_poll_callback inserts 1 epitem into rdllist, so only ready fds are touched. |
| N=10000 epoll-registered fds, 1 packet: how many wait queue accesses? | 1 traversal — only the eppoll_entry for the ready socket's callback fires. The other 9,999 eppoll_entries are not touched. |
| Wait queue entry lifetime: select vs epoll? | select: N entries created each call, removed on return. epoll: eppoll_entries installed once at epoll_ctl(ADD), live until epoll_ctl(DEL) or fd close. |
| Why does ep_poll_callback run in softirq context? What constraints? | sk_data_ready is called from tcp_rcv_established, which runs in NAPI softirq context. Constraints: it cannot sleep, must use spinlocks rather than mutexes, and must complete quickly. |