§ 33 — select / poll Internals: 5 I/O Models · fd_set · do_select() · Kernel Wait Queues · O(n) Cost
Full packet path: NIC DMA → MSI-X interrupt → softirq NAPI → tcp_v4_rcv() → sk_receive_queue → sk_data_ready() → wait queue wakeup → scheduler → do_select() re-scan → userspace (§33.4) · Five Stevens I/O models compared (§33.1) · select() fd_set bitmap and do_select() loop (§33.2) · poll() pollfd array and do_poll() (§33.3) · FD_SETSIZE limit and O(n) cost analysis (§33.5) · interview-critical: why select/poll fail at C10K, and how epoll fixes it
1. Full Packet Path — NIC to Application (7 Layers of Kernel)
Before diving into select() mechanics, you must understand the complete path a packet takes from the NIC to your application. This is what actually triggers select() to return — and why it matters for performance.
Interactive: Step Through the Packet Journey
Click through each step to see exactly what happens in the kernel when a packet arrives and a process sleeping in select() wakes up.
Interrupt → SoftIRQ: Why Two Levels?
| Level | Context | Duration | What it does |
|---|---|---|---|
| Hardware IRQ | irq_context; preemption disabled | < 5 µs | napi_schedule() — mark NAPI for softirq; disable HW interrupt; return ASAP |
| SoftIRQ / ksoftirqd | process context; preemption enabled | up to 2 ms budget | net_rx_action() — consume NIC ring; run protocol stack; call sk_data_ready() |
| sk_data_ready() | still softirq context | < 1 µs | wake_up_interruptible_all() — set process TASK_RUNNING; add to run queue |
| Scheduler | next tick or IPI | context switch ~1–10 µs | context-switch to woken process; resume do_select() loop |
How the Socket is Found: inet_lookup_established()
When tcp_v4_rcv() receives an IP packet, it needs to find the right struct sock to deliver it to. This is a hash table lookup — not a linear scan.
| Step | Operation | Data structure |
|---|---|---|
| Extract 4-tuple | src_ip, dst_ip, src_port, dst_port from IP+TCP headers | struct sk_buff (skb->nh.iph, skb->h.th) |
| Hash 4-tuple | jhash({src_ip, dst_ip, src_port, dst_port}) | tcp_hashinfo.ehash[] — established hash table |
| Chain walk | Walk chain at hash bucket; compare full 4-tuple | struct inet_hashbucket: spinlock + hlist_head |
| Found sock | sock refcount++; return struct sock * | struct sock: sk_rcv_saddr, sk_daddr, sk_num, sk_dport |
2. § 33.1 — Five Stevens I/O Models
Richard Stevens (UNIX Network Programming §6.2) defined 5 I/O models. The critical difference is when the application blocks and who does the data copy. Only async I/O avoids both blocking and doing the copy.
| Model | Blocks in phase 1? | Blocks in phase 2? | Who copies data? | Thread model |
|---|---|---|---|---|
| ① Blocking I/O | YES — read() blocks until data | YES — copy in read() | Kernel copies in read() | 1 thread per connection |
| ② Non-blocking I/O | NO — EAGAIN loop | YES — copy in read() | Kernel copies in read() | 1 thread, but CPU spins |
| ③ Multiplexing (select/epoll) | YES — blocks in select() | NO — read() never blocks (fd is ready) | Kernel copies in read() | 1 thread, N connections |
| ④ Signal-driven (SIGIO) | NO — does other work | NO — read() in handler | Kernel copies in read() | Tricky; rarely used |
| ⑤ Async (io_uring) | NO | NO — kernel copies into app buffer | Kernel copies autonomously | 1 thread, 0 blocking calls |
Multiplexing vs Async: The Critical Distinction
I/O multiplexing (select/poll/epoll) is not async. It is readiness notification — the kernel tells you which fd is ready, then you call read() to fetch the data. Two syscalls, two phases. Async I/O (io_uring) submits the read and the kernel delivers the data directly — one operation, zero blocking.
Interactive: I/O Model Comparison
3. § 33.2 — select() Deep Dive: fd_set Bitmap & do_select()
The fd_set Bitmap
fd_set is a fixed-size bitmap of 1024 bits (128 bytes). Each bit represents one file descriptor. Bit n is set if fd n should be monitored.
nfds=10000 to select() — glibc's fd_set only has space for 1024 bits. Even if you could, the bitmap copy overhead would be 10000/8 = 1250 bytes per call. This is one reason select() cannot scale to C10K.Interactive: fd_set Bitmap Visualizer
Click fds to toggle them into the monitored set. Set which fd is "ready" and run the scan to see exactly how do_select() processes the bitmap — every bit, every fd.
do_select() Kernel Flow
The actual kernel implementation in fs/select.c:do_select() does exactly what you expect: iterate every fd bit, call vfs_poll() for each set bit, and sleep if nothing is ready.
The poll_table and __pollwait
The poll_table passed to vfs_poll() has a function pointer _qproc = __pollwait. Each socket's tcp_poll() calls sock_poll_wait(file, sock, poll_table) which calls __pollwait() to add the process to the socket's wait queue. This is how the process gets notified when data arrives.
| Function | Called from | What it does |
|---|---|---|
| do_select() | sys_select() | Outer scan loop; manages fd_set bitmaps; calls schedule_timeout() |
| vfs_poll(file, &table) | do_select() per fd | Calls file→f_op→poll() — dispatches to socket's tcp_poll() |
| tcp_poll(file, sock, table) | vfs_poll() | Checks sk_receive_queue; calls poll_wait() to register wait entry |
| poll_wait(file, wqh, table) | tcp_poll() | Calls table→_qproc = __pollwait() |
| __pollwait(file, wqh, table) | poll_wait() | Allocates poll_table_entry; adds to socket's sk_wq wait queue |
| sk_data_ready = sock_def_readable() | tcp_data_queue() | wake_up_interruptible_all(&sk->sk_wq->wait) — wakes all waiters |
Interactive: select() O(n) Scan Simulator
4. § 33.3 — poll() Deep Dive: pollfd Array & do_poll()
The pollfd Struct
poll() replaces the bitmap with an explicit array of struct pollfd. Each entry names an fd and the events to watch. The kernel fills revents on return.
do_poll() Flow
poll() Advantages Over select()
| Aspect | select() | poll() |
|---|---|---|
| Max fd | 1023 (FD_SETSIZE hardcoded) | Unlimited — array size only |
| Input destroyed? | YES — must reinit every call | NO — events field unchanged; only revents cleared |
| Memory per 1000 fds | 128 bytes (fixed fd_set) | 8 KB (1000 × 8 bytes pollfd) |
| POLLRDHUP (half-close) | NO | YES — detect TCP FIN without read() |
| Negative fd skip | Cannot — scans all up to nfds | Set fd=-1 to skip an entry |
| O(n) scan? | YES | YES — same kernel poll loop |
| Copy overhead | 128 bytes in + 128 bytes out | 8×n bytes in + 8×n bytes out |
| Timeout precision | microseconds (struct timeval) | milliseconds (int) |
Interactive: poll() vs select() Comparison
5. § 33.4 — Kernel Wait Queues: The Blocking/Wakeup Engine
This is the most important section for understanding IO multiplexing at the kernel level. The wait queue is the mechanism by which a process gives up the CPU when nothing is ready, and gets it back when data arrives. This is how select() can monitor thousands of fds efficiently — it does not poll in userspace; the kernel wakes it up.
Wait Queue Data Structures
| Struct | Where it lives | Purpose |
|---|---|---|
| wait_queue_head_t sk_wq.wait | struct sock (per socket) | Head of the socket's wait queue. Spinlock + doubly-linked list of waiters. |
| wait_queue_entry_t | Kernel stack (poll_table_entry) | One entry per (process, fd) pair. Contains task pointer + wakeup function. |
| entry→func | = __pollwait_key for select/poll | Called when wakeup fires. Sets TASK_RUNNING on the sleeping process. |
| entry→private | = struct task_struct * | The sleeping process. try_to_wake_up() is called on this. |
| poll_table_entry.wait_address | = &sk→sk_wq.wait | Back-pointer to the wait queue head this entry is registered in. |
Full Wakeup Sequence (from Interrupt to do_select() returning)
How a Process Goes from SLEEPING to RUNNING
The state transition happens in try_to_wake_up() in kernel/sched/core.c:
Interactive: Thundering Herd — Wait Queue Simulator
Simulate three processes all sleeping in select() on the same socket. When data arrives, watch how all three wake up and only one gets the data.
The Thundering Herd Problem
When multiple processes call select() on the same socket, all of them register a wait_queue_entry in the socket's wait queue. When data arrives, wake_up_interruptible_all() wakes all of them. Each wastes a context switch and O(n) fd scan, but only one gets the data.
epoll solves this with EPOLLEXCLUSIVE (Linux 4.5): marks the wait queue entry as exclusive (WQ_FLAG_EXCLUSIVE), so only one process is woken per event. SO_REUSEPORT is an alternative: each process gets its own socket; the kernel load-balances new connections.
Wait Queue Cleanup on Wakeup
| Event | What happens to wait entries |
|---|---|
| fd becomes ready | do_select() re-polls; poll_freewait() removes entries from ALL wait queues the process registered in |
| Timeout expires | schedule_timeout() returns -ETIME; poll_freewait() cleans up all wait entries |
| Signal received | TASK_INTERRUPTIBLE woken by signal; do_select() returns -EINTR; poll_freewait() cleans up |
| select() returns normally | poll_freewait(&table) walks all poll_table_entry structs; list_del(&entry->wait.entry) removes from each wait_queue_head |
6. § 33.5 — FD_SETSIZE Limit & O(n) Cost Analysis
Why FD_SETSIZE = 1024 Is a Hard Limit
FD_SETSIZE is defined in <sys/select.h> as 1024. The fd_set struct is statically sized: 128 bytes (1024 bits). You cannot pass nfds=2000 to select() without allocating a custom oversized fd_set — and even then, glibc macros will overflow.
Per-Call Cost Breakdown at N=10,000 Connections
| Cost component | select() per call | poll() per call | epoll_wait() per call |
|---|---|---|---|
| fd_set / pollfd copy into kernel | 128 bytes (fixed) | 8×N = 80KB | 0 bytes (state in kernel) |
| Scan / vfs_poll calls | O(n) — all nfds bits | O(n) — all N pollfds | O(k) — only ready fds |
| Result copy to userspace | 128 bytes (fixed) | 8×N = 80KB | 16×k bytes (only ready) |
| Setup for next call | FD_ZERO + FD_SET N times | No setup needed | No setup needed |
| FD limit | 1023 hard | Unlimited | Unlimited |
| Typical syscalls per event | 1 select + 1 read = 2 | 1 poll + 1 read = 2 | 1 epoll_wait + 1 read = 2 |
The C10K Problem
Dan Kegel's 1999 “C10K problem” identified that with 10,000 concurrent connections and 100 events/second:
- Blocking model: 10,000 threads × ~8 MB stack = 80 GB RAM. OS limit ~few thousand threads.
- select(): 10,000 fd scans × 100 events/sec = 1,000,000
vfs_poll()calls/sec. 99.99% wasted. Plus fd_set copy overhead. Plus hit FD_SETSIZE=1024 anyway. - poll(): Same O(n) scan; no fd limit. But 10K × 8 bytes = 80KB copied kernel↔user per call × 100/sec = 8MB/sec of pointless kernel↔user copies.
- epoll: Only ready fds are ever touched. Cost is O(k) where k is the number of active connections. 10,000 idle connections cost exactly 0. This is the fix.
| Scenario (10K connections, 100 events/s) | vfs_poll calls/sec | Waste |
|---|---|---|
| select(nfds=10000) | 10000 × 100 = 1,000,000 | 99.99% |
| poll(fds, 10000) | 10000 × 100 = 1,000,000 | 99.99% |
| epoll_wait() (100 ready) | 100 × 100 = 10,000 | 0% |
When select() / poll() Are Acceptable
- N < 50–100 fds: O(n) cost is negligible; code simplicity wins
- Short-lived connections: setup overhead matters less than at steady-state
- Portability: select() works everywhere; epoll is Linux-only
- stdin + a few network fds: typical CLI tool use case
- But: never use select() if any fd might be > 1023 (file descriptors from daemon code commonly exceed this)
7. Interview Prep — Questions This Page Answers
| # | Question | Key answer |
|---|---|---|
| Q1 | What happens between a NIC receiving a packet and select() returning? | NIC DMA → MSI-X IRQ → napi_schedule → softirq net_rx_action → tcp_v4_rcv → sk_data_ready → wake_up_interruptible_all → try_to_wake_up → do_select re-scan |
| Q2 | What is FD_SETSIZE? Why can't you do select(nfds=10000)? | fd_set is 128 bytes (1024 bits) hardcoded. nfds > 1024 causes undefined behavior in FD_SET macros. poll() has no such limit. |
| Q3 | Why is select() O(n)? | do_select() calls vfs_poll() for every fd in the input bitmap on every call, even if only 1 is ready. All n fds are scanned per wakeup. |
| Q4 | How does a process sleeping in select() get woken up? | __pollwait adds process to socket's wait queue. sk_data_ready → wake_up_interruptible_all → try_to_wake_up sets TASK_RUNNING and adds to run queue. |
| Q5 | What is the difference between select() and poll()? | Both O(n). poll() has no FD_SETSIZE limit, doesn't destroy input, supports POLLRDHUP. select() uses bitmap; poll() uses pollfd array (8B/fd). |
| Q6 | What is the thundering herd problem with select()? | Multiple processes watching the same fd all wake up on one event. Only one gets the data; others waste a context switch + O(n) scan. |
| Q7 | What is the difference between I/O multiplexing and async I/O? | Multiplexing: kernel notifies READINESS; app still calls read() to get data (2 phases). Async: kernel performs the I/O and delivers data to app buffer directly (1 phase, 0 blocking). |
| Q8 | How does select() know which fd in the set is the one that became ready? | On wakeup, do_select() re-scans the ENTIRE fd_set calling vfs_poll() for each fd. The one(s) returning POLLIN are set in the result bitmap. No shortcut — O(n) every time. |