Part XXV — I/O Multiplexing

§ 33 — select / poll Internals: 5 I/O Models · fd_set · do_select() · Kernel Wait Queues · O(n) Cost

Full packet path: NIC DMA → MSI-X interrupt → softirq NAPI → tcp_v4_rcv() → sk_receive_queue → sk_data_ready() → wait queue wakeup → scheduler → do_select() re-scan → userspace (§33.4) · Five Stevens I/O models compared (§33.1) · select() fd_set bitmap and do_select() loop (§33.2) · poll() pollfd array and do_poll() (§33.3) · FD_SETSIZE limit and O(n) cost analysis (§33.5) · interview-critical: why select/poll fail at C10K, and how epoll fixes it

1. Full Packet Path — NIC to Application (7 Layers of Kernel)

Before diving into select() mechanics, you must understand the complete path a packet takes from the NIC to your application. This is what actually triggers select() to return — and why it matters for performance.

Interactive: Step Through the Packet Journey

Click through each step to see exactly what happens in the kernel when a packet arrives and a process sleeping in select() wakes up.

① NIC HardwarePacket arrives on wire → DMA → MSI-X interrupt
What happens (step by step)
  1. Ethernet frame arrives at NIC physical port
  2. NIC descriptor ring: NIC reads next free DMA descriptor
  3. NIC DMA engine copies frame bytes directly into RAM (sk_buff data area) — no CPU involved
  4. NIC marks descriptor as "used" and updates ring tail pointer
  5. NIC fires MSI-X interrupt to a specific CPU core (RSS: fd hash selects queue/CPU)
Key data structures
NIC ring descriptor (dma_addr, len, flags)
sk_buff: skb→data points to DMA'd frame
Kernel code path
driver: napi_gro_receive(napi, skb) — schedule NAPI softirq
KEY INSIGHT
DMA bypass: frame lands in RAM without CPU copying a single byte. Interrupt only fires after DMA is done.
Step 1 / 9

Interrupt → SoftIRQ: Why Two Levels?

LevelContextDurationWhat it does
Hardware IRQirq_context; preemption disabled< 5 µsnapi_schedule() — mark NAPI for softirq; disable HW interrupt; return ASAP
SoftIRQ / ksoftirqdprocess context; preemption enabledup to 2 ms budgetnet_rx_action() — consume NIC ring; run protocol stack; call sk_data_ready()
sk_data_ready()still softirq context< 1 µswake_up_interruptible_all() — set process TASK_RUNNING; add to run queue
Schedulernext tick or IPIcontext switch ~1–10 µscontext-switch to woken process; resume do_select() loop

How the Socket is Found: inet_lookup_established()

When tcp_v4_rcv() receives an IP packet, it needs to find the right struct sock to deliver it to. This is a hash table lookup — not a linear scan.

StepOperationData structure
Extract 4-tuplesrc_ip, dst_ip, src_port, dst_port from IP+TCP headersstruct sk_buff (skb->nh.iph, skb->h.th)
Hash 4-tuplejhash({src_ip, dst_ip, src_port, dst_port})tcp_hashinfo.ehash[] — established hash table
Chain walkWalk chain at hash bucket; compare full 4-tuplestruct inet_hashbucket: spinlock + hlist_head
Found socksock refcount++; return struct sock *struct sock: sk_rcv_saddr, sk_daddr, sk_num, sk_dport

2. § 33.1 — Five Stevens I/O Models

Richard Stevens (UNIX Network Programming §6.2) defined 5 I/O models. The critical difference is when the application blocks and who does the data copy. Only async I/O avoids both blocking and doing the copy.

ModelBlocks in phase 1?Blocks in phase 2?Who copies data?Thread model
① Blocking I/OYES — read() blocks until dataYES — copy in read()Kernel copies in read()1 thread per connection
② Non-blocking I/ONO — EAGAIN loopYES — copy in read()Kernel copies in read()1 thread, but CPU spins
③ Multiplexing (select/epoll)YES — blocks in select()NO — read() never blocks (fd is ready)Kernel copies in read()1 thread, N connections
④ Signal-driven (SIGIO)NO — does other workNO — read() in handlerKernel copies in read()Tricky; rarely used
⑤ Async (io_uring)NONO — kernel copies into app bufferKernel copies autonomously1 thread, 0 blocking calls

Multiplexing vs Async: The Critical Distinction

I/O multiplexing (select/poll/epoll) is not async. It is readiness notification — the kernel tells you which fd is ready, then you call read() to fetch the data. Two syscalls, two phases. Async I/O (io_uring) submits the read and the kernel delivers the data directly — one operation, zero blocking.

Interactive: I/O Model Comparison

5 Stevens I/O Models — timing, syscall count, thread model — C Demo
stdin (optional)

3. § 33.2 — select() Deep Dive: fd_set Bitmap & do_select()

The fd_set Bitmap

fd_set is a fixed-size bitmap of 1024 bits (128 bytes). Each bit represents one file descriptor. Bit n is set if fd n should be monitored.

FD_SETSIZE = 1024 hard limit:You cannot pass nfds=10000 to select() — glibc's fd_set only has space for 1024 bits. Even if you could, the bitmap copy overhead would be 10000/8 = 1250 bytes per call. This is one reason select() cannot scale to C10K.

Interactive: fd_set Bitmap Visualizer

Click fds to toggle them into the monitored set. Set which fd is "ready" and run the scan to see exactly how do_select() processes the bitmap — every bit, every fd.

fd_set Bitmap — 1024 bits (click fds to toggle monitoring)
Monitored (bit=1)Ready (fd=8)Scanned + not ready (wasted)Currently scanning
nfds = 64 (highest fd + 1)

do_select() Kernel Flow

The actual kernel implementation in fs/select.c:do_select() does exactly what you expect: iterate every fd bit, call vfs_poll() for each set bit, and sleep if nothing is ready.

The poll_table and __pollwait

The poll_table passed to vfs_poll() has a function pointer _qproc = __pollwait. Each socket's tcp_poll() calls sock_poll_wait(file, sock, poll_table) which calls __pollwait() to add the process to the socket's wait queue. This is how the process gets notified when data arrives.

FunctionCalled fromWhat it does
do_select()sys_select()Outer scan loop; manages fd_set bitmaps; calls schedule_timeout()
vfs_poll(file, &table)do_select() per fdCalls file→f_op→poll() — dispatches to socket's tcp_poll()
tcp_poll(file, sock, table)vfs_poll()Checks sk_receive_queue; calls poll_wait() to register wait entry
poll_wait(file, wqh, table)tcp_poll()Calls table→_qproc = __pollwait()
__pollwait(file, wqh, table)poll_wait()Allocates poll_table_entry; adds to socket's sk_wq wait queue
sk_data_ready = sock_def_readable()tcp_data_queue()wake_up_interruptible_all(&sk->sk_wq->wait) — wakes all waiters

Interactive: select() O(n) Scan Simulator

do_select() scan — O(n) fd polling, FD_SETSIZE limit, copy overhead — C Demo
stdin (optional)

4. § 33.3 — poll() Deep Dive: pollfd Array & do_poll()

The pollfd Struct

poll() replaces the bitmap with an explicit array of struct pollfd. Each entry names an fd and the events to watch. The kernel fills revents on return.

struct pollfd {
int fd; /* 4 bytes — file descriptor to monitor */
short events; /* 2 bytes — POLLIN | POLLOUT | POLLRDHUP | POLLERR | POLLHUP */
short revents; /* 2 bytes — filled by kernel; not modified by caller */
}; /* total: 8 bytes per fd */

do_poll() Flow

poll() Advantages Over select()

Aspectselect()poll()
Max fd1023 (FD_SETSIZE hardcoded)Unlimited — array size only
Input destroyed?YES — must reinit every callNO — events field unchanged; only revents cleared
Memory per 1000 fds128 bytes (fixed fd_set)8 KB (1000 × 8 bytes pollfd)
POLLRDHUP (half-close)NOYES — detect TCP FIN without read()
Negative fd skipCannot — scans all up to nfdsSet fd=-1 to skip an entry
O(n) scan?YESYES — same kernel poll loop
Copy overhead128 bytes in + 128 bytes out8×n bytes in + 8×n bytes out
Timeout precisionmicroseconds (struct timeval)milliseconds (int)

Interactive: poll() vs select() Comparison

poll() vs select() — pollfd struct, memory cost, O(n) comparison — C Demo
stdin (optional)

5. § 33.4 — Kernel Wait Queues: The Blocking/Wakeup Engine

This is the most important section for understanding IO multiplexing at the kernel level. The wait queue is the mechanism by which a process gives up the CPU when nothing is ready, and gets it back when data arrives. This is how select() can monitor thousands of fds efficiently — it does not poll in userspace; the kernel wakes it up.

Wait Queue Data Structures

StructWhere it livesPurpose
wait_queue_head_t sk_wq.waitstruct sock (per socket)Head of the socket's wait queue. Spinlock + doubly-linked list of waiters.
wait_queue_entry_tKernel stack (poll_table_entry)One entry per (process, fd) pair. Contains task pointer + wakeup function.
entry→func= __pollwait_key for select/pollCalled when wakeup fires. Sets TASK_RUNNING on the sleeping process.
entry→private= struct task_struct *The sleeping process. try_to_wake_up() is called on this.
poll_table_entry.wait_address= &sk→sk_wq.waitBack-pointer to the wait queue head this entry is registered in.

Full Wakeup Sequence (from Interrupt to do_select() returning)

How a Process Goes from SLEEPING to RUNNING

The state transition happens in try_to_wake_up() in kernel/sched/core.c:

/* Simplified do_select() sleep path */
schedule_timeout(timeout_jiffies):
set_current_state(TASK_INTERRUPTIBLE);
schedule(); /* CPU released; this call doesn't return until woken */
/* Meanwhile, in softirq context (sk_data_ready): */
wake_up_interruptible_all(&sk->sk_wq->wait):
/* for each entry in wait list: */
try_to_wake_up(task, TASK_INTERRUPTIBLE, 0):
task->state = TASK_RUNNING;
enqueue_task(rq, task); /* add to CFS run queue */
/* Back in do_select() — schedule() returns: */
/* re-scan all n fds in fd_set loop */

Interactive: Thundering Herd — Wait Queue Simulator

Simulate three processes all sleeping in select() on the same socket. When data arrives, watch how all three wake up and only one gets the data.

Wait Queue: Thundering Herd SimulatorStep 1 / 6
Event: Initial state
All three processes called select() and included fd=8 (a socket). All three are sleeping (TASK_INTERRUPTIBLE) in the socket's wait queue.
wait_queue_head_t sk→sk_wq.wait(socket fd=8 wait queue)
Process A (select: fd 5,8,12)
wait_queue_entry_t → private=A
TASK_INTERRUPTIBLE (sleeping)
Process B (select: fd 3,8,20)
wait_queue_entry_t → private=B
TASK_INTERRUPTIBLE (sleeping)
Process C (select: fd 8,15)
wait_queue_entry_t → private=C
TASK_INTERRUPTIBLE (sleeping)

The Thundering Herd Problem

When multiple processes call select() on the same socket, all of them register a wait_queue_entry in the socket's wait queue. When data arrives, wake_up_interruptible_all() wakes all of them. Each wastes a context switch and O(n) fd scan, but only one gets the data.

epoll solves this with EPOLLEXCLUSIVE (Linux 4.5): marks the wait queue entry as exclusive (WQ_FLAG_EXCLUSIVE), so only one process is woken per event. SO_REUSEPORT is an alternative: each process gets its own socket; the kernel load-balances new connections.

Wait Queue Cleanup on Wakeup

EventWhat happens to wait entries
fd becomes readydo_select() re-polls; poll_freewait() removes entries from ALL wait queues the process registered in
Timeout expiresschedule_timeout() returns -ETIME; poll_freewait() cleans up all wait entries
Signal receivedTASK_INTERRUPTIBLE woken by signal; do_select() returns -EINTR; poll_freewait() cleans up
select() returns normallypoll_freewait(&table) walks all poll_table_entry structs; list_del(&entry->wait.entry) removes from each wait_queue_head

6. § 33.5 — FD_SETSIZE Limit & O(n) Cost Analysis

Why FD_SETSIZE = 1024 Is a Hard Limit

FD_SETSIZE is defined in <sys/select.h> as 1024. The fd_set struct is statically sized: 128 bytes (1024 bits). You cannot pass nfds=2000 to select() without allocating a custom oversized fd_set — and even then, glibc macros will overflow.

/* glibc bits/typesizes.h */
#define FD_SETSIZE 1024 /* can't increase without recompiling glibc */
/* fd_set is a fixed array */
typedef struct {
unsigned long __fds_bits[FD_SETSIZE / (8 * sizeof(long))]; /* 16 uint64_t = 128 bytes */
} fd_set;
/* FD_SET(fd, set) — UNDEFINED BEHAVIOR if fd >= FD_SETSIZE */
#define FD_SET(fd, set) (set).__fds_bits[(fd)/64] |= (1UL << ((fd) % 64))

Per-Call Cost Breakdown at N=10,000 Connections

Cost componentselect() per callpoll() per callepoll_wait() per call
fd_set / pollfd copy into kernel128 bytes (fixed)8×N = 80KB0 bytes (state in kernel)
Scan / vfs_poll callsO(n) — all nfds bitsO(n) — all N pollfdsO(k) — only ready fds
Result copy to userspace128 bytes (fixed)8×N = 80KB16×k bytes (only ready)
Setup for next callFD_ZERO + FD_SET N timesNo setup neededNo setup needed
FD limit1023 hardUnlimitedUnlimited
Typical syscalls per event1 select + 1 read = 21 poll + 1 read = 21 epoll_wait + 1 read = 2

The C10K Problem

Dan Kegel's 1999 “C10K problem” identified that with 10,000 concurrent connections and 100 events/second:

  • Blocking model: 10,000 threads × ~8 MB stack = 80 GB RAM. OS limit ~few thousand threads.
  • select(): 10,000 fd scans × 100 events/sec = 1,000,000 vfs_poll() calls/sec. 99.99% wasted. Plus fd_set copy overhead. Plus hit FD_SETSIZE=1024 anyway.
  • poll(): Same O(n) scan; no fd limit. But 10K × 8 bytes = 80KB copied kernel↔user per call × 100/sec = 8MB/sec of pointless kernel↔user copies.
  • epoll: Only ready fds are ever touched. Cost is O(k) where k is the number of active connections. 10,000 idle connections cost exactly 0. This is the fix.
Scenario (10K connections, 100 events/s)vfs_poll calls/secWaste
select(nfds=10000)10000 × 100 = 1,000,00099.99%
poll(fds, 10000)10000 × 100 = 1,000,00099.99%
epoll_wait() (100 ready)100 × 100 = 10,0000%

When select() / poll() Are Acceptable

  • N < 50–100 fds: O(n) cost is negligible; code simplicity wins
  • Short-lived connections: setup overhead matters less than at steady-state
  • Portability: select() works everywhere; epoll is Linux-only
  • stdin + a few network fds: typical CLI tool use case
  • But: never use select() if any fd might be > 1023 (file descriptors from daemon code commonly exceed this)

7. Interview Prep — Questions This Page Answers

#QuestionKey answer
Q1What happens between a NIC receiving a packet and select() returning?NIC DMA → MSI-X IRQ → napi_schedule → softirq net_rx_action → tcp_v4_rcv → sk_data_ready → wake_up_interruptible_all → try_to_wake_up → do_select re-scan
Q2What is FD_SETSIZE? Why can't you do select(nfds=10000)?fd_set is 128 bytes (1024 bits) hardcoded. nfds > 1024 causes undefined behavior in FD_SET macros. poll() has no such limit.
Q3Why is select() O(n)?do_select() calls vfs_poll() for every fd in the input bitmap on every call, even if only 1 is ready. All n fds are scanned per wakeup.
Q4How does a process sleeping in select() get woken up?__pollwait adds process to socket's wait queue. sk_data_ready → wake_up_interruptible_all → try_to_wake_up sets TASK_RUNNING and adds to run queue.
Q5What is the difference between select() and poll()?Both O(n). poll() has no FD_SETSIZE limit, doesn't destroy input, supports POLLRDHUP. select() uses bitmap; poll() uses pollfd array (8B/fd).
Q6What is the thundering herd problem with select()?Multiple processes watching the same fd all wake up on one event. Only one gets the data; others waste a context switch + O(n) scan.
Q7What is the difference between I/O multiplexing and async I/O?Multiplexing: kernel notifies READINESS; app still calls read() to get data (2 phases). Async: kernel performs the I/O and delivers data to app buffer directly (1 phase, 0 blocking).
Q8How does select() know which fd in the set is the one that became ready?On wakeup, do_select() re-scans the ENTIRE fd_set calling vfs_poll() for each fd. The one(s) returning POLLIN are set in the result bitmap. No shortcut — O(n) every time.