Part XXV — I/O Multiplexing

§ 33 — select / poll Internals: 5 I/O Models · fd_set · do_select() · Kernel Wait Queues · O(n) Cost

Full packet path: NIC DMA → MSI-X interrupt → softirq NAPI → tcp_v4_rcv() → sk_receive_queue → sk_data_ready() → wait queue wakeup → scheduler → do_select() re-scan → userspace (§33.4) · Five Stevens I/O models compared (§33.1) · select() fd_set bitmap and do_select() loop (§33.2) · poll() pollfd array and do_poll() (§33.3) · FD_SETSIZE limit and O(n) cost analysis (§33.5) · interview-critical: why select/poll fail at C10K, and how epoll fixes it

1. Full Packet Path — NIC to Application (7 Layers of Kernel)

Before diving into select() mechanics, you must understand the complete path a packet takes from the NIC to your application. This is what actually triggers select() to return — and why it matters for performance.

Interactive: Step Through the Packet Journey

Click through each step to see exactly what happens in the kernel when a packet arrives and a process sleeping in select() wakes up.

① NIC HardwarePacket arrives on wire → DMA → MSI-X interrupt

What happens (step by step)

Ethernet frame arrives at NIC physical port
NIC descriptor ring: NIC reads next free DMA descriptor
NIC DMA engine copies frame bytes directly into RAM (sk_buff data area) — no CPU involved
NIC marks descriptor as "used" and updates ring tail pointer
NIC fires MSI-X interrupt to a specific CPU core (RSS: fd hash selects queue/CPU)

Key data structures

NIC ring descriptor (dma_addr, len, flags)

sk_buff: skb→data points to DMA'd frame

Kernel code path

driver: napi_gro_receive(napi, skb) — schedule NAPI softirq

KEY INSIGHT

DMA bypass: frame lands in RAM without CPU copying a single byte. Interrupt only fires after DMA is done.

Step 1 / 9

Interrupt → SoftIRQ: Why Two Levels?

Level	Context	Duration	What it does
Hardware IRQ	irq_context; preemption disabled	< 5 µs	napi_schedule() — mark NAPI for softirq; disable HW interrupt; return ASAP
SoftIRQ / ksoftirqd	process context; preemption enabled	up to 2 ms budget	net_rx_action() — consume NIC ring; run protocol stack; call sk_data_ready()
sk_data_ready()	still softirq context	< 1 µs	wake_up_interruptible_all() — set process TASK_RUNNING; add to run queue
Scheduler	next tick or IPI	context switch ~1–10 µs	context-switch to woken process; resume do_select() loop

How the Socket is Found: inet_lookup_established()

When tcp_v4_rcv() receives an IP packet, it needs to find the right struct sock to deliver it to. This is a hash table lookup — not a linear scan.

Step	Operation	Data structure
Extract 4-tuple	src_ip, dst_ip, src_port, dst_port from IP+TCP headers	struct sk_buff (skb->nh.iph, skb->h.th)
Hash 4-tuple	jhash({src_ip, dst_ip, src_port, dst_port})	tcp_hashinfo.ehash[] — established hash table
Chain walk	Walk chain at hash bucket; compare full 4-tuple	struct inet_hashbucket: spinlock + hlist_head
Found sock	sock refcount++; return struct sock *	struct sock: sk_rcv_saddr, sk_daddr, sk_num, sk_dport

2. § 33.1 — Five Stevens I/O Models

Richard Stevens (UNIX Network Programming §6.2) defined 5 I/O models. The critical difference is when the application blocks and who does the data copy. Only async I/O avoids both blocking and doing the copy.

Model	Blocks in phase 1?	Blocks in phase 2?	Who copies data?	Thread model
① Blocking I/O	YES — read() blocks until data	YES — copy in read()	Kernel copies in read()	1 thread per connection
② Non-blocking I/O	NO — EAGAIN loop	YES — copy in read()	Kernel copies in read()	1 thread, but CPU spins
③ Multiplexing (select/epoll)	YES — blocks in select()	NO — read() never blocks (fd is ready)	Kernel copies in read()	1 thread, N connections
④ Signal-driven (SIGIO)	NO — does other work	NO — read() in handler	Kernel copies in read()	Tricky; rarely used
⑤ Async (io_uring)	NO	NO — kernel copies into app buffer	Kernel copies autonomously	1 thread, 0 blocking calls

Multiplexing vs Async: The Critical Distinction

I/O multiplexing (select/poll/epoll) is not async. It is readiness notification — the kernel tells you which fd is ready, then you call read() to fetch the data. Two syscalls, two phases. Async I/O (io_uring) submits the read and the kernel delivers the data directly — one operation, zero blocking.

Interactive: I/O Model Comparison

5 Stevens I/O Models — timing, syscall count, thread model — C Demo

stdin (optional)

3. § 33.2 — select() Deep Dive: fd_set Bitmap & do_select()

The fd_set Bitmap

fd_set is a fixed-size bitmap of 1024 bits (128 bytes). Each bit represents one file descriptor. Bit n is set if fd n should be monitored.

FD_SETSIZE = 1024 hard limit:You cannot pass nfds=10000 to select() — glibc's fd_set only has space for 1024 bits. Even if you could, the bitmap copy overhead would be 10000/8 = 1250 bytes per call. This is one reason select() cannot scale to C10K.

Interactive: fd_set Bitmap Visualizer

Click fds to toggle them into the monitored set. Set which fd is "ready" and run the scan to see exactly how do_select() processes the bitmap — every bit, every fd.

fd_set Bitmap — 1024 bits (click fds to toggle monitoring)

Monitored (bit=1)Ready (fd=8)Scanned + not ready (wasted)Currently scanning

Ready fd:nfds = 64 (highest fd + 1)

do_select() Kernel Flow

The actual kernel implementation in fs/select.c:do_select() does exactly what you expect: iterate every fd bit, call vfs_poll() for each set bit, and sleep if nothing is ready.

The poll_table and __pollwait

The poll_table passed to vfs_poll() has a function pointer _qproc = __pollwait. Each socket's tcp_poll() calls sock_poll_wait(file, sock, poll_table) which calls __pollwait() to add the process to the socket's wait queue. This is how the process gets notified when data arrives.

Function	Called from	What it does
do_select()	sys_select()	Outer scan loop; manages fd_set bitmaps; calls schedule_timeout()
vfs_poll(file, &table)	do_select() per fd	Calls file→f_op→poll() — dispatches to socket's tcp_poll()
tcp_poll(file, sock, table)	vfs_poll()	Checks sk_receive_queue; calls poll_wait() to register wait entry
poll_wait(file, wqh, table)	tcp_poll()	Calls table→_qproc = __pollwait()
__pollwait(file, wqh, table)	poll_wait()	Allocates poll_table_entry; adds to socket's sk_wq wait queue
sk_data_ready = sock_def_readable()	tcp_data_queue()	wake_up_interruptible_all(&sk->sk_wq->wait) — wakes all waiters

Interactive: select() O(n) Scan Simulator

do_select() scan — O(n) fd polling, FD_SETSIZE limit, copy overhead — C Demo

stdin (optional)

4. § 33.3 — poll() Deep Dive: pollfd Array & do_poll()

The pollfd Struct

poll() replaces the bitmap with an explicit array of struct pollfd. Each entry names an fd and the events to watch. The kernel fills revents on return.

struct pollfd {
int   fd;      /* 4 bytes — file descriptor to monitor */
short events;  /* 2 bytes — POLLIN | POLLOUT | POLLRDHUP | POLLERR | POLLHUP */
short revents; /* 2 bytes — filled by kernel; not modified by caller */
};  /* total: 8 bytes per fd */

do_poll() Flow

poll() Advantages Over select()

Aspect	select()	poll()
Max fd	1023 (FD_SETSIZE hardcoded)	Unlimited — array size only
Input destroyed?	YES — must reinit every call	NO — events field unchanged; only revents cleared
Memory per 1000 fds	128 bytes (fixed fd_set)	8 KB (1000 × 8 bytes pollfd)
POLLRDHUP (half-close)	NO	YES — detect TCP FIN without read()
Negative fd skip	Cannot — scans all up to nfds	Set fd=-1 to skip an entry
O(n) scan?	YES	YES — same kernel poll loop
Copy overhead	128 bytes in + 128 bytes out	8×n bytes in + 8×n bytes out
Timeout precision	microseconds (struct timeval)	milliseconds (int)

Interactive: poll() vs select() Comparison

poll() vs select() — pollfd struct, memory cost, O(n) comparison — C Demo

stdin (optional)

5. § 33.4 — Kernel Wait Queues: The Blocking/Wakeup Engine

This is the most important section for understanding IO multiplexing at the kernel level. The wait queue is the mechanism by which a process gives up the CPU when nothing is ready, and gets it back when data arrives. This is how select() can monitor thousands of fds efficiently — it does not poll in userspace; the kernel wakes it up.

Wait Queue Data Structures

Struct	Where it lives	Purpose
wait_queue_head_t sk_wq.wait	struct sock (per socket)	Head of the socket's wait queue. Spinlock + doubly-linked list of waiters.
wait_queue_entry_t	Kernel stack (poll_table_entry)	One entry per (process, fd) pair. Contains task pointer + wakeup function.
entry→func	= __pollwait_key for select/poll	Called when wakeup fires. Sets TASK_RUNNING on the sleeping process.
entry→private	= struct task_struct *	The sleeping process. try_to_wake_up() is called on this.
poll_table_entry.wait_address	= &sk→sk_wq.wait	Back-pointer to the wait queue head this entry is registered in.

Full Wakeup Sequence (from Interrupt to do_select() returning)

How a Process Goes from SLEEPING to RUNNING

The state transition happens in try_to_wake_up() in kernel/sched/core.c:

/* Simplified do_select() sleep path */
schedule_timeout(timeout_jiffies):
set_current_state(TASK_INTERRUPTIBLE);
schedule();  /* CPU released; this call doesn't return until woken */
/* Meanwhile, in softirq context (sk_data_ready): */
wake_up_interruptible_all(&sk->sk_wq->wait):
/* for each entry in wait list: */
try_to_wake_up(task, TASK_INTERRUPTIBLE, 0):
task->state = TASK_RUNNING;
enqueue_task(rq, task); /* add to CFS run queue */
/* Back in do_select() — schedule() returns: */
/* re-scan all n fds in fd_set loop */

Interactive: Thundering Herd — Wait Queue Simulator

Simulate three processes all sleeping in select() on the same socket. When data arrives, watch how all three wake up and only one gets the data.

Wait Queue: Thundering Herd SimulatorStep 1 / 6

Event: Initial state

All three processes called select() and included fd=8 (a socket). All three are sleeping (TASK_INTERRUPTIBLE) in the socket's wait queue.

wait_queue_head_t sk→sk_wq.wait(socket fd=8 wait queue)

Process A (select: fd 5,8,12)

wait_queue_entry_t → private=A

TASK_INTERRUPTIBLE (sleeping)

Process B (select: fd 3,8,20)

wait_queue_entry_t → private=B

TASK_INTERRUPTIBLE (sleeping)

Process C (select: fd 8,15)

wait_queue_entry_t → private=C

TASK_INTERRUPTIBLE (sleeping)

The Thundering Herd Problem

When multiple processes call select() on the same socket, all of them register a wait_queue_entry in the socket's wait queue. When data arrives, wake_up_interruptible_all() wakes all of them. Each wastes a context switch and O(n) fd scan, but only one gets the data.

epoll solves this with EPOLLEXCLUSIVE (Linux 4.5): marks the wait queue entry as exclusive (WQ_FLAG_EXCLUSIVE), so only one process is woken per event. SO_REUSEPORT is an alternative: each process gets its own socket; the kernel load-balances new connections.

Wait Queue Cleanup on Wakeup

Event	What happens to wait entries
fd becomes ready	do_select() re-polls; poll_freewait() removes entries from ALL wait queues the process registered in
Timeout expires	schedule_timeout() returns -ETIME; poll_freewait() cleans up all wait entries
Signal received	TASK_INTERRUPTIBLE woken by signal; do_select() returns -EINTR; poll_freewait() cleans up
select() returns normally	poll_freewait(&table) walks all poll_table_entry structs; list_del(&entry->wait.entry) removes from each wait_queue_head

6. § 33.5 — FD_SETSIZE Limit & O(n) Cost Analysis

Why FD_SETSIZE = 1024 Is a Hard Limit

FD_SETSIZE is defined in <sys/select.h> as 1024. The fd_set struct is statically sized: 128 bytes (1024 bits). You cannot pass nfds=2000 to select() without allocating a custom oversized fd_set — and even then, glibc macros will overflow.

/* glibc bits/typesizes.h */
#define FD_SETSIZE  1024  /* can't increase without recompiling glibc */
/* fd_set is a fixed array */
typedef struct {
unsigned long __fds_bits[FD_SETSIZE / (8 * sizeof(long))]; /* 16 uint64_t = 128 bytes */
} fd_set;
/* FD_SET(fd, set) — UNDEFINED BEHAVIOR if fd >= FD_SETSIZE */
#define FD_SET(fd, set)  (set).__fds_bits[(fd)/64] |= (1UL << ((fd) % 64))

Per-Call Cost Breakdown at N=10,000 Connections

Cost component	select() per call	poll() per call	epoll_wait() per call
fd_set / pollfd copy into kernel	128 bytes (fixed)	8×N = 80KB	0 bytes (state in kernel)
Scan / vfs_poll calls	O(n) — all nfds bits	O(n) — all N pollfds	O(k) — only ready fds
Result copy to userspace	128 bytes (fixed)	8×N = 80KB	16×k bytes (only ready)
Setup for next call	FD_ZERO + FD_SET N times	No setup needed	No setup needed
FD limit	1023 hard	Unlimited	Unlimited
Typical syscalls per event	1 select + 1 read = 2	1 poll + 1 read = 2	1 epoll_wait + 1 read = 2

The C10K Problem

Dan Kegel's 1999 “C10K problem” identified that with 10,000 concurrent connections and 100 events/second:

Blocking model: 10,000 threads × ~8 MB stack = 80 GB RAM. OS limit ~few thousand threads.
select(): 10,000 fd scans × 100 events/sec = 1,000,000 vfs_poll() calls/sec. 99.99% wasted. Plus fd_set copy overhead. Plus hit FD_SETSIZE=1024 anyway.
poll(): Same O(n) scan; no fd limit. But 10K × 8 bytes = 80KB copied kernel↔user per call × 100/sec = 8MB/sec of pointless kernel↔user copies.
epoll: Only ready fds are ever touched. Cost is O(k) where k is the number of active connections. 10,000 idle connections cost exactly 0. This is the fix.

Scenario (10K connections, 100 events/s)	vfs_poll calls/sec	Waste
select(nfds=10000)	10000 × 100 = 1,000,000	99.99%
poll(fds, 10000)	10000 × 100 = 1,000,000	99.99%
epoll_wait() (100 ready)	100 × 100 = 10,000	0%

When select() / poll() Are Acceptable

N < 50–100 fds: O(n) cost is negligible; code simplicity wins
Short-lived connections: setup overhead matters less than at steady-state
Portability: select() works everywhere; epoll is Linux-only
stdin + a few network fds: typical CLI tool use case
But: never use select() if any fd might be > 1023 (file descriptors from daemon code commonly exceed this)

7. Interview Prep — Questions This Page Answers

#	Question	Key answer
Q1	What happens between a NIC receiving a packet and select() returning?	NIC DMA → MSI-X IRQ → napi_schedule → softirq net_rx_action → tcp_v4_rcv → sk_data_ready → wake_up_interruptible_all → try_to_wake_up → do_select re-scan
Q2	What is FD_SETSIZE? Why can't you do select(nfds=10000)?	fd_set is 128 bytes (1024 bits) hardcoded. nfds > 1024 causes undefined behavior in FD_SET macros. poll() has no such limit.
Q3	Why is select() O(n)?	do_select() calls vfs_poll() for every fd in the input bitmap on every call, even if only 1 is ready. All n fds are scanned per wakeup.
Q4	How does a process sleeping in select() get woken up?	__pollwait adds process to socket's wait queue. sk_data_ready → wake_up_interruptible_all → try_to_wake_up sets TASK_RUNNING and adds to run queue.
Q5	What is the difference between select() and poll()?	Both O(n). poll() has no FD_SETSIZE limit, doesn't destroy input, supports POLLRDHUP. select() uses bitmap; poll() uses pollfd array (8B/fd).
Q6	What is the thundering herd problem with select()?	Multiple processes watching the same fd all wake up on one event. Only one gets the data; others waste a context switch + O(n) scan.
Q7	What is the difference between I/O multiplexing and async I/O?	Multiplexing: kernel notifies READINESS; app still calls read() to get data (2 phases). Async: kernel performs the I/O and delivers data to app buffer directly (1 phase, 0 blocking).
Q8	How does select() know which fd in the set is the one that became ready?	On wakeup, do_select() re-scans the ENTIRE fd_set calling vfs_poll() for each fd. The one(s) returning POLLIN are set in the result bitmap. No shortcut — O(n) every time.