Part XXV — I/O Multiplexing

§ 35 — Packet Path: NIC → Kernel Socket Buffer

NIC RX ring: DMA descriptor + DD bit + zero-copy (§35.1) · NAPI softirq: igb_poll() budget loop + GRO coalescing (§35.2) · Protocol stack: netif_receive_skb → ip_rcv → tcp_v4_rcv → sk_receive_queue (§35.3) · sk_data_ready → wait queue: pollwake O(N) vs ep_poll_callback O(1) (§35.4) · sk_buff memory layout: head/data/tail/end · skb_push / skb_pull zero-copy (§35.5)

1. § 35.1 — NIC Receives a Packet: DMA & IRQ

Before any kernel networking code runs, the NIC hardware has already written the packet into memory. The mechanism is DMA (Direct Memory Access): the NIC writes directly into a pre-allocated kernel buffer, and the CPU does not copy a single byte. Only after the transfer completes does the CPU get involved, via a hardware interrupt.

NIC RX Descriptor Ring

The driver pre-allocates a circular array of descriptors. Each descriptor holds a buf_addr (pointer to a kernel buffer), a len field, and a status field. When a packet arrives, the NIC DMA-writes the payload directly into buf_addr, then sets status = DD (Descriptor Done) to signal the driver.
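
The descriptor handshake described above can be sketched as a toy user-space model. This is not driver code: the class names (RxDesc, RxRing), the DD bit value, and the 2048-byte buffer size are assumptions made for illustration, and the model copies bytes out of the buffer for convenience where a real driver would wrap the buffer in an sk_buff.

```python
from dataclasses import dataclass

DD = 0x1  # "Descriptor Done" status bit (value chosen for this sketch)


@dataclass
class RxDesc:
    buf: bytearray   # stands in for buf_addr: the pre-allocated buffer
    length: int = 0  # number of bytes the NIC wrote
    status: int = 0  # DD is set by the NIC and cleared by the driver


class RxRing:
    def __init__(self, size, buf_size=2048):
        # The driver pre-allocates every buffer up front.
        self.descs = [RxDesc(bytearray(buf_size)) for _ in range(size)]
        self.nic_head = 0  # next descriptor the NIC will fill
        self.drv_next = 0  # next descriptor the driver will inspect

    def nic_dma_write(self, payload):
        """NIC side: place the payload in the buffer, then set DD."""
        d = self.descs[self.nic_head]
        if d.status & DD or len(payload) > len(d.buf):
            return False  # ring full or frame too large: frame is dropped
        d.buf[:len(payload)] = payload
        d.length = len(payload)
        d.status |= DD  # publish only after the data is in place
        self.nic_head = (self.nic_head + 1) % len(self.descs)
        return True

    def driver_clean(self):
        """Driver side: harvest every descriptor whose DD bit is set."""
        packets = []
        while self.descs[self.drv_next].status & DD:
            d = self.descs[self.drv_next]
            packets.append(bytes(d.buf[:d.length]))  # model-only copy
            d.status = 0  # hand the descriptor back to the NIC
            d.length = 0
            self.drv_next = (self.drv_next + 1) % len(self.descs)
        return packets
```

Setting DD last is the important ordering: the driver never sees a descriptor marked done before its buffer holds the full payload.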

IRQ → NAPI Scheduling

The hardware interrupt tells the CPU a packet has arrived. The interrupt service routine (igb_intr()) does the minimum possible work: disable further NIC interrupts and schedule the NAPI softirq. All actual packet processing happens later in softirq context — outside the interrupt handler.

Concept | Detail
DMA | The NIC writes the payload directly into a pre-allocated kernel buffer; the CPU does not participate in the copy, so the transfer itself costs no CPU cycles.
DD bit | Descriptor Done: the NIC sets status=DD after the DMA write completes. The driver polls this bit to discover which descriptors hold new data.
ISR | igb_intr() does minimal work only: disable the NIC IRQ and call napi_schedule(). No packet is processed inside the interrupt handler.
NAPI | New API: instead of one IRQ per packet, NAPI batch-polls up to budget packets per softirq invocation, which is critical at 10 Gbps rates.
Why disable the NIC IRQ immediately | At 10 Gbps, up to millions of packets per second can arrive. Without masking the IRQ, the CPU could spend nearly all of its time in interrupt handlers, starving all other work.
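
To put a number on "millions of packets per second", the following sketch computes the theoretical Ethernet packet rate at 10 Gbps. It assumes standard Ethernet framing overhead on the wire (8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-frame gap); the function name is made up for this example.

```python
LINE_RATE_BPS = 10_000_000_000  # 10 Gbps
WIRE_OVERHEAD = 8 + 12          # preamble/SFD (8 B) + inter-frame gap (12 B)


def max_packets_per_second(frame_bytes, line_rate_bps=LINE_RATE_BPS):
    """Upper bound on frames per second for a given frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD) * 8
    return line_rate_bps / bits_on_wire


for frame in (64, 1518):
    pps = max_packets_per_second(frame)
    print(f"{frame:>5}-byte frames: {pps / 1e6:6.2f} Mpps, "
          f"one every {1e9 / pps:7.1f} ns")
```

With minimum-size 64-byte frames the bound is roughly 14.88 million packets per second, or one frame about every 67 ns; with full-size 1518-byte frames it drops to roughly 0.81 million. One interrupt per packet is clearly unworkable in the small-frame case.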

Interview Prep

Question | Answer
What does the NIC write into the RX ring buffer? Who allocates it? | The NIC DMA-writes the raw packet payload into a kernel buffer pointed to by buf_addr. The driver (igb) pre-allocates these buffers at initialization and loads their physical addresses into the descriptors.
Why does the driver ISR disable NIC interrupts immediately? | To prevent an interrupt storm. At high packet rates the NIC would raise IRQs faster than the CPU can handle them. Disabling the IRQ and switching to NAPI's polled batch mode keeps the CPU usable.
What is NAPI and how does it differ from the old per-packet interrupt model? | NAPI (New API) uses a softirq to poll the RX ring for up to budget packets per invocation, re-enabling the NIC IRQ only when the ring is drained. The old model raised one hardware IRQ per packet, which is untenable at 10+ Gbps.

2. § 35.2 — NAPI Softirq: Polling the RX Ring

The hardware interrupt only schedules NAPI. The real work — reading every filled descriptor and building an sk_buff per packet — happens in softirq context via igb_poll(). A budget cap ensures the softirq doesn't monopolize the CPU.

NAPI Poll Loop

Step | Function | Effect
1 | net_rx_action() | Softirq handler; iterates registered NAPI structs, enforces global budget
2 | napi_poll(n, budget) | Calls the driver's poll fn with remaining budget
3 | igb_poll() | Reads RX ring: checks DD bit, wraps DMA buffer in sk_buff, refills descriptor
4 | napi_gro_receive() | GRO coalescing: merges consecutive TCP segments with the same 5-tuple to reduce upper-layer calls
5 | netif_receive_skb() | Hands the completed sk_buff up to the protocol stack (L2 → L3 → L4)
6 | napi_complete_done() | Called when processed < budget: marks NAPI idle and re-enables NIC IRQ
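
The budget rule in the poll loop can be illustrated with a small user-space model. The class ToyNapi and its method names are hypothetical; they mirror the roles of the ISR, igb_poll(), and napi_complete_done() but share no code with the kernel.

```python
from collections import deque


class ToyNapi:
    def __init__(self, budget=64):
        self.budget = budget
        self.rx_ring = deque()   # packets waiting in the RX ring
        self.irq_enabled = True  # NIC interrupt unmasked?
        self.scheduled = False   # NAPI poll pending?

    def nic_receive(self, n):
        """n packets arrive; the NIC raises an IRQ only if it is unmasked."""
        self.rx_ring.extend(range(n))
        if self.irq_enabled:
            self.isr()

    def isr(self):
        """Minimal ISR: mask the IRQ and schedule the poll, nothing else."""
        self.irq_enabled = False
        self.scheduled = True

    def poll(self):
        """One softirq pass; returns the number of packets processed."""
        if not self.scheduled:
            return 0
        work = 0
        while work < self.budget and self.rx_ring:
            self.rx_ring.popleft()  # stands in for building one sk_buff
            work += 1
        if work < self.budget:
            # Ring drained: go idle and unmask the IRQ.
            self.scheduled = False
            self.irq_enabled = True
        # Otherwise the budget is exhausted: stay scheduled, IRQ stays masked.
        return work


napi = ToyNapi(budget=64)
napi.nic_receive(150)
print([napi.poll() for _ in range(4)], napi.irq_enabled)
```

A burst of 150 packets with a budget of 64 takes three passes (64, 64, 22). Only the third pass processes fewer packets than the budget, so only then does the model unmask the interrupt.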

sk_buff Memory Layout

Each packet is wrapped in an sk_buff — a metadata descriptor, not a copy of the data. The data pointer marks the start of the current payload; tail marks the end. Moving these pointers (not memcpy-ing bytes) is how the kernel strips and prepends headers at each protocol layer.

Interactive: NAPI Budget Simulator

Adjust arrival rate and budget to see when NAPI exhausts the budget and reschedules (NIC IRQ stays disabled) versus drains the ring and re-enables the IRQ. Three consecutive budget-exhaustion ticks trigger the starvation warning.

Demo 3512: NAPI Budget Simulator (interactive widget, not reproduced in this text). Default settings: arrival rate 80 pkts/ms, NAPI budget 64 pkts. Status legend: IRQ re-enabled, NAPI rescheduled, starvation risk.

3. § 35.3 — Network Stack: L2 → L3 → L4

Once netif_receive_skb() hands the sk_buff to the network stack, it travels through three protocol layers. Each layer strips its header (via skb_pull) and dispatches to the next. The journey ends when skb_queue_tail() places the skb on the socket's receive queue and sk_data_ready wakes any sleeping readers.

Protocol Stack Funnel

Socket Lookup: __inet_lookup_skb()

The kernel keeps established TCP sockets in a hash table keyed by the 4-tuple {saddr, daddr, sport, dport}. __inet_lookup_skb() hashes the incoming packet's 4-tuple, walks the collision chain in that bucket, and returns the matching struct sock *. If no established socket matches, the lookup falls back to the listening sockets; if that also fails, the connection is unknown and the kernel answers with a TCP RST.
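
A minimal sketch of a chained hash lookup keyed by the 4-tuple follows. ToyEhash and tcp_rcv are hypothetical names, Python's built-in hash() stands in for the kernel's hash function, and only the exact-match path is modeled.

```python
from collections import namedtuple

FourTuple = namedtuple("FourTuple", "saddr daddr sport dport")


class ToyEhash:
    def __init__(self, nbuckets=8):
        # Each bucket is a collision chain of (key, sock) pairs.
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, sock):
        self._bucket(key).append((key, sock))

    def lookup(self, key):
        # Hash once, then walk only the chain in that one bucket.
        for candidate, sock in self._bucket(key):
            if candidate == key:
                return sock
        return None


def tcp_rcv(table, key):
    """Deliver to the matching socket, or signal that an RST is due."""
    sock = table.lookup(key)
    return "deliver to " + sock if sock is not None else "send RST"


table = ToyEhash()
known = FourTuple("10.0.0.1", "10.0.0.2", 40000, 80)
table.insert(known, "sock#1")
print(tcp_rcv(table, known))
print(tcp_rcv(table, FourTuple("10.0.0.9", "10.0.0.2", 40001, 80)))
```

The cost of a lookup depends on the length of one collision chain, not on the total number of open sockets, which is why the table is sized to keep chains short.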

Function | File | Action
netif_receive_skb() | net/core/dev.c | EtherType dispatch: reads skb->protocol, calls registered L3 handler (ip_rcv for ETH_P_IP)
ip_rcv() | net/ipv4/ip_input.c | IP header validation (checksum, version, length), Netfilter PREROUTING hook, routing decision
tcp_v4_rcv() | net/ipv4/tcp_ipv4.c | TCP entry point; calls __inet_lookup_skb() to find the socket, drops or RSTs if none found
tcp_rcv_established() | net/ipv4/tcp_input.c | Fast path for ESTABLISHED sockets: sequence number check, ACK processing, skb enqueue
skb_queue_tail() |  | Appends skb to sk->sk_receive_queue under the socket lock; data is now available to read()
sk->sk_data_ready() | net/core/sock.c | Function pointer (default sock_def_readable()); wakes processes sleeping on the socket wait queue
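
The layer-by-layer dispatch in the table can be mimicked in user space by parsing a hand-built frame. This is only a sketch: build_frame and receive are hypothetical helpers, checksums are left at zero and never validated, and an integer offset plays the role of skb->data.

```python
import struct

ETH_P_IP = 0x0800  # EtherType for IPv4
IPPROTO_TCP = 6    # IP protocol number for TCP


def build_frame(payload, sport=12345, dport=80, ethertype=ETH_P_IP):
    """Assemble Ethernet (14 B) + IPv4 (20 B) + TCP (20 B) + payload."""
    eth = struct.pack("!6s6sH", b"\xaa" * 6, b"\xbb" * 6, ethertype)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload), 0, 0,
                     64, IPPROTO_TCP, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    tcp = struct.pack("!HHIIBBHHH", sport, dport, 0, 0, 5 << 4, 0x18,
                      65535, 0, 0)
    return eth + ip + tcp + payload


def receive(frame):
    """Walk L2 -> L3 -> L4, advancing an offset past each header."""
    view = memoryview(frame)  # slicing a memoryview does not copy
    data = 0                  # plays the role of skb->data

    # L2: dispatch on EtherType, then pull the 14-byte Ethernet header.
    (ethertype,) = struct.unpack_from("!H", view, data + 12)
    if ethertype != ETH_P_IP:
        return None
    data += 14

    # L3: check version and protocol, then pull the IP header (IHL words).
    (ver_ihl,) = struct.unpack_from("!B", view, data)
    (proto,) = struct.unpack_from("!B", view, data + 9)
    if ver_ihl >> 4 != 4 or proto != IPPROTO_TCP:
        return None
    data += (ver_ihl & 0x0F) * 4

    # L4: read the ports, then pull the TCP header (data offset words).
    sport, dport = struct.unpack_from("!HH", view, data)
    (doff,) = struct.unpack_from("!B", view, data + 12)
    data += (doff >> 4) * 4

    return sport, dport, bytes(view[data:])


print(receive(build_frame(b"hello")))
```

Each layer reads only its own header and advances the offset; the payload bytes are touched once, at the very end.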

4. § 35.4 — sk_data_ready: Stack → Wait Queue

After skb_queue_tail() places data on the socket, sk->sk_data_ready(sk) fires. For TCP this is sock_def_readable(), which traverses the socket's wait queue. This single traversal is where select and epoll diverge — not in how they sleep, but in what the wakeup callback does.

sk_data_ready → Wait Queue Traversal

select vs epoll: Same Trigger, Different Callbacks

Both select and epoll register entries in the same socket wait queue. The difference is the func field: pollwake wakes the process to re-scan all N fds; ep_poll_callback inserts one epitem into the rdllist before waking — so the process already knows which fd is ready.
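
The difference between the two callbacks can be shown with a toy wait queue. All class names here (WaitQueue, ToySocket, ToySelect, ToyEpoll) are hypothetical, and the model ignores locking, sleeping, and level-triggered re-queuing; it only contrasts the amount of work done after a wakeup.

```python
class WaitQueue:
    def __init__(self):
        self.entries = []  # callbacks, like wait_queue_entry_t.func

    def add(self, func):
        self.entries.append(func)

    def wake_up(self):
        for func in self.entries:
            func()


class ToySocket:
    def __init__(self, fd):
        self.fd = fd
        self.readable = False
        self.wq = WaitQueue()

    def data_ready(self):
        """Stands in for sk_data_ready: mark readable, run the callbacks."""
        self.readable = True
        self.wq.wake_up()


class ToySelect:
    def __init__(self, socks):
        self.socks = socks
        self.woken = False
        self.fds_scanned = 0
        for s in socks:
            s.wq.add(self.pollwake)

    def pollwake(self):
        self.woken = True  # records only that something happened

    def wait(self):
        ready = []
        for s in self.socks:  # O(N): every fd is re-checked
            self.fds_scanned += 1
            if s.readable:
                ready.append(s.fd)
        return ready


class ToyEpoll:
    def __init__(self, socks):
        self.rdllist = []    # ready list
        self.linked = set()  # fds already on the ready list
        self.woken = False
        for s in socks:
            s.wq.add(lambda s=s: self.ep_poll_callback(s))

    def ep_poll_callback(self, sock):
        if sock.fd not in self.linked:  # O(1): record exactly which fd fired
            self.linked.add(sock.fd)
            self.rdllist.append(sock.fd)
        self.woken = True

    def wait(self):
        ready, self.rdllist = self.rdllist, []
        self.linked.clear()
        return ready


socks = [ToySocket(fd) for fd in range(1000)]
sel, ep = ToySelect(socks), ToyEpoll(socks)
socks[742].data_ready()
print(sel.wait(), sel.fds_scanned, ep.wait())
```

Both waiters learn that fd 742 is ready, but the select model examined all 1000 sockets to find it while the epoll model simply handed back its ready list.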

Interactive: Full Packet Stack Trace

Step through all 9 kernel stages from NIC DMA to read() returning. Toggle the mode at Step 6 to compare the select pollwake O(N) path against the epoll ep_poll_callback O(1) path.

Demo 3513: Full Packet Stack Trace (interactive widget, not reproduced in this text). Initial view, Step 0 / 8:

Field | Value
struct rx_desc (RX Ring) | buf_addr = 0xffff8800 (kernel buffer); length = 128 bytes; status DD: 0 → 1
Action | NIC DMA-writes 128B payload into pre-allocated kernel buffer. Sets Descriptor Done bit. CPU not involved.
Function | igb hardware (DMA engine)
File | drivers/net/ethernet/intel/igb/
Process State | TASK_INTERRUPTIBLE
skb State | In NIC DMA buffer; sk_buff not yet created
Mem Change | rx_desc.status: DD=0 → DD=1

5. § 35.5 — sk_buff Memory Layout

An sk_buff is a descriptor, not a copy of the packet. The actual bytes live in a single contiguous allocation bounded by head and end. Two further pointers, data and tail, mark the current payload boundaries inside it. Every layer strips or prepends headers by moving these pointers, never by copying bytes. This pointer arithmetic is what lets headers be added and removed without copying the payload.

Header Push / Pull at Each Layer

On the receive path, each layer calls skb_pull(n) to move data rightward past its header. On transmit, skb_push(n) moves data leftward into headroom to make room for the outgoing header. No memcpy involved either way.
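
The pointer movement can be traced with a small model of the four offsets. ToySkb is a hypothetical class, not the kernel structure; the buffer size (256) and headroom (64) are assumptions chosen so that a 64-byte payload starts at data=64 and ends at tail=128.

```python
class ToySkb:
    def __init__(self, size=256, headroom=64):
        self.buf = bytearray(size)  # the one contiguous allocation
        self.head = 0               # start of the allocation
        self.data = headroom        # start of the current payload
        self.tail = headroom        # end of the current payload
        self.end = size             # end of the allocation

    @property
    def length(self):
        """Corresponds to skb->len for this linear buffer."""
        return self.tail - self.data

    def put(self, payload):
        """Append bytes at tail (consumes tailroom)."""
        if self.tail + len(payload) > self.end:
            raise ValueError("not enough tailroom")
        self.buf[self.tail:self.tail + len(payload)] = payload
        self.tail += len(payload)

    def push(self, header):
        """Prepend a header: move data left into headroom, write it there."""
        if self.data - len(header) < self.head:
            raise ValueError("not enough headroom")
        self.data -= len(header)
        self.buf[self.data:self.data + len(header)] = header

    def pull(self, n):
        """Strip n header bytes: move data right; nothing is copied."""
        if n > self.length:
            raise ValueError("pull beyond tail")
        self.data += n


skb = ToySkb()
skb.put(b"p" * 64)   # payload
skb.push(b"T" * 20)  # transmit path: TCP, then IP, then Ethernet
skb.push(b"I" * 20)
skb.push(b"E" * 14)
print(skb.data, skb.tail, skb.length)
for hdr_len in (14, 20, 20):  # receive path: strip in the opposite order
    skb.pull(hdr_len)
print(skb.data, skb.tail, skb.length)
```

After the three pushes the model's data offset has moved from 64 down to 10 and the length has grown from 64 to 118; the three pulls restore data to 64 without touching the payload bytes.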

Interactive: sk_buff Memory Explorer

Push and pull protocol headers to watch the data pointer move and len change in real time. Add payload to extend tail. The headroom region shrinks as headers are pushed; pulling them back reclaims that space.

Demo 3514: sk_buff Explorer (interactive widget, not reproduced in this text). Initial state: head=0, data=64, tail=128, end=256, len=64. Regions shown: headroom, L2 ETH (14B), L3 IP (20B), L4 TCP (20B), payload, tailroom. Kernel log: Initial: data=64, tail=128, len=64

Interview Prep — Must-Know Questions

Question | Key answer
What is NAPI? Why disable the NIC IRQ after the first interrupt? | NAPI (New API) is a batch-poll model: one hardware IRQ schedules a softirq that polls up to budget packets per invocation. The NIC IRQ is disabled immediately to prevent an interrupt storm; at 10 Gbps, millions of IRQs per second would starve the CPU.
What does sk_buff contain? Is it the packet data? | sk_buff is a metadata descriptor only. The actual bytes sit in a separate contiguous buffer. The data pointer in sk_buff points into that buffer; no packet bytes live inside the sk_buff struct itself.
What do skb_push() and skb_pull() do? Why is this zero-copy? | skb_push(n) moves the data pointer left by n bytes (prepends a header into headroom). skb_pull(n) moves it right (strips a header). No memcpy, only pointer arithmetic on the same allocation.
Name the main functions on the path from netif_receive_skb() to sk_receive_queue. | netif_receive_skb() → ip_rcv() → ip_local_deliver_finish() → tcp_v4_rcv() → __inet_lookup_skb() → tcp_v4_do_rcv() → tcp_rcv_established() → skb_queue_tail(&sk->sk_receive_queue, skb)
What is sk_data_ready? Which function does TCP use by default? | sk_data_ready is a function pointer on struct sock, called after the skb is enqueued on sk_receive_queue. The default for TCP is sock_def_readable(), which calls wake_up_interruptible_sync_poll() on sk->sk_wq->wait.
How does the same wake_up() serve both select and epoll? | Both register a wait_queue_entry_t in the socket's sk_wq, but with different func pointers. select installs pollwake(), which wakes the process to re-scan all N fds (O(N)). epoll installs ep_poll_callback(), which inserts one epitem into rdllist in O(1) and then wakes epoll_wait.
What is __inet_lookup_skb()? What is its lookup key? | Finds the matching struct sock for an incoming packet. The key is the 4-tuple {saddr, daddr, sport, dport}, hashed into inet_hashinfo.ehash. It walks the collision chain of the selected bucket. No socket found → kernel sends TCP RST.