§ 35 — Packet Path: NIC → Kernel Socket Buffer
NIC RX ring: DMA descriptor + DD bit + zero-copy (§35.1) · NAPI softirq: igb_poll() budget loop + GRO coalescing (§35.2) · Protocol stack: netif_receive_skb → ip_rcv → tcp_v4_rcv → sk_receive_queue (§35.3) · sk_data_ready → wait queue: pollwake O(N) vs ep_poll_callback O(1) (§35.4) · sk_buff memory layout: head/data/tail/end · skb_push / skb_pull zero-copy (§35.5)
1. § 35.1 — NIC Receives a Packet: DMA & IRQ
Before any kernel networking code runs, the NIC hardware has already written the packet into memory. The mechanism is DMA (Direct Memory Access): the NIC writes directly into a pre-allocated kernel buffer, and the CPU takes no part in the copy (its only earlier role was the driver setting up the buffers and descriptors). Only after the transfer completes does the CPU get involved, via a hardware interrupt.
NIC RX Descriptor Ring
The driver pre-allocates a circular array of descriptors. Each descriptor holds a buf_addr (the DMA address of a kernel buffer), a len field, and a status field. When a packet arrives, the NIC DMA-writes the frame directly into the buffer at buf_addr, records its length in len, then sets the DD (Descriptor Done) bit in status to signal the driver.
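The ring protocol described above can be sketched as a userspace toy model. This is not driver code: the class names, the buffer-pool index standing in for a DMA address, and the numeric value of the DD bit are all assumptions made for illustration.

```python
from dataclasses import dataclass

DD = 0x1  # "Descriptor Done" status bit (value chosen for illustration only)


@dataclass
class RxDescriptor:
    buf_addr: int    # index into the buffer pool, standing in for a DMA address
    length: int = 0
    status: int = 0


class RxRing:
    """Toy RX ring: the 'NIC' fills descriptors, the 'driver' drains them."""

    def __init__(self, size, buf_size=2048):
        self.buffers = [bytearray(buf_size) for _ in range(size)]
        self.ring = [RxDescriptor(buf_addr=i) for i in range(size)]
        self.nic_next = 0   # next descriptor the NIC will fill
        self.drv_next = 0   # next descriptor the driver will check

    def nic_receive(self, frame: bytes) -> bool:
        """NIC side: write the frame into the buffer, then set DD."""
        desc = self.ring[self.nic_next]
        if desc.status & DD:
            return False    # ring full: the driver has not recycled this slot
        self.buffers[desc.buf_addr][:len(frame)] = frame
        desc.length = len(frame)
        desc.status |= DD   # set last, so the driver never sees a partial frame
        self.nic_next = (self.nic_next + 1) % len(self.ring)
        return True

    def driver_poll(self):
        """Driver side: consume every descriptor whose DD bit is set."""
        frames = []
        while self.ring[self.drv_next].status & DD:
            desc = self.ring[self.drv_next]
            frames.append(bytes(self.buffers[desc.buf_addr][:desc.length]))
            desc.status = 0     # hand the descriptor back to the NIC
            desc.length = 0
            self.drv_next = (self.drv_next + 1) % len(self.ring)
        return frames
```

With a two-slot ring, a third `nic_receive()` call returns `False` until `driver_poll()` has recycled a descriptor, which is the toy equivalent of a drop caused by a full RX ring.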
IRQ → NAPI Scheduling
The hardware interrupt tells the CPU a packet has arrived. The interrupt service routine (igb_intr()) does the minimum possible work: disable further NIC interrupts and schedule the NAPI softirq. All actual packet processing happens later in softirq context — outside the interrupt handler.
| Concept | Detail |
|---|---|
| DMA | NIC writes payload directly into a pre-allocated kernel buffer; the CPU does not participate in the copy, so no CPU cycles are spent moving packet bytes. |
| DD bit | Descriptor Done: NIC sets status=DD after the DMA write completes. The driver polls this bit to discover which descriptors hold new data. |
| ISR | igb_intr() does minimal work only: disable NIC IRQ + call napi_schedule(). No packet is processed inside the interrupt handler. |
| NAPI | New API: instead of one IRQ per packet, NAPI batch-polls up to budget packets per softirq invocation — critical at 10 Gbps rates. |
| Why disable NIC IRQ immediately | At 10 Gbps, millions of packets per second can arrive (up to about 14.88 million with minimum-size frames). Without disabling IRQs, the CPU could spend nearly all of its time in interrupt handlers, starving all other work. |
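The "millions of packets per second" figure can be checked with a short calculation. The sketch below assumes worst-case minimum-size 64-byte Ethernet frames and 20 bytes of per-frame wire overhead (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap); the function name is made up for this example.

```python
def max_packets_per_second(link_bps, frame_bytes=64, overhead_bytes=20):
    """Upper bound on frame rate for a link, given per-frame wire cost."""
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_per_frame


pps = max_packets_per_second(10e9)
# prints "14.88 Mpps, 67.2 ns per packet"
print(f"{pps / 1e6:.2f} Mpps, {1e9 / pps:.1f} ns per packet")
```

At roughly 67 ns per packet, taking one hardware interrupt per packet leaves almost no time for anything else, which is the motivation for switching to polled batches.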
Interview Prep
| Question | Answer |
|---|---|
| What does the NIC write into the RX ring buffer? Who allocates it? | The NIC DMA-writes the received frame into a kernel buffer pointed to by buf_addr. The driver (igb) pre-allocates these buffers at initialization and loads their DMA addresses into the descriptors. |
| Why does the driver ISR disable NIC interrupts immediately? | To prevent an interrupt storm. At high packet rates the NIC would raise IRQs faster than the CPU can handle them. Disabling the IRQ and switching to NAPI's polled batch mode keeps the CPU usable. |
| What is NAPI and how does it differ from the old per-packet interrupt model? | NAPI (New API) uses a softirq to poll the RX ring for up to budget packets per invocation, re-enabling the NIC IRQ only when the ring is drained. The old model raised one hardware IRQ per packet — untenable at 10+ Gbps. |
2. § 35.2 — NAPI Softirq: Polling the RX Ring
The hardware interrupt only schedules NAPI. The real work — reading every filled descriptor and building an sk_buff per packet — happens in softirq context via igb_poll(). A budget cap ensures the softirq doesn't monopolize the CPU.
NAPI Poll Loop
| Step | Function | Effect |
|---|---|---|
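| 1 | net_rx_action() | Softirq handler; iterates the NAPI instances scheduled on this CPU's poll list and enforces a global budget (net.core.netdev_budget, default 300 packets) plus a time limit |
| 2 | napi_poll(n, budget) | Calls the driver's poll function with that instance's own weight (typically 64) as its budget |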
| 3 | igb_poll() | Reads RX ring: checks DD bit, wraps DMA buffer in sk_buff, refills descriptor |
| 4 | napi_gro_receive() | GRO coalescing: merges consecutive TCP segments with the same 5-tuple to reduce upper-layer calls |
| 5 | netif_receive_skb() | Hands the completed sk_buff up to the protocol stack (L2 → L3 → L4) |
| 6 | napi_complete_done() | Called when processed < budget: marks the NAPI instance idle; if it returns true, the driver re-enables the NIC IRQ |
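The budget logic in the table can be sketched as a small simulation. It is a toy model, not kernel code: `ToyNapi` and its fields are hypothetical, the constants mirror commonly cited defaults (a per-instance weight of 64 and a per-softirq budget of 300), and the kernel's additional time limit is left out.

```python
from collections import deque

NAPI_WEIGHT = 64      # per-instance budget for one poll call (common default)
NETDEV_BUDGET = 300   # total packets per softirq run (common default)


class ToyNapi:
    def __init__(self, pending):
        self.rx_ring = deque(range(pending))  # packets waiting in the RX ring
        self.irq_enabled = False              # the ISR disabled it before scheduling

    def poll(self, budget):
        """Driver poll function: process at most `budget` packets."""
        done = 0
        while done < budget and self.rx_ring:
            self.rx_ring.popleft()
            done += 1
        if done < budget:
            self.irq_enabled = True   # ring drained: complete NAPI, re-arm the IRQ
        return done


def net_rx_action(poll_list):
    """Poll scheduled instances until all are idle or the budget is spent."""
    budget = NETDEV_BUDGET
    processed = 0
    while poll_list and budget > 0:
        napi = poll_list.popleft()
        work = napi.poll(NAPI_WEIGHT)
        budget -= work
        processed += work
        if work == NAPI_WEIGHT:
            poll_list.append(napi)    # used its full weight: stays scheduled
    return processed, len(poll_list)
```

A lightly loaded instance (`ToyNapi(10)`) drains in one poll and gets its IRQ back. A saturated one (`ToyNapi(1000)`) keeps consuming its full weight until the softirq budget runs out, and it remains on the poll list with its IRQ still disabled.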
sk_buff Memory Layout
Each packet is wrapped in an sk_buff — a metadata descriptor, not a copy of the data. The data pointer marks the start of the current payload; tail marks the end. Moving these pointers (not memcpy-ing bytes) is how the kernel strips and prepends headers at each protocol layer.
Interactive: NAPI Budget Simulator
Adjust arrival rate and budget to see when NAPI exhausts the budget and reschedules (NIC IRQ stays disabled) versus drains the ring and re-enables the IRQ. Three consecutive budget-exhaustion ticks trigger the starvation warning.
3. § 35.3 — Network Stack: L2 → L3 → L4
Once netif_receive_skb() hands the sk_buff to the network stack, it travels through three protocol layers. Each layer strips its header (via skb_pull) and dispatches to the next. The journey ends when skb_queue_tail() places the skb on the socket's receive queue and sk_data_ready wakes any sleeping readers.
Protocol Stack Funnel
Socket Lookup: __inet_lookup_skb()
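The kernel maintains a hash table of established TCP sockets (inet_hashinfo.ehash) keyed by the 4-tuple {saddr, daddr, sport, dport}. __inet_lookup_skb() hashes the incoming packet's 4-tuple, walks the collision chain in that bucket, and returns the matching struct sock *. If no established socket matches, it falls back to the listening sockets bound to the destination port, which is how a new SYN finds its listener. If nothing matches at all, the kernel replies with a TCP RST (unless the incoming segment is itself a RST), because the connection is unknown.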
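A minimal sketch of that lookup, assuming a toy table with chained buckets. The class and function names are hypothetical, sockets are represented by plain strings, and the fallback to a listening socket keyed only by local port is a simplification of what the kernel does.

```python
from typing import NamedTuple, Optional


class FourTuple(NamedTuple):
    saddr: str
    sport: int
    daddr: str
    dport: int


class ToySocketTable:
    def __init__(self, buckets=8):
        self.ehash = [[] for _ in range(buckets)]  # established sockets, chained
        self.listeners = {}                        # local port -> listening socket

    def _bucket(self, key):
        return self.ehash[hash(key) % len(self.ehash)]

    def add_established(self, key, sock):
        self._bucket(key).append((key, sock))

    def add_listener(self, port, sock):
        self.listeners[port] = sock

    def lookup(self, key) -> Optional[str]:
        for candidate, sock in self._bucket(key):  # walk the collision chain
            if candidate == key:
                return sock
        return self.listeners.get(key.dport)       # fall back to a listener


def tcp_receive(table, key):
    """Deliver to the matching socket, or answer with a reset if none exists."""
    sock = table.lookup(key)
    return f"deliver to {sock}" if sock else "send RST"
```

A segment for a known connection reaches its established socket, a segment for port 80 from an unknown peer reaches the listener, and a segment for a port nobody is bound to yields `"send RST"`.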
| Function | File | Action |
|---|---|---|
| netif_receive_skb() | net/core/dev.c | ETH type dispatch — reads skb->protocol, calls registered L3 handler (ip_rcv for ETH_P_IP) |
| ip_rcv() | net/ipv4/ip_input.c | IP header validation (checksum, version, length), then the Netfilter PREROUTING hook; its continuation ip_rcv_finish() makes the routing decision |
| tcp_v4_rcv() | net/ipv4/tcp_ipv4.c | TCP entry point; calls __inet_lookup_skb() to find the socket, drops or RSTs if none found |
| tcp_rcv_established() | net/ipv4/tcp_input.c | Fast path for ESTABLISHED sockets: sequence number check, ACK processing, skb enqueue |
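| skb_queue_tail() | net/core/skbuff.c | Appends skb to sk->sk_receive_queue; data is now available to read(). skb_queue_tail() takes the queue's own spinlock, while the TCP receive path already holds the socket lock and uses the lockless variant __skb_queue_tail() |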
| sk->sk_data_ready() | net/core/sock.c | Function pointer — default sock_def_readable() — wakes processes sleeping on the socket wait queue |
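The layer-by-layer dispatch in the table can be illustrated with a userspace parser over raw bytes. This is only a sketch: the handler dictionaries stand in for the kernel's registered protocol handlers, the function names are borrowed loosely from the table, and validation is reduced to the bare minimum.

```python
import struct

ETH_P_IP = 0x0800   # EtherType for IPv4
IPPROTO_TCP = 6     # IP protocol number for TCP


def tcp_rcv(segment):
    """L4: read the ports, skip the TCP header, return the payload."""
    sport, dport = struct.unpack_from("!HH", segment, 0)
    data_offset = (segment[12] >> 4) * 4   # TCP header length in bytes
    return {"sport": sport, "dport": dport,
            "payload": bytes(segment[data_offset:])}


L4_HANDLERS = {IPPROTO_TCP: tcp_rcv}


def ip_rcv(packet):
    """L3: validate the header, then dispatch on the protocol field."""
    version, ihl = packet[0] >> 4, (packet[0] & 0x0F) * 4
    if version != 4 or ihl < 20:
        return None                          # malformed header: drop
    handler = L4_HANDLERS.get(packet[9])     # byte 9 is the protocol field
    return handler(packet[ihl:]) if handler else None


L3_HANDLERS = {ETH_P_IP: ip_rcv}


def netif_receive(frame):
    """L2: dispatch on the EtherType at offset 12 of the Ethernet header."""
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    handler = L3_HANDLERS.get(ethertype)
    # memoryview slices share the frame's memory instead of copying it
    return handler(memoryview(frame)[14:]) if handler else None


def demo_frame(payload=b"hi", ethertype=ETH_P_IP):
    """Build a minimal Ethernet + IPv4 + TCP frame (checksums left as zero)."""
    eth = b"\x00" * 12 + struct.pack("!H", ethertype)
    ip = (bytes([0x45, 0]) + struct.pack("!H", 40 + len(payload))
          + b"\x00" * 4 + bytes([64, IPPROTO_TCP]) + b"\x00" * 2
          + bytes([10, 0, 0, 2]) + bytes([10, 0, 0, 1]))
    tcp = struct.pack("!HHIIBBHHH", 40000, 80, 0, 0, 5 << 4, 0x18, 65535, 0, 0)
    return eth + ip + tcp + payload
```

Calling `netif_receive(demo_frame())` walks all three layers and returns the ports and payload, while a frame carrying an unregistered EtherType is dropped at L2 and returns `None`.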
4. § 35.4 — sk_data_ready: Stack → Wait Queue
After skb_queue_tail() places data on the socket, sk->sk_data_ready(sk) fires. For TCP this is sock_def_readable(), which traverses the socket's wait queue. This single traversal is where select and epoll diverge — not in how they sleep, but in what the wakeup callback does.
sk_data_ready → Wait Queue Traversal
select vs epoll: Same Trigger, Different Callbacks
Both select and epoll register entries in the same socket wait queue. The difference is the func field: pollwake wakes the process to re-scan all N fds; ep_poll_callback inserts one epitem into the rdllist before waking — so the process already knows which fd is ready.
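The difference between the two callbacks can be sketched with a toy wait queue. Every class here is hypothetical and the scheduling itself is not modelled; only the function names `pollwake` and `ep_poll_callback` and the `rdllist` field come from the text above.

```python
class WaitQueue:
    """Toy socket wait queue: a list of (func, private) entries."""

    def __init__(self):
        self.entries = []

    def add(self, func, private):
        self.entries.append((func, private))

    def wake_up(self):
        for func, private in self.entries:   # one traversal per data-ready event
            func(private)


class ToySocket:
    def __init__(self, fd):
        self.fd = fd
        self.readable = False
        self.wq = WaitQueue()

    def data_ready(self):
        self.readable = True
        self.wq.wake_up()


def pollwake(waiter):
    waiter.triggered = True    # says only "something happened", not which fd


class SelectWaiter:
    def __init__(self, sockets):
        self.sockets = sockets
        self.triggered = False
        self.scans = 0
        for s in sockets:
            s.wq.add(pollwake, self)

    def collect_ready(self):
        ready = []
        for s in self.sockets:   # O(N): every watched fd is polled again
            self.scans += 1
            if s.readable:
                ready.append(s.fd)
        return ready


def ep_poll_callback(private):
    ep, sock = private
    if sock not in ep.rdllist:   # the kernel checks list linkage instead of scanning
        ep.rdllist.append(sock)  # records exactly which socket became ready


class Epoll:
    def __init__(self):
        self.rdllist = []

    def add(self, sock):
        sock.wq.add(ep_poll_callback, (self, sock))

    def collect_ready(self):
        ready, self.rdllist = [s.fd for s in self.rdllist], []
        return ready
```

With 100 watched sockets and one `data_ready()` event, the select-style waiter performs 100 scans to find the ready fd, while the epoll-style object returns it straight from `rdllist`.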
Interactive: Full Packet Stack Trace
Step through all 9 kernel stages from NIC DMA to read() returning. Toggle the mode at Step 6 to compare the select pollwake O(N) path against the epoll ep_poll_callback O(1) path.
5. § 35.5 — sk_buff Memory Layout
An sk_buff is a descriptor, not a copy of the packet. The packet bytes live in a separate data buffer: a contiguous linear area, optionally extended by page fragments for large or coalesced packets. Two pointers, data and tail, mark the current boundaries of the linear payload. Each layer strips or prepends headers by moving these pointers rather than copying bytes. This is the kernel's zero-copy strategy for moving a packet across protocol layers.
Header Push / Pull at Each Layer
On the receive path, each layer calls skb_pull(n) to move data rightward (toward higher addresses) past its header, shrinking len by n. On transmit, skb_push(n) moves data leftward into headroom to make room for the outgoing header, growing len by n. Neither call copies the payload.
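The pointer arithmetic can be modelled in a few lines. `ToySkb` is a hypothetical userspace stand-in that tracks head, data, tail, and end as integer offsets into one bytearray. One simplification to note: the real skb_push() only moves the pointer and leaves the caller to fill in the header, whereas this toy `push()` also writes the header bytes for convenience.

```python
class ToySkb:
    """Toy sk_buff: four offsets into a single buffer."""

    def __init__(self, size, headroom):
        self.buf = bytearray(size)
        self.head = 0            # start of the allocation
        self.data = headroom     # start of the current payload
        self.tail = headroom     # end of the current payload
        self.end = size          # end of the allocation

    @property
    def len(self):
        return self.tail - self.data

    def put(self, payload):
        """Like skb_put: append bytes by extending tail into tailroom."""
        if self.tail + len(payload) > self.end:
            raise ValueError("not enough tailroom")
        self.buf[self.tail:self.tail + len(payload)] = payload
        self.tail += len(payload)

    def push(self, header):
        """Like skb_push: move data left into headroom, then fill the header."""
        if self.data - len(header) < self.head:
            raise ValueError("not enough headroom")
        self.data -= len(header)
        self.buf[self.data:self.data + len(header)] = header

    def pull(self, n):
        """Like skb_pull: move data right past a header; no bytes are touched."""
        if n > self.len:
            raise ValueError("pull past tail")
        self.data += n

    def view(self):
        return bytes(self.buf[self.data:self.tail])
```

Starting with 54 bytes of headroom, pushing a 20-byte TCP header, a 20-byte IP header, and a 14-byte Ethernet header brings `data` down to offset 0. Pulling the same three headers in receive order moves it back to 54, with the payload bytes never having moved.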
Interactive: sk_buff Memory Explorer
Push and pull protocol headers to watch the data pointer move and len change in real time. Add payload to extend tail. The headroom region shrinks as headers are pushed; pulling them back reclaims that space.
Interview Prep — Must-Know Questions
| Question | Key answer |
|---|---|
| What is NAPI? Why disable the NIC IRQ after the first interrupt? | NAPI (New API) is a batch-poll model: one hardware IRQ schedules a softirq that polls up to budget packets per invocation. The NIC IRQ is disabled immediately to prevent an interrupt storm — at 10 Gbps, millions of IRQs per second would starve the CPU. |
| What does sk_buff contain? Is it the packet data? | sk_buff is a metadata descriptor only. The actual bytes sit in a separate data buffer (a contiguous linear area, plus optional page fragments). The data pointer in sk_buff points into that buffer; no packet bytes live inside the sk_buff struct itself. |
| What do skb_push() and skb_pull() do? Why is this zero-copy? | skb_push(n) moves the data pointer left by n bytes (prepends a header into headroom). skb_pull(n) moves it right (strips a header). No memcpy — pure pointer arithmetic on the same allocation. |
| Name the main functions from netif_receive_skb() to sk_receive_queue. | netif_receive_skb() → ip_rcv() → ip_rcv_finish() → ip_local_deliver() → ip_local_deliver_finish() → tcp_v4_rcv() → __inet_lookup_skb() → tcp_v4_do_rcv() → tcp_rcv_established() → skb_queue_tail(&sk->sk_receive_queue, skb) |
| What is sk_data_ready? Which function does TCP use by default? | sk_data_ready is a function pointer on struct sock, called after skb is enqueued on sk_receive_queue. The default for TCP is sock_def_readable(), which calls wake_up_interruptible_sync_poll() on sk->sk_wq->wait. |
| How does the same wake_up() serve both select and epoll? | Both register a wait_queue_entry_t in the socket's sk_wq, but with different func pointers. select installs pollwake() — wakes the process to re-scan all N fds (O(N)). epoll installs ep_poll_callback() — inserts one epitem into rdllist O(1) then wakes epoll_wait. |
| What is __inet_lookup_skb()? What is its lookup key? | Finds the matching struct sock for an incoming packet. Key is the 4-tuple {saddr, daddr, sport, dport}, hashed into inet_hashinfo.ehash for established sockets. It walks the collision chain per bucket and falls back to the listening-socket lookup if no established socket matches. No socket found → kernel sends TCP RST. |