17. IP & ICMP Deep Dive
IPv4 wire format, fragmentation arithmetic, ICMP error payloads, ping, traceroute, PMTUD, PLPMTUD, and ICMPv6.
1. Overview
IPv4 is the best-effort packet layer: it names endpoints, chooses a next hop, decrements TTL, and either forwards, fragments, or drops each datagram. ICMP is the control channel beside it, carrying errors such as TTL expiry, unreachable ports, and fragmentation-needed feedback.
The protocol field tells the receiver which payload parser comes next: ICMP is 1, TCP is 6, UDP is 17, OSPF is 89, and SCTP is 132. The header checksum covers only the IPv4 header, not the payload. Because TTL is part of that header, every router that decrements TTL must also update the checksum.
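The checksum rule above can be sketched in a few lines of C. This is a minimal illustration, not any router's actual code: `inet_checksum` and `ipv4_forward_ttl` are hypothetical names, and the sketch recomputes the whole checksum for clarity, whereas real forwarding paths typically use the incremental update described in RFC 1624.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over len bytes, returned in host byte order. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)                      /* odd trailing byte is padded with zero */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                 /* fold carries back into the low 16 bits */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical forwarding step: decrement TTL and refresh the header checksum.
 * hdr points at an IPv4 header. Returns 0 if the TTL has expired (the caller
 * would drop the packet and send ICMP Time Exceeded), 1 otherwise. */
int ipv4_forward_ttl(uint8_t *hdr)
{
    size_t hlen = (size_t)(hdr[0] & 0x0F) * 4;   /* IHL is in 32-bit words */
    uint16_t csum;

    if (hdr[8] <= 1)
        return 0;
    hdr[8]--;                                    /* byte 8 is TTL */
    hdr[10] = 0;                                 /* bytes 10-11: checksum */
    hdr[11] = 0;
    csum = inet_checksum(hdr, hlen);
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xFF);
    return 1;
}
```

Running `inet_checksum` over a header whose checksum field is already correct returns 0, which is how a receiver verifies it.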
2. 17.1 - IPv4 Header Fields
The normal IPv4 header is 20 bytes: IHL is 5, counted in 32-bit words. Options can extend it to 60 bytes (IHL 15), but options are slow-path material in many routers. DSCP replaces the older ToS interpretation and maps packets to queueing behaviors such as EF, AF, BE, and CS classes. ECN uses the low two bits of the same byte so routers can mark congestion without dropping packets.
The flags field is three bits wide, but only two of them control fragmentation; the third is reserved and must be zero. DF (Don't Fragment) tells routers to drop an oversized packet and return an ICMP fragmentation-needed error instead of splitting it; MF (More Fragments) tells the destination that more fragments are expected.
| Field | Operational meaning |
|---|---|
| IHL | Header length in 32-bit words; 5 means 20 bytes without options. |
| DSCP / ECN | Queueing policy plus congestion notification bits used by modern congestion control. |
| Identification | Shared by all fragments from one original datagram. |
| TTL | Loop limiter. Each hop decrements it; zero produces ICMP Time Exceeded. |
| Header checksum | IPv4 header only, recalculated at each hop after TTL changes. |
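To make the bit layout concrete, here is a small parsing sketch for the fields discussed in this section. The struct and function names (`ipv4_fields`, `ipv4_parse`) are illustrative assumptions; the byte offsets follow the standard IPv4 header layout.

```c
#include <stddef.h>
#include <stdint.h>

struct ipv4_fields {
    unsigned version, ihl_bytes, dscp, ecn, total_len, id;
    unsigned df, mf, frag_offset_bytes, ttl, protocol;
};

/* Decode the fixed part of an IPv4 header.
 * Returns 0 on success, -1 if the buffer is too short or not IPv4. */
int ipv4_parse(const uint8_t *p, size_t len, struct ipv4_fields *out)
{
    unsigned flags_frag;

    if (len < 20)
        return -1;
    out->version = p[0] >> 4;
    out->ihl_bytes = (p[0] & 0x0Fu) * 4u;        /* IHL counts 32-bit words */
    if (out->version != 4 || out->ihl_bytes < 20 || out->ihl_bytes > len)
        return -1;
    out->dscp = p[1] >> 2;                       /* upper six bits */
    out->ecn = p[1] & 0x03u;                     /* lower two bits */
    out->total_len = ((unsigned)p[2] << 8) | p[3];
    out->id = ((unsigned)p[4] << 8) | p[5];
    flags_frag = ((unsigned)p[6] << 8) | p[7];
    out->df = (flags_frag >> 14) & 1u;           /* bit 15 is reserved */
    out->mf = (flags_frag >> 13) & 1u;
    out->frag_offset_bytes = (flags_frag & 0x1FFFu) * 8u;  /* 8-byte units */
    out->ttl = p[8];
    out->protocol = p[9];
    return 0;
}
```

For example, a second byte of `0xB9` decodes to DSCP 46 (the EF code point) with ECN bits `01`, and a flags/offset word of `0x20B9` decodes to MF=1 at byte offset 1480.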
3. 17.2 - IP Fragmentation
IPv4 fragmentation splits one datagram when the next link MTU is smaller than the packet and DF is clear. Every fragment gets its own IPv4 header, the same Identification value, and an offset measured in 8-byte units, so every fragment payload except the last must be a multiple of 8 bytes. For a 4000-byte datagram over a 1500-byte MTU, the 3980-byte payload becomes two full 1480-byte fragments (offsets 0 and 185) and one 1020-byte tail (offset 370).
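The arithmetic can be checked with a short planner. This is a sketch under one simplifying assumption: every fragment carries a header of the same length `hdr_len`, which holds for option-free packets. `frag_plan` and `struct frag` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct frag {
    unsigned offset_units;   /* Fragment Offset field, in 8-byte units */
    unsigned payload_len;    /* payload bytes carried by this fragment */
    unsigned total_len;      /* header plus payload */
    unsigned mf;             /* More Fragments flag */
};

/* Plan the fragments for a datagram of total_len bytes on a link with the
 * given MTU. Returns the fragment count, or -1 on bad input or if out[]
 * (capacity max_out) is too small. */
int frag_plan(unsigned total_len, unsigned mtu, unsigned hdr_len,
              struct frag *out, int max_out)
{
    unsigned payload, per_frag, off = 0, chunk;
    int n = 0;

    if (hdr_len < 20 || total_len < hdr_len || mtu < hdr_len + 8)
        return -1;
    payload = total_len - hdr_len;
    per_frag = (mtu - hdr_len) & ~7u;    /* round down to a multiple of 8 */
    do {
        if (n == max_out)
            return -1;
        chunk = payload - off < per_frag ? payload - off : per_frag;
        out[n].offset_units = off / 8;
        out[n].payload_len = chunk;
        out[n].total_len = hdr_len + chunk;
        off += chunk;
        out[n].mf = off < payload;       /* only the tail has MF=0 */
        n++;
    } while (off < payload);
    return n;
}
```

Calling `frag_plan(4000, 1500, 20, out, 8)` returns 3 and fills in payload lengths 1480, 1480, and 1020 at offsets 0, 185, and 370.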
Reassembly happens only at the destination. The receiver groups fragments by source IP, destination IP, Identification, and protocol, then sorts by offset and waits until every byte range is present and an MF=0 tail fragment has arrived.
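The completeness test can be sketched with a bitmap of 8-byte blocks. The names `struct reasm`, `reasm_init`, and `reasm_add` are hypothetical, and the sketch assumes the caller keeps one such state per (source, destination, Identification, protocol) key. It ignores timeouts and overlap policy, which a real stack must handle.

```c
#include <stdint.h>
#include <string.h>

#define MAX_BLOCKS 8192              /* 65536 bytes / 8-byte blocks */

struct reasm {
    uint8_t have[MAX_BLOCKS / 8];    /* one bit per 8-byte block received */
    int total_blocks;                /* -1 until the MF=0 tail is seen */
};

void reasm_init(struct reasm *r)
{
    memset(r->have, 0, sizeof r->have);
    r->total_blocks = -1;
}

/* Record one fragment. offset_units is the Fragment Offset field,
 * payload_len is in bytes. Returns 1 when every byte range is present,
 * 0 while fragments are still missing, -1 for a malformed fragment. */
int reasm_add(struct reasm *r, unsigned offset_units,
              unsigned payload_len, int mf)
{
    unsigned blocks, b;
    int i;

    if (payload_len == 0)
        return -1;
    if (mf && payload_len % 8 != 0)  /* only the tail may be unaligned */
        return -1;
    blocks = (payload_len + 7) / 8;
    if (offset_units + blocks > MAX_BLOCKS)
        return -1;
    for (b = offset_units; b < offset_units + blocks; b++)
        r->have[b / 8] |= (uint8_t)(1u << (b % 8));
    if (!mf)                         /* the tail fixes the total length */
        r->total_blocks = (int)(offset_units + blocks);
    if (r->total_blocks < 0)
        return 0;
    for (i = 0; i < r->total_blocks; i++)
        if (!(r->have[i / 8] & (1u << (i % 8))))
            return 0;
    return 1;
}
```

Fragments may arrive in any order: feeding the tail first and the two full fragments afterwards still reports completion only once the last gap is filled.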
Fragmentation is fragile: losing one fragment loses the whole datagram, later fragments lack TCP or UDP ports, and overlapping fragments have a long security history. Modern stacks normally avoid it with DF and PMTUD.
4. 17.3 - ICMP Types and Codes
ICMP error packets quote the original IPv4 header plus at least the first 8 bytes of the offending payload; many modern implementations quote more. For TCP and UDP those 8 bytes are enough to include source and destination ports, so the sender can map the error back to a socket or flow.
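As an illustration of that mapping, the sketch below pulls the flow identifiers out of the quoted packet. `struct quoted_flow` and `icmp_quote_flow` are hypothetical names; the code assumes `icmp` points at the ICMP header of an error message and leaves checking the ICMP type to the caller.

```c
#include <stddef.h>
#include <stdint.h>

struct quoted_flow {
    uint8_t protocol;        /* 6 = TCP, 17 = UDP */
    uint8_t src[4], dst[4];  /* addresses of the ORIGINAL packet */
    uint16_t sport, dport;
};

/* Layout of an ICMP error: 8-byte ICMP header, then the quoted IPv4
 * header, then at least 8 bytes of the quoted payload.
 * Returns 0 on success, -1 if truncated, not TCP/UDP, or a non-first
 * fragment (which carries no port numbers). */
int icmp_quote_flow(const uint8_t *icmp, size_t len, struct quoted_flow *out)
{
    const uint8_t *ip, *l4;
    size_t ihl;
    int i;

    if (len < 8 + 20)
        return -1;
    ip = icmp + 8;
    ihl = (size_t)(ip[0] & 0x0F) * 4;
    if ((ip[0] >> 4) != 4 || ihl < 20 || len < 8 + ihl + 4)
        return -1;
    if (ip[9] != 6 && ip[9] != 17)
        return -1;
    if ((((unsigned)ip[6] & 0x1Fu) << 8 | ip[7]) != 0)  /* fragment offset */
        return -1;
    out->protocol = ip[9];
    for (i = 0; i < 4; i++) {
        out->src[i] = ip[12 + i];
        out->dst[i] = ip[16 + i];
    }
    l4 = ip + ihl;           /* TCP and UDP both start with the two ports */
    out->sport = (uint16_t)(((unsigned)l4[0] << 8) | l4[1]);
    out->dport = (uint16_t)(((unsigned)l4[2] << 8) | l4[3]);
    return 0;
}
```

Note that the quoted source address and port are the local ones, since the quoted packet is the one this host originally sent.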
| Type | Code | Name | Use |
|---|---|---|---|
| 0 | 0 | Echo Reply | Reply to ping. |
| 3 | 0-15 | Destination Unreachable | Network, host, protocol, port, fragmentation-needed, or admin-prohibited failures. |
| 4 | 0 | Source Quench | Deprecated congestion signal. |
| 5 | 0-3 | Redirect | Router tells a host a better next hop on the local link. |
| 8 | 0 | Echo Request | Ping request. |
| 11 | 0-1 | Time Exceeded | TTL expired in transit or fragment reassembly timeout. |
| 12 | 0-2 | Parameter Problem | Bad IP header field. |
Routers and hosts rate-limit ICMP because error generation can amplify traffic. Linux exposes knobs such as net.ipv4.icmp_ratelimit. ICMP redirect, smurf amplification, and historical ping-of-death bugs are examples of why operators filter specific ICMP messages, but blanket filtering breaks PMTUD.
5. 17.4 - ping Internals
ping sends ICMP Echo Request, type 8 code 0, and expects Echo Reply, type 0 code 0. The identifier is often derived from the process ID so multiple ping processes can share the same host, and the sequence number detects loss or reordering.
RTT measurement is simple: put a timestamp in the request payload, receive the same bytes in the reply, then subtract the echoed timestamp from the receive time. Classic ping used raw sockets, which need elevated privileges; modern Linux can also allow unprivileged ICMP datagram sockets (SOCK_DGRAM with IPPROTO_ICMP) for the groups listed in net.ipv4.ping_group_range.
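The message handling can be sketched without touching a socket. The functions below build an Echo Request and compute RTT from a matching Echo Reply; the names, the 8-byte big-endian microsecond timestamp, and the `pad` parameter are assumptions of this sketch rather than the format any particular ping uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* RFC 1071 checksum; ICMP computes it over the whole ICMP message. */
static uint16_t icmp_checksum(const uint8_t *d, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)d[i] << 8) | d[i + 1];
    if (len & 1)
        sum += (uint32_t)d[len - 1] << 8;
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Build an Echo Request (type 8, code 0) whose payload starts with the
 * send time in microseconds. Returns the message length, 0 if buf is small. */
size_t echo_request_build(uint8_t *buf, size_t cap, uint16_t id,
                          uint16_t seq, uint64_t send_us, size_t pad)
{
    size_t len = 8 + 8 + pad;
    uint16_t c;
    int i;

    if (cap < len)
        return 0;
    memset(buf, 0, len);
    buf[0] = 8;                                  /* type; code stays 0 */
    buf[4] = (uint8_t)(id >> 8);
    buf[5] = (uint8_t)(id & 0xFF);
    buf[6] = (uint8_t)(seq >> 8);
    buf[7] = (uint8_t)(seq & 0xFF);
    for (i = 0; i < 8; i++)
        buf[8 + i] = (uint8_t)(send_us >> (56 - 8 * i));
    c = icmp_checksum(buf, len);                 /* checksum field was zero */
    buf[2] = (uint8_t)(c >> 8);
    buf[3] = (uint8_t)(c & 0xFF);
    return len;
}

/* Validate an Echo Reply (type 0, code 0) and return RTT in microseconds,
 * or -1 if it is malformed or does not match id/seq. */
int64_t echo_reply_rtt(const uint8_t *buf, size_t len, uint16_t id,
                       uint16_t seq, uint64_t recv_us)
{
    uint64_t sent = 0;
    int i;

    if (len < 16 || buf[0] != 0 || buf[1] != 0)
        return -1;
    if (icmp_checksum(buf, len) != 0)
        return -1;
    if ((((unsigned)buf[4] << 8) | buf[5]) != id ||
        (((unsigned)buf[6] << 8) | buf[7]) != seq)
        return -1;
    for (i = 0; i < 8; i++)
        sent = (sent << 8) | buf[8 + i];
    if (recv_us < sent)
        return -1;
    return (int64_t)(recv_us - sent);
}
```

Carrying the timestamp in the payload means the sender keeps no per-probe state: the reply itself says when the request left.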
6. 17.5 - traceroute Internals
Traceroute deliberately sends probes with TTL 1, 2, 3, and upward. Each router that decrements TTL to zero drops the probe and returns ICMP Time Exceeded, so the sender learns the address and latency of that hop.
- UDP traceroute uses high destination ports; final hop returns ICMP port unreachable.
- ICMP traceroute sends Echo Requests; final hop returns Echo Reply.
- TCP traceroute sends SYN packets; final hop returns SYN+ACK or RST. SYN probes often pass firewalls that block UDP or ICMP probes.
- Paris traceroute keeps the flow hash stable so ECMP does not make consecutive probes appear to take unrelated paths.
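The TTL-stepping loop can be modeled without a network. The sketch below simulates the UDP variant over a fixed list of routers; `struct sim_path`, `sim_probe`, and `sim_traceroute` are hypothetical names, and the model assumes every router answers and no probe is lost.

```c
#include <stddef.h>

enum probe_result {
    HOP_TIME_EXCEEDED = 11,   /* ICMP type 11: TTL expired at a router */
    DEST_PORT_UNREACH = 3     /* ICMP type 3 (code 3): probe reached target */
};

struct sim_path {
    const char *const *routers;   /* addresses of the routers, in order */
    int n_routers;
    const char *dest;             /* address of the final destination */
};

/* Simulate one probe: each router decrements TTL and answers with
 * Time Exceeded if it reaches zero; otherwise the destination answers. */
int sim_probe(const struct sim_path *p, int ttl, const char **responder)
{
    int i;

    for (i = 0; i < p->n_routers; i++) {
        if (--ttl == 0) {
            *responder = p->routers[i];
            return HOP_TIME_EXCEEDED;
        }
    }
    *responder = p->dest;
    return DEST_PORT_UNREACH;
}

/* Send probes with TTL 1, 2, 3, ... and record who answered each one.
 * Returns the hop count to the destination, or -1 if max_ttl was hit. */
int sim_traceroute(const struct sim_path *p, const char **hops, int max_ttl)
{
    int ttl;

    for (ttl = 1; ttl <= max_ttl; ttl++) {
        if (sim_probe(p, ttl, &hops[ttl - 1]) == DEST_PORT_UNREACH)
            return ttl;
    }
    return -1;
}
```

With three routers in front of the destination, `sim_traceroute` returns 4: three Time Exceeded answers followed by the destination's port unreachable.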
7. 17.6 - PMTUD and PLPMTUD
Path MTU Discovery sends packets with DF=1. If a router sees a next-hop MTU smaller than the packet, it drops the packet and returns ICMP Destination Unreachable, type 3 code 4, carrying the next-hop MTU. The sender lowers its cached PMTU for that destination and, for TCP, lowers the effective MSS so later segments fit.
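A sender's reaction to that message can be sketched as three small steps: read the MTU, lower the cache, derive the MSS. The function names are hypothetical, the MSS formula assumes 20-byte IPv4 and TCP headers with no options, and the 68-byte floor follows RFC 1191. A real stack would also validate the quoted packet before trusting the message.

```c
#include <stddef.h>
#include <stdint.h>

/* Read the next-hop MTU from an ICMP type 3 code 4 message. RFC 1191
 * places it in bytes 6-7 of the ICMP header. Returns 0 for any other
 * message. */
unsigned icmp_frag_needed_mtu(const uint8_t *icmp, size_t len)
{
    if (len < 8 || icmp[0] != 3 || icmp[1] != 4)
        return 0;
    return ((unsigned)icmp[6] << 8) | icmp[7];
}

/* Update the cached path MTU: only ever lower it, and ignore values
 * below 68 bytes (including the 0 sent by pre-RFC 1191 routers, for
 * which a real stack would fall back to a table of common MTUs). */
unsigned pmtu_update(unsigned cached, unsigned reported)
{
    if (reported < 68)
        return cached;
    return reported < cached ? reported : cached;
}

/* Largest TCP payload per segment, assuming no IP or TCP options. */
unsigned tcp_mss_for_pmtu(unsigned pmtu)
{
    return pmtu - 20 - 20;
}
```

A reported MTU of 1400 therefore lowers a 1500-byte cache entry to 1400 and the TCP payload per segment to 1360 bytes.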
The classic blackhole is a firewall that drops ICMP fragmentation-needed messages. The sender keeps retransmitting packets that are too large, but never receives the signal needed to shrink them.
PLPMTUD (RFC 4821, extended to datagram transports as DPLPMTUD in RFC 8899) moves discovery into the packetization layer. Instead of depending on ICMP, the transport sends probes of controlled sizes and treats a successful acknowledgment as proof that the size works. QUIC relies on this style because it runs its transport machinery in user space over UDP. IPv6 makes robust discovery more important: routers never fragment packets, so classic PMTUD works only if ICMPv6 Packet Too Big messages actually reach the sender.
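One way to picture the probing is a binary search between a size known to work and the largest size worth trying. This is a simplified sketch, not the search procedure any RFC mandates: `plpmtud_search` and the callback type are hypothetical, and a lost probe is treated as "too big", whereas a real transport retries because loss may also mean congestion.

```c
#include <stdbool.h>

/* Callback that sends one probe of the given size and reports whether
 * it was acknowledged. Hypothetical interface for this sketch. */
typedef bool (*probe_fn)(unsigned size, void *ctx);

/* Find the largest size in [lo, hi] that the path acknowledges.
 * Assumes lo is already known to work (for example a protocol minimum
 * such as QUIC's 1200 bytes). */
unsigned plpmtud_search(unsigned lo, unsigned hi, probe_fn probe, void *ctx)
{
    while (lo < hi) {
        unsigned mid = lo + (hi - lo + 1) / 2;   /* round up to terminate */

        if (probe(mid, ctx))
            lo = mid;            /* acknowledged: the path carries mid */
        else
            hi = mid - 1;        /* no ack: treat mid as too large */
    }
    return lo;
}

/* Stand-in for the network: the path silently drops anything larger
 * than the MTU stored in *ctx, with no ICMP feedback at all. */
bool fake_path_probe(unsigned size, void *ctx)
{
    return size <= *(const unsigned *)ctx;
}
```

With a simulated path limit of 1400 bytes, searching between 1200 and 1500 converges on 1400 without a single ICMP message, which is the point of the technique.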
8. 17.7 - ICMPv6
ICMPv6 is not just an error protocol. It carries IPv6 errors, ping, Neighbor Discovery, router discovery, redirects, and multicast listener signaling. Filtering ICMPv6 as if it were optional IPv4 ICMP breaks basic IPv6 functions such as address resolution and PMTUD.
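The split between errors and informational messages is visible in the type numbers themselves: ICMPv6 types below 128 are errors and types from 128 up are informational. The sketch below names the common types and marks the ones the paragraph above singles out. The function `icmpv6_needed_for_basic_operation` is a hypothetical illustration of that point, not a complete firewall policy.

```c
#include <stdint.h>

/* Common ICMPv6 type numbers. */
enum icmpv6_type {
    ICMP6_DEST_UNREACH   = 1,
    ICMP6_PACKET_TOO_BIG = 2,
    ICMP6_TIME_EXCEEDED  = 3,
    ICMP6_PARAM_PROBLEM  = 4,
    ICMP6_ECHO_REQUEST   = 128,
    ICMP6_ECHO_REPLY     = 129,
    ICMP6_MLD_QUERY      = 130,   /* multicast listener signaling */
    ICMP6_MLD_REPORT     = 131,
    ICMP6_MLD_DONE       = 132,
    ICMP6_ROUTER_SOLICIT = 133,   /* Neighbor Discovery starts here */
    ICMP6_ROUTER_ADVERT  = 134,
    ICMP6_NEIGH_SOLICIT  = 135,
    ICMP6_NEIGH_ADVERT   = 136,
    ICMP6_REDIRECT       = 137
};

/* The high bit of the type separates errors (0-127) from
 * informational messages (128-255). */
int icmpv6_is_error(uint8_t type)
{
    return type < 128;
}

/* Illustrative only: the types whose loss breaks PMTUD, address
 * resolution, or router discovery. */
int icmpv6_needed_for_basic_operation(uint8_t type)
{
    return type == ICMP6_PACKET_TOO_BIG ||
           (type >= ICMP6_ROUTER_SOLICIT && type <= ICMP6_NEIGH_ADVERT);
}
```

Neighbor Solicitation and Advertisement (135 and 136) do the job ARP does for IPv4, which is why dropping them stops address resolution on the link.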
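The most important operational difference is fragmentation: IPv6 routers never fragment. A source may fragment using the Fragment extension header, but a router that cannot forward a packet because it exceeds the next-hop link MTU drops it and returns ICMPv6 Packet Too Big (type 2) carrying that MTU. Every IPv6 link must support an MTU of at least 1280 bytes, so a sender never needs to go below that size.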
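Handling Packet Too Big mirrors the IPv4 case, with a wider MTU field and a fixed floor. In this sketch the function names are hypothetical; the 32-bit MTU in bytes 4-7 and the 1280-byte minimum are the standard values.

```c
#include <stddef.h>
#include <stdint.h>

#define IPV6_MIN_MTU 1280u

/* Read the MTU from an ICMPv6 Packet Too Big message (type 2). The MTU
 * is a 32-bit big-endian field in bytes 4-7. Returns 0 for any other
 * message or a truncated buffer. */
uint32_t icmpv6_ptb_mtu(const uint8_t *icmp6, size_t len)
{
    if (len < 8 || icmp6[0] != 2)
        return 0;
    return ((uint32_t)icmp6[4] << 24) | ((uint32_t)icmp6[5] << 16) |
           ((uint32_t)icmp6[6] << 8) | icmp6[7];
}

/* Lower the cached path MTU, never below the IPv6 minimum link MTU.
 * A reported value of 0 means "no valid message" and changes nothing. */
uint32_t ipv6_pmtu_update(uint32_t cached, uint32_t reported)
{
    if (reported == 0)
        return cached;
    if (reported < IPV6_MIN_MTU)
        reported = IPV6_MIN_MTU;
    return reported < cached ? reported : cached;
}
```

A message reporting 1400 lowers a 1500-byte cache entry to 1400, while one reporting 1000 only lowers it to 1280.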
9. Kernel Source Pointers
| Area | Linux files and functions |
|---|---|
| IPv4 input | net/ipv4/ip_input.c: ip_rcv, ip_local_deliver |
| IPv4 forwarding | net/ipv4/ip_forward.c: ip_forward |
| Fragmentation | net/ipv4/ip_fragment.c: reassembly queues; net/ipv4/ip_output.c: fragmentation output path |
| ICMP | net/ipv4/icmp.c: icmp_rcv, icmp_send |
| ICMPv6 | net/ipv6/icmp.c and net/ipv6/ndisc.c |
10. Interview Prep
| Question | Answer checkpoint |
|---|---|
| How does traceroute determine each hop? | It sends probes with increasing TTL; the router where TTL reaches zero returns ICMP Time Exceeded. |
| Why is ICMP type 3 code 4 critical? | It tells a DF sender that the packet exceeds the next-hop MTU, allowing PMTUD and MSS reduction. |
| How does IPv4 reassembly know when it is complete? | Fragments share src, dst, ID, and protocol; offsets cover all byte ranges and the final fragment has MF=0. |
| Why use TCP traceroute? | TCP SYN probes to ports like 80 or 443 may pass firewalls that drop UDP or ICMP probes. |
| How does IPv6 fragmentation differ? | Routers never fragment IPv6 packets; only sources fragment, and PMTUD uses ICMPv6 Packet Too Big. |