Part XI — TCP Reliability

§ 11 TCP Reliable Byte Stream

TCP reliability is the machinery between a byte stream API and an unreliable IP path: sliding windows, SACK scoreboards, RTT/RTO estimation, fast loss recovery, tiny-packet controls, keepalive, PMTUD, PAWS, and TCP Fast Open.

1. Reliability Mental Model

The sender can transmit only while bytes in flight fit inside both the congestion window and the advertised receive window: bytes_in_flight <= min(cwnd, rwnd). Reliability then comes from ACK clocking, retransmission timers, SACK evidence, and conservative handling of ambiguous ACKs after retransmits.

| Concern | TCP state | Purpose |
| --- | --- | --- |
| Flow control | rwnd | Receiver protection: do not overrun the peer receive buffer. |
| Congestion control | cwnd | Network protection: do not inject more than the path can carry. |
| Reliability | SND.UNA / SND.NXT | Track oldest unacknowledged byte and next byte to transmit. |
| Loss recovery | SACK scoreboard / RACK | Identify holes or time-late packets without waiting for RTO. |
| Timers | RTO / persist / keepalive | Recover loss, avoid zero-window deadlock, and detect dead peers. |

2. § 11.1 — Sliding Window and Flow Control

The TCP send window is byte-oriented, not packet-oriented. SND.UNA points at the oldest unacknowledged byte, SND.NXT points at the next byte to send, and the usable window is what remains before the smaller of cwnd and rwnd closes.

Window scaling is negotiated only in the SYN exchange, and both sides must offer the option for it to take effect. Without it, the 16-bit TCP window field caps the receive window at 65,535 bytes; with the maximum shift of 14, the window can reach about 1 GiB, enough to fill high-bandwidth, high-latency paths that need hundreds of megabytes in flight.

  • Zero-window probes are driven by the persist timer so a lost window update cannot deadlock the connection forever.
  • Silly window syndrome is avoided by Nagle at the sender and Clark's algorithm at the receiver.
  • The receive window is flow control; it says nothing about whether the network path is congested.

Minimal C Demo — Sliding Window Animator

3. § 11.2 — SACK and D-SACK

Cumulative ACKs report only the next missing byte. SACK adds explicit non-contiguous received ranges, letting the sender maintain a scoreboard and retransmit holes rather than falling back to go-back-N behavior after multiple losses in one window.

D-SACK uses the first SACK block to report a segment that the receiver got more than once. That gives the sender evidence that a retransmission was spurious, often because packet reordering triggered an unnecessary fast retransmit or because the timeout was too aggressive.

  • The TCP option space allows at most 40 bytes, so a SACK option can carry at most four left-edge and right-edge block pairs, or three when the timestamp option is also present.
  • Linux stores the scoreboard as per-segment flags on the queued sk_buffs, updated by tcp_sacktag_write_queue() as SACK blocks arrive.
  • RACK builds on SACK timing by treating recently delivered later packets as evidence that older packets may be lost.

Minimal C Demo — SACK Gap Simulation

4. § 11.3 — RTT Measurement and RTO

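TCP derives the retransmission timeout from a smoothed RTT plus a variance band. In RFC 6298 terms, RTO = SRTT + max(G, 4 * RTTVAR), where RTTVAR is a smoothed mean deviation and G is the clock granularity. The variance term matters because a path with jitter needs a larger safety margin than a stable low-variance path with the same average RTT.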

Karn's algorithm excludes retransmitted segments from RTT sampling because the returning ACK might acknowledge either the original segment or the retransmission. TCP timestamps reduce that ambiguity and also support PAWS, which rejects old timestamped segments after 32-bit sequence numbers wrap on fast links.

Minimal C Demo — RTT/RTO Calculator

5. § 11.4 — Fast Retransmit and RACK

Fast retransmit avoids waiting for the RTO when duplicate ACKs strongly imply a missing segment. Three duplicate ACKs are used because one or two duplicates can be ordinary packet reordering; three is a stronger loss signal.

RACK changes the signal from duplicate ACK count to time. If a later packet is SACKed and an older packet has remained unacknowledged longer than the recent RTT plus a reordering window, RACK can mark the older packet lost. This handles tail losses and reordering better than pure dupACK counting.

  • Reno fast recovery halves the window, retransmits the missing segment, inflates on extra dupACKs, then deflates when recovery completes.
  • NewReno stays in recovery through partial ACKs until every lost segment from that window is acknowledged.
  • RACK plus Tail Loss Probe reduces dependence on long RTO waits at the end of a flight.

6. § 11.5 — Nagle, Delayed ACK, and TCP Cork

Nagle buffers small writes while earlier data is unacknowledged. Delayed ACK waits briefly (around 40 ms on Linux, up to 200 ms on some other stacks) to piggyback an ACK on response data. Each is reasonable alone, but together they can stall a request for a full delayed-ACK interval when an application writes a tiny header, then a tiny body, then waits for the reply.

  • TCP_NODELAY disables Nagle and is common for latency-sensitive interactive protocols.
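  • TCP_QUICKACK asks Linux to ACK immediately, but the setting is not permanent: the kernel can fall back to delayed ACKs, so applications that depend on it usually set it again after each receive.
  • TCP_CORK is application-controlled coalescing: the kernel holds partial segments until the cork is removed (or a 200 ms ceiling expires), which is useful for sending headers and body together as full segments.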

Minimal C Demo — Nagle / Delayed ACK Interaction

7. § 11.6 — Keepalive, PMTUD, PAWS, and TFO

Keepalive is dead-peer detection for otherwise idle connections, not an application heartbeat. Linux defaults are intentionally slow: after 7200 s (two hours) of idle time the kernel sends a probe every 75 s, and after 9 unanswered probes it tears the connection down.

PMTUD avoids fragmentation by setting DF and listening for ICMP fragmentation-needed feedback. If those ICMP messages are filtered, large packets are silently dropped (a PMTUD black hole) until packetization-layer probing (RFC 4821, enabled on Linux through the tcp_mtu_probing sysctl) discovers a smaller safe size.

TCP Fast Open can send data in the SYN after the client has a valid server cookie. The replay risk is the key design constraint: only idempotent or replay-safe requests should use early data.

Minimal C Demo — PMTUD and TFO Decision Trace

8. Kernel Source Pointers

  • net/ipv4/tcp_input.c: ACK processing, SACK tagging, RACK loss detection, PAWS checks.
  • net/ipv4/tcp_output.c: retransmit output, zero-window probes, TFO SYN-data paths.
  • net/ipv4/tcp_timer.c: retransmission, persist, delayed ACK, and keepalive timers.
  • include/net/tcp.h: send/receive window helpers and core TCP control block fields.
  • net/ipv4/tcp_recovery.c: RACK and loss recovery logic in modern Linux.

9. Interview Prep

| Question | Concise answer |
| --- | --- |
| What is the Nagle deadlock? | Nagle holds a small write while delayed ACK waits before acknowledging the prior tiny segment; TCP_NODELAY removes the sender-side hold. |
| How does a SACK scoreboard work? | It records received byte ranges from SACK blocks and leaves only the missing holes eligible for retransmission. |
| Why does Karn exclude retransmitted RTT samples? | The ACK is ambiguous: it might correspond to the original segment or the retransmission, so using it would corrupt SRTT. |
| How is RACK different from dupACK loss detection? | RACK uses time since send and evidence that later packets arrived; it does not require three duplicate ACKs. |
| What is TFO's security limitation? | Data in SYN can be replayed, so early data should be limited to idempotent or replay-safe operations. |