Deep Dive — Study Catalog
26 parts · 300+ sections · CCNP/CCIE routing & switching, TCP/IP & congestion control, QUIC/KCP, IPv6, multicast, NAT traversal, HA, RDMA/RoCE, AI training fabrics, SPDK, libev and the full select/poll/epoll path.
Reference Books
| Code | Book | Focus |
|---|---|---|
| TCPIP1 | TCP/IP Illustrated Vol.1 (2nd ed.) — Fall & Stevens | Wire-level walk of every protocol |
| TCPIP2 | TCP/IP Illustrated Vol.2 — Wright & Stevens | BSD kernel implementation |
| TCPIP3 | TCP/IP Illustrated Vol.3 — Stevens | T/TCP, HTTP, NNTP — historical |
| UNP | UNIX Network Programming Vol.1 (3rd ed.) — Stevens, Fenner & Rudoff | Sockets API bible — select/poll patterns |
| KUROSE | Computer Networking: A Top-Down Approach — Kurose & Ross | University-level CN reference |
| DOYLE1 | Routing TCP/IP Vol.1 (2nd ed.) — Jeff Doyle | IGPs — RIP, EIGRP, OSPF, IS-IS |
| DOYLE2 | Routing TCP/IP Vol.2 — Jeff Doyle | BGP, multicast, NAT, IPv6 |
| HALABI | Internet Routing Architectures (2nd ed.) — Sam Halabi | BGP design — the BGP bible |
| MPLS | MPLS-Enabled Applications — Minei & Lucek | MPLS, L3VPN, EVPN, SR |
| IPV6 | IPv6 Essentials (3rd ed.) — Silvia Hagen | Practical IPv6 reference |
| MCAST | Developing IP Multicast Networks — Beau Williamson | PIM-SM, MSDP, RP design |
| HPBN | High Performance Browser Networking — Ilya Grigorik | TLS, HTTP/2, HTTP/3, WebRTC |
| CCNP | CCNP Enterprise ENCOR / ENARSI Official Cert Guides — Cisco Press | CCNP exam-aligned, hands-on |
| CCIE | CCIE Enterprise Infrastructure / SP / DC v1.x Blueprint — Cisco | Lab-aligned curriculum |
| IBTA | InfiniBand Architecture Spec Vol.1 & 2 — IBTA | RDMA verbs, transports, packet format |
| MELLA | RDMA Aware Networks Programming Manual — NVIDIA/Mellanox | Verbs API and ConnectX behavior |
| NCCL | NCCL Documentation & Source — NVIDIA | Collective algorithms, channels, transports |
| SPDK | SPDK Programmer's Guide — Intel/Linux Foundation | User-space NVMe & NVMe-oF target |
| LIBEV | libev Documentation — Marc Lehmann | Definitive source on libev internals |
| LIBEVT | The libevent Book — Nick Mathewson | Event-loop design patterns |
| TLPI | The Linux Programming Interface — Michael Kerrisk | select/poll/epoll chapters |
Part I — Cisco CCNP / CCIE Foundations
Doyle Vol.1 · CCNP ENCOR · CCIE EI v1.1 blueprint
- §1.1OSI vs TCP/IP Layered Model (encapsulation, MTU/MSS, fragmentation, ethertype)
- §1.2Cisco IOS / IOS-XE / NX-OS / IOS-XR (CLI modes, AAA, SSH, NETCONF/YANG)
- §1.3Certification Path (CCNA → CCNP Enterprise → CCIE EI; CCNP DC → CCIE DC; CCNP SP → CCIE SP)
- §1.4Lab Tooling (GNS3, EVE-NG, Cisco CML/VIRL, Containerlab, Packet Tracer)
- §1.5Network Automation (Ansible, NETCONF, RESTCONF, gNMI, Cisco DNA Center)
- §1.6Reading the CCIE Lab Topology (control plane vs data plane, stateful vs stateless)
Part II — Layer 2 Switching
CCNP ENCOR · Doyle Vol.1 Ch.4-7
- §2.1Ethernet Frame & MAC Forwarding (CAM/TCAM, MAC aging, unicast flooding)
- §2.2VLAN (access/trunk, native VLAN, 802.1Q tagging, voice VLAN, QinQ stacking)
- §2.3VTP (server/client/transparent, pruning, VTPv3 password)
- §2.4STP / RSTP / MSTP (root election, BPDU, port states, edge/PortFast, BPDU guard, root guard, loop guard)
- §2.5EtherChannel / LACP (active/passive, PAgP auto/desirable, load-balance hash, min-links)
- §2.6Switchport Security (port-security, DHCP snooping, DAI, IPSG, storm-control)
- §2.7First-Hop Security (RA Guard, DHCPv6 Guard, IPv6 Source Guard, ND inspection)
- §2.8TRILL / SPB (modern L2 multipath alternatives to STP)
Part III — Stack & De-stack Architectures
Cisco StackWise / VSS / vPC / MLAG · interview-critical
- §3.1StackWise / StackWise-480 / StackWise-1T (Catalyst 3750/3850/9300 ring, active/standby/member roles)
- §3.2StackWise Virtual / SVL (Catalyst 9500/9600 — two physical switches as one logical)
- §3.3VSS (Virtual Switching System on Catalyst 6500/6800 — VSL link, dual-active detection, RPR/SSO)
- §3.4vPC (Nexus 9K/7K Virtual Port Channel — peer-link, peer-keepalive, orphan ports, vPC roles)
- §3.5MLAG / MC-LAG (Arista, Juniper MC-LAG, peer-gateway, ARP sync)
- §3.6De-stack Architecture: Pure L3 ECMP Spine-Leaf (no MLAG, BGP-only, server multi-homing via routing)
- §3.7Failure Domain Comparison (stack vs MLAG vs vPC vs L3-only)
- §3.8ISSU / GIR (in-service software upgrade, Graceful Insertion & Removal)
- §3.9Dual-Active / Split-Brain Detection (enhanced PAgP, fast-hello, BFD-based detection)
- §3.10Migration Stories (StackWise → SVL → leaf-spine — operational lessons)
Part IV — Routing Protocols — IGP
Doyle Vol.1 · CCIE Routing TCP/IP
- §4.1Static & Floating Static Routes (administrative distance, recursive lookup, IP SLA tracking)
- §4.2RIPv2 / RIPng (distance vector, split horizon, poison reverse, hold-down — legacy but useful baseline)
- §4.3OSPFv2 (LSA Type 1-7, areas, NSSA / Totally NSSA, virtual link, DR/BDR election, SPF & iSPF)
- §4.4OSPFv3 (per-link LSAs, IPv6 transport, address-family for v4)
- §4.5OSPF Optimization (LSA throttle, SPF throttle, prefix suppression, fast-hello, BFD)
- §4.6EIGRP (DUAL algorithm, feasible successor, FD/RD, named mode, classic vs wide metric)
- §4.7IS-IS (Level-1/Level-2, NET addressing, CSNP/PSNP, wide metric, multi-topology — common in ISP/DC)
- §4.8Route Redistribution (mutual redistribution, route-map, tag-based loop prevention)
- §4.9Policy Routing (PBR, route-map on interface, BFD-tracked next-hop)
- §4.10Convergence Tuning (BFD sub-second, fast-hello, LFA, RLFA, TI-LFA — only TI-LFA depends on SR)
Part V — Routing Protocols — BGP
Halabi · RFC 4271/7606 · CCIE SP / DC
- §5.1BGP Fundamentals (eBGP vs iBGP, full mesh, AS, TCP/179)
- §5.2Path Attributes (AS_PATH, NEXT_HOP, LOCAL_PREF, MED, ORIGIN, COMMUNITIES, AGGREGATOR)
- §5.3Best Path Selection (13-step algorithm — weight → local-pref → AS-path → origin → MED → eBGP/iBGP → IGP → router-id)
- §5.4Route Reflector & Confederation (scaling iBGP)
- §5.5BGP Communities (well-known, extended, large communities, community-based policy)
- §5.6BGP Security (RPKI/ROA, BGPsec, max-prefix, TTL security via GTSM RFC 5082)
- §5.7BGP Convergence (BGP-PIC, ADD-PATH, BFD, graceful restart)
- §5.8Multipath BGP (eBGP multipath, iBGP multipath, AS-PATH multipath-relax)
- §5.9BGP for Data Center (eBGP unnumbered, allowas-in, FRR, GoBGP, BIRD)
- §5.10BGP Monitoring (BMP RFC 7854, gNMI streaming telemetry)
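The ordering in §5.3 can be sketched as a sort key. This is a simplified illustration, not a vendor implementation: the `Path` class and its field names are hypothetical, only the eight criteria named in §5.3 are modeled (locally-originated preference, oldest-path, cluster-list length and neighbor address are left out), and MED is compared across all paths as if always-compare-med were enabled.

```python
from dataclasses import dataclass

# Lower rank wins for ORIGIN: IGP < EGP < incomplete.
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass(frozen=True)
class Path:
    """Hypothetical container holding only what the comparison needs."""
    weight: int        # Cisco-local attribute, higher wins
    local_pref: int    # higher wins
    as_path: tuple     # shorter wins
    origin: str        # "igp", "egp" or "incomplete"
    med: int           # lower wins (compared unconditionally in this sketch)
    is_ebgp: bool      # eBGP-learned preferred over iBGP-learned
    igp_metric: int    # lower metric to NEXT_HOP wins
    router_id: str     # dotted quad, lowest wins as the last tie-break here


def preference_key(p: Path):
    """Return a tuple where the smallest value is the most preferred path."""
    rid = tuple(int(octet) for octet in p.router_id.split("."))
    return (
        -p.weight,
        -p.local_pref,
        len(p.as_path),
        ORIGIN_RANK[p.origin],
        p.med,
        0 if p.is_ebgp else 1,
        p.igp_metric,
        rid,
    )


def best_path(paths):
    """Pick the preferred path from an iterable of Path objects."""
    return min(paths, key=preference_key)
```

Because the key is a tuple, a later criterion is consulted only when every earlier one ties, which is the step-by-step behavior the arrows in §5.3 describe.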
Part VI — MPLS, VPN & Segment Routing
MPLS-Enabled Applications (Minei) · CCIE SP
- §6.1MPLS Forwarding (label stack, EXP/TC, S bit, TTL, PHP)
- §6.2Label Distribution (LDP, RSVP-TE, BGP-LU, downstream unsolicited vs on-demand)
- §6.3MPLS L3VPN / RFC 4364 (VRF, RD, RT, MP-BGP VPNv4/v6, PE-CE protocols)
- §6.4MPLS L2VPN (VPWS pseudo-wire, VPLS, H-VPLS, control word)
- §6.5EVPN (RFC 7432 — Type 1-4 routes, Type 5 IP prefix per RFC 9136, VXLAN/MPLS data plane, EVPN-VPWS)
- §6.6Traffic Engineering (RSVP-TE, FRR/MPLS-FRR, auto-bandwidth)
- §6.7Segment Routing — SR-MPLS (SID, prefix-SID, adj-SID, anycast SID, IGP shortest path)
- §6.8SRv6 (locator + function + arg, micro-SID/uSID, end.x, end.dt4/dt6, SRv6 BE vs Policy)
- §6.9TI-LFA (Topology-Independent LFA — sub-50ms protection over SR)
- §6.10VXLAN (RFC 7348, head-end replication, flood-and-learn vs EVPN control plane)
- §6.11Geneve / NVGRE / STT (alternative overlays — Geneve is the modern winner)
Part VII — Campus Network Design
Cisco SAFE · CCNP ENCOR / ENARSI
- §7.1Three-Tier Hierarchy (access / distribution / core)
- §7.2Two-Tier Collapsed Core (medium campus, distribution = core)
- §7.3SD-Access (LISP control plane, VXLAN data plane, ISE for policy, fabric edge/border/control)
- §7.4802.1X / MAB / Profiling (Cisco ISE, dot1x event-driven, change-of-authorization)
- §7.5Wireless Integration (CAPWAP local-mode vs FlexConnect, WLC, anchor controller, AP groups)
- §7.6Wi-Fi 6 / 6E / 7 (OFDMA, BSS coloring, 6GHz channels, MLO multi-link operation)
- §7.7PoE / PoE+ / UPoE / 802.3bt (15.4 W → 90 W, LLDP power negotiation)
- §7.8Campus Multicast (PIM-SM, IGMP snooping, AutoRP / BSR)
- §7.9Campus QoS (DSCP marking trust boundary, queueing on access/uplink)
- §7.10Cisco DNA Center / Catalyst Center (assurance, automation, fabric provisioning)
Part VIII — Data Center & Cloud Network
RFC 7938 · Clos · ACI / NSX-T / OVN · interview-critical
- §8.1Spine-Leaf Clos Topology (k-ary fat-tree, ECMP, oversubscription ratio, rail design)
- §8.2BGP-Only Underlay (RFC 7938, eBGP unnumbered, allowas-in, ECMP load-balance)
- §8.3EVPN-VXLAN Overlay (Type 2 MAC/IP, Type 3 IMET, Type 5 IP prefix, anycast gateway)
- §8.4Cisco ACI (APIC controller, EPG, contracts, bridge domain, VRF, policy model)
- §8.5NSX-T (T0/T1 routers, segments, distributed firewall, GENEVE)
- §8.6Open vSwitch / OVN (flow tables, OpenFlow, OVN northbound DB, logical routers/switches)
- §8.7Container CNI (Calico BGP, Cilium eBPF, Flannel VXLAN, Antrea, Multus)
- §8.8Hyperscale Fabrics (Facebook F16/Disaggregated Backbone, Google Jupiter, Azure SONiC)
- §8.9AWS VPC Internals (mapping service, ENI, Hyperplane, GWLB, Transit Gateway)
- §8.10Lossless DC Fabric for RoCE (PFC, ECN, DCQCN, headroom buffering, Spectrum-X)
- §8.11DCI (Data Center Interconnect — OTV, EVPN-VXLAN over DWDM, VPLS legacy)
Part IX — Service Provider / ISP Network
Doyle Vol.2 · CCIE SP · MEF · ITU-T
- §9.1ISP Topology (PE-P-PE backbone, ASBR, route reflector hierarchy, IGP design)
- §9.2Internet Peering (transit, settlement-free peering, full table vs partial vs default-only)
- §9.3IXP (Internet Exchange Points — DE-CIX, AMS-IX, route servers, BGP Looking Glass)
- §9.4BGP/MPLS L3VPN at Carrier Scale (RD/RT design, inter-AS option A/B/C, CSC)
- §9.56PE / 6VPE (carrying IPv6 over an MPLS IPv4 core)
- §9.6Carrier Ethernet & MEF (E-Line, E-LAN, E-Tree, E-Access, OAM 802.1ag/Y.1731)
- §9.7Optical Transport (SDH/SONET legacy, OTN ODU/OTU, DWDM, ROADM, coherent optics)
- §9.85G Transport (xHaul: fronthaul/midhaul/backhaul, eCPRI, IETF network slicing, SR-aware)
- §9.9SD-WAN (Cisco Viptela/Meraki, VMware Velocloud, Versa, Fortinet — overlay tunneling)
- §9.10NFV (VNF chaining, NFV-MANO, OpenStack Tacker, ONAP)
- §9.11Anycast Services (DNS root servers, CDN POPs, BGP-injected /32-/24)
- §9.12DDoS Mitigation (BGP Flowspec, RTBH, scrubbing centers, BCP38)
Part X — TCP State Machine & Connection Lifecycle
TCP/IP Illustrated Vol.1 Ch.12-13 · RFC 9293 · interview-critical
- §10.1TCP Header Anatomy (seq, ack, flags S/A/F/R/P/U/E/C, window, urg-ptr, options TLV)
- §10.2TCP State Machine — 11 States (CLOSED, LISTEN, SYN-SENT, SYN-RCVD, ESTABLISHED, FIN-WAIT-1/2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT)
- §10.3Three-Way Handshake (SYN → SYN-ACK → ACK; ISN selection, RFC 6528 hash-based)
- §10.4Four-Way Close (FIN, half-close, simultaneous close → CLOSING)
- §10.5TIME_WAIT Deep Dive (2*MSL rationale: orphan dup segments + reliable last ACK; tw_reuse vs tw_recycle hazards)
- §10.6SYN Flood & SYN Cookie (encode a coarse MSS index into the ISN via a keyed hash; tradeoffs — window scale and SACK are lost unless TCP timestamps carry them)
- §10.7SYN Queue vs Accept Queue (somaxconn, tcp_max_syn_backlog, listen() backlog meaning)
- §10.8RST Handling & RST Attacks (blind reset, off-path attacker, RFC 5961 challenge ACK)
- §10.9Half-Open / Half-Closed Connections (keepalive vs application heartbeat to detect)
- §10.10Connection Table Sizing (ephemeral port range, conntrack table, source port reuse with SO_REUSEADDR)
- §10.11Linux TCP Tracing (ss -tnipo, /proc/net/tcp, tcptrace, bpftrace tcp:* tracepoints)
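The 11 states in §10.2 and the open/close sequences in §10.3 and §10.4 can be walked with a small transition table. This is a reduced sketch: the event names are illustrative labels, not kernel or RFC identifiers, and several RFC 9293 edges (RST handling, FIN+ACK arriving together, closing from SYN-RCVD) are deliberately omitted.

```python
# (current state, event) -> next state. Event names are made up for this sketch.
TRANSITIONS = {
    ("CLOSED", "active_open"): "SYN-SENT",
    ("CLOSED", "passive_open"): "LISTEN",
    ("LISTEN", "rcv_syn"): "SYN-RCVD",
    ("SYN-SENT", "rcv_syn_ack"): "ESTABLISHED",
    ("SYN-SENT", "rcv_syn"): "SYN-RCVD",        # simultaneous open
    ("SYN-RCVD", "rcv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN-WAIT-1",     # active close
    ("ESTABLISHED", "rcv_fin"): "CLOSE-WAIT",   # passive close
    ("FIN-WAIT-1", "rcv_ack"): "FIN-WAIT-2",
    ("FIN-WAIT-1", "rcv_fin"): "CLOSING",       # simultaneous close
    ("FIN-WAIT-2", "rcv_fin"): "TIME-WAIT",
    ("CLOSING", "rcv_ack"): "TIME-WAIT",
    ("CLOSE-WAIT", "close"): "LAST-ACK",
    ("LAST-ACK", "rcv_ack"): "CLOSED",
    ("TIME-WAIT", "timeout_2msl"): "CLOSED",
}

# Every state that appears on either side of a transition.
STATES = {s for (s, _) in TRANSITIONS} | set(TRANSITIONS.values())


def run(state, events):
    """Apply events in order; raises KeyError on an edge this sketch omits."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state
```

For example, `run("CLOSED", ["active_open", "rcv_syn_ack", "close", "rcv_ack", "rcv_fin"])` ends in TIME-WAIT, which shows why it is the side that closes first that pays the 2*MSL wait discussed in §10.5.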
Part XI — TCP Reliability & Flow Control
TCP/IP Illustrated Vol.1 Ch.14-17 · RFC 5681/6675/9293
- §11.1Sliding Window — Sender (cwnd, snd.una, snd.nxt, snd.wnd; pipe = bytes in flight)
- §11.2Sliding Window — Receiver (rcv.wnd advertise, zero-window probe, silly-window-syndrome avoidance)
- §11.3Cumulative ACK vs SACK vs D-SACK (RFC 2018, RFC 2883 — duplicate detection)
- §11.4RTT & RTO Estimation (Karn's algorithm, Jacobson SRTT/RTTVAR, RFC 6298)
- §11.5Fast Retransmit & Fast Recovery (3 dup ACK trigger, NewReno partial ACK handling)
- §11.6RACK Loss Detection (RFC 8985 — time-based, replaces dup-ACK threshold)
- §11.7Nagle Algorithm vs TCP_NODELAY (small-packet coalescing trade-off, why interactive apps disable it)
- §11.8Delayed ACK (up to 200 ms, 40 ms minimum on Linux; ACK-every-other-segment; the classic Nagle ↔ delayed-ACK stall)
- §11.9TCP_CORK / MSG_MORE (block partial sends until uncorked; sendfile + cork pattern for HTTP)
- §11.10TCP Keepalive (TCP_KEEPIDLE / KEEPINTVL / KEEPCNT; default 7200s — almost useless, app-layer heartbeat preferred)
- §11.11Path MTU Discovery (DF bit, ICMP need-frag, blackhole detection, PLPMTUD RFC 8899)
- §11.12TCP Fast Open (TFO cookie, 0-RTT data on subsequent connects, middlebox interference)
- §11.13Window Scaling, Timestamps, PAWS (RFC 7323 — large windows over high-BDP links)
- §11.14Urgent Pointer & OOB Data (legacy, broken interop — never use it)
- §11.15TCP Linger / SO_LINGER (close behavior, abortive vs graceful)
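The SRTT/RTTVAR computation named in §11.4 can be sketched as below, following the RFC 6298 update rules. The class name, the constructor parameters and the 1 s floor are choices made for this illustration; real stacks differ (Linux, for instance, uses a much lower minimum RTO). Per Karn's algorithm, the caller is assumed to skip samples taken from retransmitted segments.

```python
class RtoEstimator:
    """RFC 6298-style retransmission timeout estimator (illustrative sketch)."""

    ALPHA = 1 / 8   # gain for SRTT
    BETA = 1 / 4    # gain for RTTVAR
    K = 4           # variance multiplier

    def __init__(self, clock_granularity=0.001, min_rto=1.0, max_rto=60.0):
        self.srtt = None
        self.rttvar = None
        self.g = clock_granularity
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.rto = 1.0  # initial RTO before any measurement

    def on_sample(self, r):
        """Feed one RTT sample in seconds and return the new RTO."""
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        rto = self.srtt + max(self.g, self.K * self.rttvar)
        self.rto = min(self.max_rto, max(self.min_rto, rto))
        return self.rto

    def on_timeout(self):
        """Exponential backoff when the retransmission timer fires."""
        self.rto = min(self.max_rto, self.rto * 2)
        return self.rto
```

With the floor disabled (`min_rto=0.0`), a first sample of 100 ms gives SRTT 100 ms, RTTVAR 50 ms and an RTO of 300 ms, which makes the weight of the variance term easy to see.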
Part XII — TCP Congestion Control
RFC 5681/8312/9438 · BBR papers · interview-critical
- §12.1Framework (cwnd, ssthresh, slow start, congestion avoidance, AIMD, fast recovery)
- §12.2Tahoe (slow start + cong-avoid, no fast recovery — historical)
- §12.3Reno (fast retransmit + fast recovery on triple-dup-ACK)
- §12.4NewReno (RFC 6582 — handles multiple losses in one window without SACK)
- §12.5BIC (Binary Increase) — predecessor of CUBIC
- §12.6CUBIC (RFC 8312 — cubic-function cwnd vs time-since-loss; Linux default since 2.6.19)
- §12.7Westwood / Westwood+ (bandwidth estimate to set ssthresh after loss; for wireless lossy links)
- §12.8Vegas (delay-based, RTT increase = congestion; mostly displaced by BBR)
- §12.9Compound TCP (Microsoft — combines AIMD with delay component)
- §12.10BBR v1 (max-bw × min-RTT, 4-state machine: Startup/Drain/ProbeBW/ProbeRTT, pacing)
- §12.11BBR v2 / v3 (ECN integration, fairness with CUBIC, loss tolerance)
- §12.12DCTCP (RFC 8257 — ECN-marked fraction → multiplicative decrease; data center workhorse)
- §12.13PCC / Copa (PCC: online utility optimization; Copa: delay-based target rate; both avoid hand-tuned loss thresholds)
- §12.14ECN (RFC 3168, ECT/CE codepoint, AccECN) & AQM (RED, CoDel, FQ-CoDel, PIE)
- §12.15Bufferbloat & Pacing (TSO/GSO interaction with pacing, sch_fq pacing)
- §12.16Pluggable CC in Linux (net.ipv4.tcp_congestion_control, /proc/sys/net/ipv4/tcp_available_congestion_control)
- §12.17CC Selection Cheat Sheet (DC = DCTCP/BBRv2; long-haul = BBR; mobile = BBR/Westwood; small RTT LAN = CUBIC)
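The "cubic-function cwnd vs time-since-loss" idea in §12.6 can be written out directly. This sketch uses the constants given in RFC 8312 (C = 0.4, beta = 0.7) and models only the window curve, in segments; the TCP-friendly region, fast convergence and HyStart are not included, and the function names are made up for the example.

```python
C = 0.4      # scaling constant (RFC 8312)
BETA = 0.7   # multiplicative decrease factor (RFC 8312)


def cubic_k(w_max):
    """Seconds needed to climb back to w_max after a loss event."""
    return (w_max * (1 - BETA) / C) ** (1 / 3)


def cubic_window(t, w_max):
    """Congestion window (segments) t seconds after the last loss event."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max
```

At t = 0 the curve starts at BETA * w_max, flattens as it approaches w_max at t = K (the concave region), and then accelerates again past it (the convex probing region).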
Part XIII — Modern Transports — QUIC, KCP, SCTP, MPTCP
RFC 9000-9002 · KCP / kcptun · interview-critical
- §13.1Why Move Off TCP — middlebox ossification, head-of-line blocking, slow handshake, no migration
- §13.2QUIC History (Google QUIC 2013 → IETF QUIC RFC 9000 in 2021)
- §13.3QUIC Wire Image (long header — Initial/0-RTT/Handshake/Retry; short header — 1-RTT; Connection ID; varint encoding)
- §13.4QUIC Streams (per-stream flow control, no HoL between streams, four stream types: client- or server-initiated × bidirectional or unidirectional)
- §13.5QUIC Crypto / TLS 1.3 Integration (CRYPTO frames, key updates, packet protection)
- §13.6QUIC Handshake (1-RTT new conn, 0-RTT resumption with a cached TLS 1.3 session ticket — replay-safety constraints)
- §13.7QUIC Connection Migration (stable Connection ID survives NAT rebinding / Wi-Fi → cellular)
- §13.8QUIC Loss Recovery (RFC 9002 — packet number monotonic, ACK frame with ranges, time threshold + packet threshold)
- §13.9QUIC Congestion Control (NewReno default, BBR/CUBIC pluggable, separate from TCP stack)
- §13.10HTTP/3 (QUIC + QPACK header compression — replaces HPACK; H3 frame layer)
- §13.11QUIC Implementations (Cloudflare quiche, Google quiche, Microsoft msquic, Linux kernel UDP GSO+GRO support)
- §13.12KCP — ARQ over UDP (selective repeat, fast retransmit on N skip; FEC and crypto wrapping added by kcptun)
- §13.13KCP Tuning Knobs (nodelay, interval, resend, nc — interactive vs throughput presets)
- §13.14SCTP (RFC 9260 — multi-streaming, multi-homing, message-oriented; used in telecom signaling)
- §13.15MPTCP (RFC 8684 — multi-path TCP, subflows, scheduler, fallback to plain TCP, used by Apple Siri / Korea KT GiGA)
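The varint encoding mentioned in §13.3 is compact enough to sketch in full. The two most significant bits of the first byte select a length of 1, 2, 4 or 8 bytes, and the remaining bits carry the value in network byte order, as described in RFC 9000 Section 16. The function names are made up for this example.

```python
def encode_varint(value: int) -> bytes:
    """Encode an integer using the shortest QUIC variable-length form."""
    if value < 0:
        raise ValueError("varints are unsigned")
    for prefix, length in ((0b00, 1), (0b01, 2), (0b10, 4), (0b11, 8)):
        usable_bits = 8 * length - 2
        if value < (1 << usable_bits):
            return (value | (prefix << usable_bits)).to_bytes(length, "big")
    raise ValueError("value exceeds 2**62 - 1")


def decode_varint(data: bytes):
    """Return (value, bytes_consumed) for the varint at the start of data."""
    if not data:
        raise ValueError("empty buffer")
    length = 1 << (data[0] >> 6)          # 00 -> 1, 01 -> 2, 10 -> 4, 11 -> 8
    if len(data) < length:
        raise ValueError("truncated varint")
    mask = (1 << (8 * length - 2)) - 1    # strip the two length bits
    return int.from_bytes(data[:length], "big") & mask, length
```

As a check against the RFC's own examples, 37 encodes to the single byte 0x25 and 15293 to the two bytes 0x7bbd.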
Part XIV — UDP & UDP-Based Protocols
RFC 768 · TCP/IP Illustrated Vol.1 Ch.10
- §14.1UDP Header (8 bytes — src/dst port, length, checksum); pseudo-header for checksum
- §14.2UDP Socket API (sendto/recvfrom, connect() on UDP for filtering, getsockname for ephemeral port)
- §14.3Batched I/O (recvmmsg / sendmmsg — single syscall for N packets)
- §14.4UDP-Lite (RFC 3828 — partial checksum coverage for tolerant codecs)
- §14.5UDP GSO / GRO (Linux 4.18+ segmentation offload for large UDP sends, GRO on receive; key to QUIC throughput)
- §14.6SO_REUSEPORT (kernel hashing across listening sockets — UDP load balancing without an LB)
- §14.7UDP Fragmentation Risks (DF + ICMP needed; black-holed by middleboxes; QUIC restricts to PMTU)
- §14.8RTP / RTCP (real-time media, sequence + timestamp, SR/RR reports, SSRC)
- §14.9DTLS (TLS over UDP — used by WebRTC for DTLS-SRTP keying, AnyConnect/OpenConnect VPN, CoAP)
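The 8-byte header and pseudo-header checksum from §14.1 can be sketched for IPv4 as follows. The helper names are made up for this example, and the code only builds and verifies the UDP segment itself, not the enclosing IP packet.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole 16-bit word
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """IPv4 pseudo-header: src, dst, zero byte, protocol 17, UDP length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 17, udp_length))


def build_udp(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> bytes:
    """Return header + payload with the checksum field filled in."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    summed = ones_complement_sum(pseudo_header(src_ip, dst_ip, length) + header + payload)
    checksum = ~summed & 0xFFFF
    if checksum == 0:
        checksum = 0xFFFF                    # 0 on the wire means "no checksum"
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload


def verify_udp(src_ip, dst_ip, segment: bytes) -> bool:
    """A valid segment sums to 0xFFFF once the checksum field is included."""
    return ones_complement_sum(pseudo_header(src_ip, dst_ip, len(segment)) + segment) == 0xFFFF
```

The pseudo-header is never transmitted; both ends reconstruct it from the IP header, which is why a NAT that rewrites addresses must also patch the UDP checksum.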
Part XV — IP Layer & ICMP
RFC 791/792/4443 · TCP/IP Illustrated Vol.1 Ch.5-8
- §15.1IPv4 Header (version, IHL, TOS/DSCP/ECN, total len, ID/flags/frag-offset, TTL, proto, checksum, options)
- §15.2IP Fragmentation (DF flag, MF flag, identification, reassembly buffer, fragment overlap attacks)
- §15.3ICMP Types & Codes (Echo Request/Reply, Dest Unreachable codes, Time Exceeded, Redirect, Source Quench legacy)
- §15.4ICMP Probe Tools (ping, traceroute UDP/ICMP/TCP variants, MTR, paris-traceroute)
- §15.5ICMP Errors (orig packet header in payload, rate-limiting per RFC 4443)
- §15.6PMTUD via ICMP (Need Frag with next-hop MTU) — and why it often breaks (firewalls drop ICMP)
- §15.7PLPMTUD (RFC 8899 — search MTU at transport layer, no ICMP dependency, used by QUIC)
- §15.8ICMP Attacks (smurf, ping of death, ICMP redirect injection, ICMP tunneling)
- §15.9ICMPv6 (RFC 4443 — types reorganized, multicast for NDP/MLD; essential to IPv6 operation, so it must not be blanket-blocked)
Part XVI — ARP & Layer 2 Discovery
RFC 826/3927/5227 · TCP/IP Illustrated Vol.1 Ch.4
- §16.1ARP Operation (request broadcast, unicast reply, ARP cache aging, GC threshold)
- §16.2Gratuitous ARP — GARP (announce IP move; MAC change; used by VRRP/keepalived takeover; duplicate-IP detection RFC 5227)
- §16.3Proxy ARP (router answers ARP for hosts in another subnet; use cases: PPP, mobile IP, transparent firewall)
- §16.4ARP Spoofing & Defenses (DAI on switches, static ARP, arpwatch, arp-scan)
- §16.5ARP Sponging (a sponge host answers ARP for unreachable addresses to damp ARP broadcast storms; used at IXPs such as AMS-IX)
- §16.6RARP / InARP (legacy — Reverse ARP for diskless boot; Inverse ARP on Frame Relay)
- §16.7Linux ARP Table Tuning (gc_thresh1/2/3, base_reachable_time, when ARP table overflows in HPC clusters)
Part XVII — IPv6 Deep Dive
RFC 8200/4861/4862/8106/8415 · IPv6 Essentials (Hagen)
- §17.1IPv6 Header (40-byte fixed; flow label; no checksum; no fragmentation by routers, source-only via the Fragment extension header)
- §17.2Address Architecture (Global Unicast 2000::/3, Link-Local fe80::/10, ULA fc00::/7, Multicast ff00::/8, Anycast)
- §17.3Address Types & Scopes (interface-local, link-local, site-local deprecated, global, multicast scopes)
- §17.4EUI-64 Interface ID (MAC-derived, U/L bit flip; privacy concerns)
- §17.5SLAAC — Stateless Address Autoconfig (RFC 4862 — RA + on-link prefix → host generates address; DAD)
- §17.6Privacy Extensions (RFC 8981 — random IID rotation; RFC 7217 stable but opaque IID)
- §17.7NDP — Neighbor Discovery (RFC 4861 — RS, RA, NS, NA, Redirect; replaces ARP + ICMP Redirect)
- §17.8RA — Router Advertisement (M/O flags, prefix info option, MTU option, route info option)
- §17.9DAD — Duplicate Address Detection (NS to solicited-node multicast before claiming)
- §17.10Optimistic DAD (RFC 4429) — start using address before DAD completes
- §17.11RDNSS / DNSSL (RFC 8106 — DNS resolver via RA, replaces stateless DHCPv6 for DNS)
- §17.12DHCPv6 Stateful (IA_NA assignment, RFC 8415, replaces DHCPv4 for managed networks)
- §17.13DHCPv6 Stateless (only options like NTP/SIP, address still SLAAC)
- §17.14DHCPv6-PD — Prefix Delegation (IA_PD; how home routers get a /56 or /60 from ISP)
- §17.15IPv6 Extension Headers (Hop-by-Hop, Routing, Fragment, Destination, AH/ESP — ordering rules)
- §17.16IPv6 Transition Mechanisms (dual-stack, 6to4 deprecated, 6rd, Teredo legacy, NAT64+DNS64, MAP-T/MAP-E, 464XLAT)
- §17.17IPv6 Multicast & MLD (MLDv1/v2, replaces IGMP, runs over ICMPv6)
- §17.18IPv6 Security (RA Guard, DHCPv6 Guard, ND inspection, SAVI, why disabling IPv6 hurts more than helps)
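Two of the address derivations in this part, the EUI-64 interface ID with its U/L bit flip (§17.4) and the solicited-node multicast group that DAD sends its NS to (§17.9), can be sketched with the standard `ipaddress` module. The function names are made up for this example.

```python
import ipaddress


def eui64_interface_id(mac: str) -> bytes:
    """Build the 64-bit modified EUI-64 interface ID from a 48-bit MAC."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a MAC like 00:1a:2b:3c:4d:5e")
    octets[0] ^= 0x02                       # flip the universal/local bit
    return bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])


def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """fe80::/64 prefix followed by the EUI-64 interface ID."""
    prefix = bytes.fromhex("fe80") + bytes(6)
    return ipaddress.IPv6Address(prefix + eui64_interface_id(mac))


def solicited_node(addr) -> ipaddress.IPv6Address:
    """ff02::1:ffXX:XXXX, where XX:XXXX are the low 24 bits of the address."""
    low24 = ipaddress.IPv6Address(str(addr)).packed[-3:]
    base = ipaddress.IPv6Address("ff02::1:ff00:0").packed[:13]
    return ipaddress.IPv6Address(base + low24)
```

For MAC 00:1a:2b:3c:4d:5e this yields link-local fe80::21a:2bff:fe3c:4d5e and solicited-node group ff02::1:ff3c:4d5e, so the MAC is recoverable from the address, which is the privacy concern noted in §17.4.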
Part XVIII — Multicast
RFC 3376/4604/7761/7450 · Doyle Vol.2
- §18.1IPv4 Multicast Addressing (224.0.0.0/4, 224.0.0.x link-local, 232/8 SSM, 233/8 GLOP, 239/8 admin-scoped)
- §18.2MAC Mapping for Multicast (01:00:5e:00:00:00 + lower 23 bits of group; 32→1 collision)
- §18.3IGMP v1 / v2 / v3 (host membership; v3 adds source filter for SSM)
- §18.4IGMP Snooping (L2 switch tracks IGMP joins; multicast not flooded; querier election)
- §18.5PIM-DM — Dense Mode (flood-and-prune, state refresh, only for small dense groups)
- §18.6PIM-SM — Sparse Mode (RP, shared tree (*,G), source tree (S,G), Register encapsulation, SPT switchover)
- §18.7PIM-SSM — Source-Specific Multicast (no RP, IGMPv3/MLDv2 host signals (S,G) directly; for IPTV)
- §18.8RP Discovery (static, AutoRP, BSR — Bootstrap Router; Anycast-RP via MSDP)
- §18.9RPF — Reverse Path Forwarding Check (loop prevention, unicast routing table by default)
- §18.10Source Tree vs Shared Tree (latency vs state trade-off; SPT-threshold knob)
- §18.11MSDP — Multicast Source Discovery Protocol (inter-domain SA messages, Anycast-RP within a domain)
- §18.12MLD v1/v2 (IPv6 host membership over ICMPv6; MLDv2 is SSM-capable)
- §18.13Bidir-PIM (RFC 5015 — many-to-many, single shared tree, no source state)
- §18.14BIER — Bit Indexed Explicit Replication (RFC 8279 — stateless multicast, ingress encodes bitstring)
- §18.15Multicast in EVPN-VXLAN (Type 6/7/8 routes, head-end vs underlay multicast replication)
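The mapping in §18.2 and its 32-to-1 collision can be checked with a few lines. The function name is made up for this example.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet MAC (01:00:5e + low 23 bits)."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not in 224.0.0.0/4")
    low23 = int(addr) & 0x7FFFFF            # 5 of the 28 group bits are dropped
    mac = 0x01005E000000 | low23
    return ":".join(f"{byte:02x}" for byte in mac.to_bytes(6, "big"))
```

Because 28 group bits are squeezed into 23, groups such as 224.1.1.1 and 225.129.1.1 both map to 01:00:5e:01:01:01, so a host NIC filter can pass frames for a group it never joined and the IP layer has to discard them.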
Part XIX — NAT & NAT Traversal
RFC 4787/5128/5389/5766/8489 · WebRTC ICE
- §19.1NAT Types (Full Cone, Restricted Cone, Port-Restricted Cone, Symmetric — STUN-defined behaviors)
- §19.2NAPT / PAT — Port Address Translation (port overload, conntrack tuple, port allocation strategies)
- §19.3NAT Conntrack (Linux nf_conntrack — tuple, expectations, helpers for FTP/SIP)
- §19.4Hairpin / NAT Loopback (internal client → public IP → back to internal server)
- §19.5Carrier-Grade NAT — CGN / NAT444 (port-block allocation, IPv4 exhaustion mitigation, logging volume)
- §19.6NAT64 + DNS64 (RFC 6146/6147 — IPv6-only client to IPv4 server)
- §19.7464XLAT (RFC 6877 — CLAT on phone + NAT64 in network; T-Mobile US)
- §19.8NPTv6 (stateless 1:1 IPv6 prefix translation, RFC 6296)
- §19.9STUN (RFC 8489 — discover external mapping, NAT type detection)
- §19.10TURN (RFC 8656 — relay server when direct fails; allocation, permissions)
- §19.11ICE (RFC 8445 — gather candidates → pair → connectivity checks → nominate; trickle ICE)
- §19.12UDP Hole Punching (mutual STUN, simultaneous send; works for cone NATs, fails for symmetric)
- §19.13TCP Hole Punching (TCP simultaneous open; SYN crossing; sequence & state machine challenges)
- §19.14UPnP IGD / NAT-PMP / PCP (router-mediated mapping; PCP is the modern winner)
- §19.15Tailscale / WireGuard / Nebula NAT Traversal (DERP relays, peer-to-peer establishment)
- §19.16WebRTC End-to-End (signaling out-of-band, ICE for media, DTLS-SRTP for security)
Part XX — DHCP
RFC 2131 · RFC 8415 (DHCPv6)
- §20.1DHCPv4 DORA (Discover broadcast → Offer → Request → Ack; xid correlation; lease renewal T1/T2)
- §20.2DHCP Options (1=mask, 3=router, 6=DNS, 12=hostname, 43=vendor, 51=lease, 55=PRL, 60/61=class/client-id)
- §20.3DHCP Option 82 (relay agent info — circuit-ID, remote-ID; used by DHCP snooping & ISP subscriber-line identification)
- §20.4DHCP Option 121 / 249 (classless static routes; how to push routes via DHCP)
- §20.5DHCP Relay (UDP broadcast on access VLAN → unicast to server; ip helper-address)
- §20.6DHCP Snooping (security — only trusted ports may answer; binding table feeds DAI/IPSG)
- §20.7DHCP Server Implementations (ISC dhcpd legacy, Kea modern, dnsmasq for SOHO, Windows DHCP)
- §20.8DHCP HA / Failover (ISC failover protocol, Kea HA hooks, primary/secondary lease split)
- §20.9DHCPv6 (UDP/546-547, multicast ff02::1:2; SOLICIT/ADVERTISE/REQUEST/REPLY; M/O flags interplay with SLAAC)
- §20.10DHCPv6 Prefix Delegation (IA_PD — how home routers get /56 from ISP)
- §20.11PXE Boot / iPXE (next-server / TFTP server option 66, boot file option 67, UEFI HTTP boot)
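The option codes listed in §20.2 travel as code/length/value triples after the magic cookie. A minimal parser sketch, assuming the caller has already stripped the fixed BOOTP header and the cookie; the function name is made up for this example.

```python
def parse_dhcp_options(blob: bytes) -> dict:
    """Parse a DHCPv4 options field into {code: raw value bytes}."""
    options = {}
    i = 0
    while i < len(blob):
        code = blob[i]
        if code == 0:                        # Pad: single byte, no length
            i += 1
            continue
        if code == 255:                      # End: stop parsing
            break
        if i + 1 >= len(blob):
            raise ValueError("option is missing its length byte")
        length = blob[i + 1]
        value = blob[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("option value is truncated")
        # Repeated codes are concatenated, as long options are split this way.
        options[code] = options.get(code, b"") + value
        i += 2 + length
    return options
```

Parsing is length-driven, so a value byte of 255 (as in a subnet mask) is not mistaken for the End option.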
Part XXI — High Availability
RFC 5798 (VRRP) · RFC 5880 (BFD) · keepalived docs
- §21.1HSRP (Cisco — virtual IP/MAC 0000.0c07.acXX, active/standby, group, priority, preempt)
- §21.2VRRP (RFC 5798 — open standard, virtual MAC 0000.5e00.01XX, master/backup election)
- §21.3GLBP (Cisco — load-balancing FHRP, AVG + AVF, weighted vs round-robin vs host-dependent)
- §21.4keepalived (Linux VRRPv2/v3 daemon, healthcheck scripts, IPVS director integration)
- §21.5BFD — Bidirectional Forwarding Detection (RFC 5880, sub-second protocol-agnostic detection; async/demand/echo modes)
- §21.6BFD Multi-hop (RFC 5883 — for iBGP / RR / IPsec tunnels)
- §21.7Anycast HA (BGP-injected /32 from healthy node; DNS root, public DNS 1.1.1.1 / 8.8.8.8)
- §21.8MC-LAG / MLAG (multi-chassis link aggregation, dual-active forwarding, peer-link, peer-keepalive)
- §21.9LVS / IPVS Modes (DR — direct routing same L2; NAT — return through LB; TUN — IP-in-IP)
- §21.10L4 LB Architectures (Maglev consistent hashing, Katran XDP, GitHub GLB, Cloudflare Unimog)
- §21.11L7 LB / Proxy (HAProxy, NGINX, Envoy — health checks, retries, circuit breaker, outlier detection)
- §21.12Stateful Firewall / NAT HA (conntrackd, pacemaker, session sync)
- §21.13DNS-Based Failover (low TTL, GeoDNS, weighted policy)
- §21.14Cluster Resource Manager (Pacemaker + Corosync, Linux-HA, fencing/STONITH)
Part XXII — RDMA & RoCE
IBTA Vol.1/2 · RoCE v2 (IBTA Annex A17) · Mellanox / NVIDIA docs · interview-critical
- §22.1Why RDMA — kernel bypass, zero-copy, CPU offload (motivation: 100/200/400/800 GbE saturating CPU memcpy)
- §22.2RDMA Operations (SEND/RECV — two-sided; RDMA WRITE / RDMA READ — one-sided; ATOMIC fetch-add / compare-swap)
- §22.3RDMA Verbs API — libibverbs (PD, MR, CQ, QP, WR, WC; ibv_post_send / ibv_post_recv)
- §22.4Queue Pair States (RESET → INIT → RTR → RTS → SQD/SQE → ERR; transitions via ibv_modify_qp)
- §22.5Memory Registration (MR, lkey/rkey, ODP — On-Demand Paging, FRWR — Fast Reg WR)
- §22.6Transport Types (RC — Reliable Connection, UC, UD — Unreliable Datagram, XRC — eXtended Reliable Connection)
- §22.7InfiniBand Fundamentals (HCA, subnet manager OpenSM, LID/GID, GUID, SL/VL — service level / virtual lanes)
- §22.8RoCE v1 — RDMA over Ethernet (L2-only, ethertype 0x8915, no IP routing)
- §22.9RoCE v2 — RDMA over UDP/IP (UDP/4791, routable, used in cloud DC fabrics; uses BTH header)
- §22.10iWARP — RDMA over TCP (RFC 5040 — older, less performant, but tolerates lossy networks)
- §22.11PFC — Priority Flow Control (802.1Qbb — per-priority pause, lossless class for RoCE)
- §22.12ECN & DCQCN (Data Center QCN — RoCE congestion control, ConnectX hardware-offloaded reaction)
- §22.13PFC Deadlock & Headroom Buffering (cyclic buffer dependency between paused queues; DCBX exchange; CC vs PFC roles)
- §22.14Lossless Ethernet Design (DCB stack — PFC + ETS + DCBX; mlnx_qos tooling)
- §22.15Adaptive Routing & Flowlet (NVIDIA AR; per-packet vs per-flow vs flowlet-level spraying)
- §22.16RDMA in Storage (NVMe-oF over RDMA, NFS over RDMA, SMB Direct, Ceph async messenger over RDMA)
- §22.17RDMA Diagnostics (perftest ib_send_bw / ib_write_bw, ibv_devinfo, ibstat, mlx5dump, NVIDIA NEO/UFM)
- §22.18RDMA in K8s (SR-IOV, Multus, RDMA CNI, GPU-Operator, Network Operator)
Part XXIII — AI Training & Inference Networking
NCCL docs · NVIDIA Spectrum-X · Meta RoCE · OCP HPN — interview-critical, intentionally deep
- §23.1Why AI Networking Is Different (synchronous bulk-synchronous traffic, all-to-all incast, microsecond tail-latency)
- §23.2Collective Communication Primitives (Broadcast, Reduce, AllReduce, AllGather, ReduceScatter, AllToAll, Scatter, Gather, Barrier)
- §23.3AllReduce Algorithms — Ring (2(N-1) bandwidth-optimal steps, used by NCCL default for large messages)
- §23.4AllReduce Algorithms — Tree (latency-optimal, log N depth; NCCL Tree for small messages)
- §23.5AllReduce Algorithms — Halving-Doubling, Recursive Doubling, Hierarchical (multi-node + intra-node split)
- §23.6AllToAll Patterns (used in MoE expert routing, sequence parallelism — most network-stressful collective)
- §23.7NCCL — NVIDIA Collective Communications Library (architecture: comm, channel, proxy thread, work queue)
- §23.8NCCL Topology Detection (PCIe / NVLink / NVSwitch / CPU NUMA / NIC affinity → graph search → optimal channels)
- §23.9NCCL Transports (Shared Memory between processes on one node, P2P over PCIe/NVLink, IB Verbs RDMA, Sockets fallback)
- §23.10NCCL Tuning (NCCL_ALGO ring/tree, NCCL_PROTO LL/LL128/Simple, NCCL_NTHREADS, NCCL_BUFFSIZE, NCCL_IB_HCA)
- §23.11NCCL Plugin API (Networking plugin, e.g. AWS OFI plugin for EFA, Microsoft MSCCL plugin for custom algos)
- §23.12MSCCL / MSCCLPP (Microsoft programmable collectives — XML algo description, GPU-driven for inference)
- §23.13RCCL (AMD ROCm fork of NCCL for MI300/MI250)
- §23.14Gloo / MPI Alternatives (Gloo CPU, OpenMPI, MVAPICH2-GDR, UCX, OneCCL — when not NCCL)
- §23.15NVLink / NVSwitch (intra-node fabric — 5th-gen NVLink 1.8 TB/s, NVL72 rack-scale, NVLink-C2C for Grace-Hopper)
- §23.16GPUDirect RDMA (GPU memory ↔ NIC without host bounce — ConnectX + NVIDIA driver path)
- §23.17GPUDirect Storage / GDS (GPU ↔ NVMe direct via cuFile + nvidia-fs)
- §23.18Rail-Optimized Fat-Tree / Clos for AI (per-GPU rail to dedicated leaf, 8 rails per server, no rail crossing)
- §23.19NVIDIA Spectrum-X (Spectrum-4 + BlueField-3, adaptive routing, congestion control, Direct Data Placement for out-of-order delivery)
- §23.20Meta AI Backend Network (RoCE-based, 24K-GPU clusters, FBOSS, dual-rail design)
- §23.21OCP Hyperscale Network for AI / SONiC AI Optimizations (open AI fabric reference)
- §23.22Adaptive Routing in AI Fabrics (per-packet spraying with reordering tolerance via NIC, flowlet, IB AR)
- §23.23ECN/PFC Tuning for AI (DCQCN target rate, headroom, watchdog timer; lossless gotchas)
- §23.24Congestion Control for AI — HPCC (INT-based), Swift (delay-based), AWS SRD, EQDS (receiver-driven)
- §23.25Inference Networking — Disaggregated Prefill/Decode (PD disaggregation, KV-cache transfer over RDMA)
- §23.26Tensor Parallelism / Pipeline Parallelism / Expert Parallelism Traffic Patterns (TP all-reduce per-layer, PP send/recv, EP all-to-all)
- §23.27Communication Frameworks (PyTorch DDP / FSDP, Megatron-LM, DeepSpeed ZeRO — what each demands of the network)
- §23.28AWS EFA / Google JCT / Azure SDN-AI (cloud AI fabric implementations)
- §23.29Backend AI vs Frontend AI (training cluster vs inference serving — different latency/throughput profiles)
- §23.30Worked Example — Tracing One AllReduce (8-GPU node, 2-node, 16 GPUs total: ring chunks, schedule, RDMA WR queue)
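The ring schedule from §23.3, which §23.30 traces on real hardware, can be simulated on plain lists to show where the 2(N-1) steps come from. This is a single-process sketch under simplifying assumptions: the vector length divides evenly by the number of ranks, one "send" is a list copy, and nothing here reflects NCCL's actual channel or proxy implementation.

```python
def ring_allreduce(inputs):
    """Sum equal-length vectors across ranks using the ring schedule.

    inputs[r] is rank r's local vector. Returns one result vector per rank.
    """
    n = len(inputs)
    size = len(inputs[0])
    if size % n:
        raise ValueError("vector length must be divisible by the rank count")
    chunk = size // n
    data = [list(vec) for vec in inputs]

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: N-1 steps. In step s, rank r sends chunk (r - s)
    # to rank r + 1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, data[r][span((r - step) % n)]) for r in range(n)]
        for r, c, payload in sends:          # snapshot first: sends are simultaneous
            dst = (r + 1) % n
            for k, value in enumerate(payload):
                data[dst][c * chunk + k] += value

    # Rank r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: N-1 steps. Each rank forwards the reduced chunk it
    # most recently completed or received, and the receiver overwrites its copy.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][span((r + 1 - step) % n)]) for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][span(c)] = payload

    return data
```

Each rank sends one chunk of size/N per step for 2(N-1) steps, so the bytes sent per rank approach twice the vector size regardless of N, which is the sense in which the ring is bandwidth-optimal.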
Part XXIV — SPDK — Storage Performance Dev Kit
SPDK docs · DPDK shared model · Intel/NVMe
- §24.1SPDK Motivation (kernel I/O stack overhead, polling > interrupts at >1M IOPS, kernel-bypass storage parallel to DPDK)
- §24.2SPDK Architecture (event framework, reactor per lcore, message-passing thread model, no shared mutable state)
- §24.3User-Space NVMe Driver (PCI BAR mmap, SQ/CQ doorbells, MSI-X interrupts via VFIO eventfd, polling preferred)
- §24.4SPDK BDEV Layer (block device abstraction, drivers for NVMe / AIO / virtio-blk / Ceph RBD / iSCSI / NVMe-oF initiator)
- §24.5NVMe-oF Target — Transports (TCP per NVMe TP 8000, RDMA, FC; subsystems, namespaces, controllers)
- §24.6BlobStore & BlobFS (lightweight storage abstraction; not a POSIX FS — used as a RocksDB backend)
- §24.7vhost-user-blk / vhost-user-scsi (zero-copy VM I/O — virtqueue shared mem with QEMU)
- §24.8SPDK + DPDK Shared Model (memory allocator rte_malloc, mempool, ring; runs as DPDK secondary or unified)
- §24.9SPDK NVMe-oF Performance (1M+ IOPS per CPU core, sub-10µs latency over RDMA)
- §24.10Real-World Deployments (Alibaba PolarStore, Ceph BlueStore + SPDK, AWS Nitro storage, Azure Premium SSD v2)
- §24.11SPDK vs io_uring (when each wins — full bypass vs in-kernel batched async)
Part XXV — I/O Multiplexing — select / poll / epoll
TLPI Ch.63 · Linux source fs/select.c, fs/eventpoll.c · interview-critical
- §25.1Five I/O Models (blocking, non-blocking, I/O multiplexing, signal-driven, async I/O — Stevens UNP Ch.6)
- §25.2select(2) — fd_set bitmap, FD_SETSIZE=1024, copy in/out every call, O(n) full scan; sys_select() in kernel
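- §25.3select Internals (fs/select.c — do_select() loop: calls each fd's poll method, which registers on the fd's wait queue via poll_wait() → __pollwait and returns a ready mask used to set the result bits; restartable timeout)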
- §25.4select Limitations (1024 fd cap, expensive setup, no edge-trigger, returns count not list)
- §25.5poll(2) — pollfd array, no FD_SETSIZE limit, still O(n) scan, still copy-in/out every call
- §25.6poll Internals (do_poll() builds wait queues per fd, walks list)
- §25.7epoll Architecture — Big Picture (interest set persistent in kernel; ready list maintained on event; O(1) wait)
- §25.8epoll Kernel Data Structures (struct eventpoll: rbr RB-tree of registered fds, rdllist ready list, ovflist; struct epitem)
- §25.9epoll_create / epoll_create1 (creates anon inode; returns fd; CLOEXEC flag)
- §25.10epoll_ctl ADD / MOD / DEL (insert/update/remove epitem; hooks into target fd's wait queue via ep_ptable_queue_proc)
- §25.11epoll_wait — How Events Reach Ready List (target fd's poll callback ep_poll_callback fires, splices epitem onto rdllist, wakes waiters)
- §25.12Level-Triggered (LT) — Default Semantics (re-reports as long as condition holds; safer; equivalent to poll)
- §25.13Edge-Triggered (EPOLLET) — One Notify Per Transition (must drain till EAGAIN; non-blocking fd required)
- §25.14EPOLLONESHOT (one-shot, must rearm via EPOLL_CTL_MOD; clean handoff between threads)
- §25.15EPOLLEXCLUSIVE (Linux 4.5+ — wake only one waiter; mitigates accept thundering herd in multi-process listeners)
- §25.16epoll Drain Rule (with ET: read until EAGAIN; with LT: optional but faster with batching)
- §25.17Common Pitfalls (close() of fd auto-removes from epoll only if last ref; dup'd fds trap; TOCTOU on unregister)
- §25.18epoll vs kqueue vs IOCP (BSD/macOS unified event filters; Windows completion-based vs readiness-based)
- §25.19epoll vs io_uring (readiness vs true async; io_uring SQ/CQ shared rings; multishot + zero-copy)
- §25.20Reactor Pattern (epoll_wait → dispatch → handler — one thread per loop)
- §25.21Proactor Pattern (true async completion — io_uring, IOCP)
- §25.22Worked Example — Echo Server Progression (select → poll → epoll-LT → epoll-ET; benchmarks; pitfalls at each step)
- §25.23Worked Example — High-Concurrency HTTP Server with epoll-ET + accept4 + SO_REUSEPORT
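The edge-triggered drain rule from §25.13 and §25.16 can be sketched with Python's `select.epoll` wrapper (Linux only). This is a minimal illustration rather than a server: the helper names `drain` and `demo`, the 4-byte chunk size, and the 0.1 s poll timeout are assumptions made for the example.

```python
import select
import socket


def drain(sock, chunk=4):
    """Read until the kernel reports EAGAIN (BlockingIOError in Python)."""
    data = bytearray()
    while True:
        try:
            piece = sock.recv(chunk)
        except BlockingIOError:
            return bytes(data)      # buffer is empty: safe to wait again
        if not piece:
            return bytes(data)      # peer closed the connection
        data += piece


def demo():
    writer, reader = socket.socketpair()
    reader.setblocking(False)       # ET requires a non-blocking fd
    ep = select.epoll()
    ep.register(reader.fileno(), select.EPOLLIN | select.EPOLLET)
    try:
        writer.sendall(b"hello world")
        first = ep.poll(0.1)        # edge: not readable -> readable
        head = reader.recv(4)       # partial read leaves 7 bytes buffered
        second = ep.poll(0.1)       # no new edge, so ET reports nothing
        rest = drain(reader)        # correct pattern: read until EAGAIN
        writer.sendall(b"!")
        third = ep.poll(0.1)        # new data arrives: a fresh edge
        return {"first": first, "head": head, "second": second,
                "rest": rest, "third": third}
    finally:
        ep.close()
        writer.close()
        reader.close()


if __name__ == "__main__":
    for key, value in demo().items():
        print(key, value)
```

The `second` poll is the classic ET trap: data is still buffered, but no notification arrives because no new transition occurred. With level-triggered registration (drop `EPOLLET`), that same poll would report the fd again.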
Part XXVI — Event Loop Libraries — libev / libevent / libuv
libev manual · libevent book · libuv design · interview-critical
- §26.1Why Wrap epoll/kqueue/IOCP — portability, watcher abstraction, timer wheel, signal safety
- §26.2Library Landscape (libev — minimalist by Marc Lehmann; libevent — older, more features; libuv — Node.js, cross-platform incl. Windows)
- §26.3libev Architecture — Loops & Watchers (one struct ev_loop, many ev_*_watcher embedded into user struct)
- §26.4libev Watcher Types (ev_io fd readiness, ev_timer relative, ev_periodic absolute/repeating, ev_signal, ev_child SIGCHLD, ev_stat inotify, ev_idle, ev_prepare/check loop hooks, ev_async cross-thread wakeup, ev_embed nested loop, ev_fork)
- §26.5libev Backend Selection (auto: epoll on Linux, kqueue on BSD/Mac, port on Solaris, poll/select fallback; EVBACKEND_* flags)
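- §26.6libev Core Loop (ev_run, formerly ev_loop) — Phases (1. fork watchers if a fork was detected → 2. prepare watchers → 3. fd reify (push changes to the backend) → 4. compute block time from timers → 5. backend poll (I/O wait) → 6. queue I/O, timer, periodic, idle and check events → 7. invoke pending → repeat)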
- §26.7libev Timer Implementation (4-heap min-heap; O(log n) insert/extract; ev_now caching to avoid repeated clock_gettime)
- §26.8libev fd-to-Watchers Map (ANFD array indexed by fd; multiple watchers per fd via linked list; reify on next loop iteration)
- §26.9libev Priority Queue (priority -2..+2; pending events queued by priority; invoke_pending walks from highest)
- §26.10libev Signal Handling — Safe Async (signalfd if available; pipe-based wakeup fallback; ev_signal watcher coalesces deliveries)
- §26.11libev Fork Handling (ev_loop_fork: re-arm epoll fd in child, re-register signals; child should not call old loop)
- §26.12libev Threading Model (one loop per thread; ev_async only safe cross-thread API; ev_loop is NOT thread-safe)
- §26.13libev Embed Watcher (run a child loop inside parent — used to mix backends, e.g. select inside epoll loop)
- §26.14libev vs libevent (libev = simpler, faster, less feature creep; libevent = HTTP/RPC helpers, evbuffer, evdns, legacy global-base event_init() API deprecated in favor of explicit event_base)
- §26.15libuv Internals — Cross-Platform (epoll/kqueue/IOCP/event ports; thread pool for FS + DNS; req-based async file I/O)
- §26.16Worked Example — libev Echo Server (ev_io accept watcher → spawn ev_io read watcher per conn; ev_timer idle reaper)
- §26.17Worked Example — Tearing Down libev for an Interviewer (loop-by-loop walkthrough; how each watcher type maps to a kernel mechanism)
- §26.18Common Pitfalls (forgetting ev_io_stop on close; ET vs LT mismatch — libev assumes LT; not draining means infinite wakeups)
- §26.19Choosing Library (libev for embedded / minimalist; libevent for HTTP; libuv for cross-platform Node-like; raw epoll if you want zero abstraction)
Appendix — Common Protocols & Well-Known Ports
| Protocol | Transport / Port | Notes |
|---|---|---|
DNS | UDP/53, TCP/53, DoT TCP/853, DoH TCP/443 | UDP for queries, TCP for >512B / zone xfer |
DHCP / BOOTP | UDP/67 server, UDP/68 client | Broadcast at L2 then unicast |
DHCPv6 | UDP/547 server, UDP/546 client | ff02::1:2 link-scoped multicast |
HTTP / HTTPS | TCP/80, TCP/443; HTTP/3 UDP/443 | QUIC over UDP for HTTP/3 |
SSH | TCP/22 | Default for sshd, scp, sftp, ssh tunnels |
BGP | TCP/179 | MD5 / TCP-AO authentication |
OSPF | IP proto 89 | 224.0.0.5 (AllSPFRouters), 224.0.0.6 (AllDRouters) |
EIGRP | IP proto 88 | 224.0.0.10 |
IS-IS | L2 directly (no IP) | AllL1ISs / AllL2ISs MAC |
VRRP | IP proto 112 | 224.0.0.18 |
PIM | IP proto 103 | 224.0.0.13 |
IGMP | IP proto 2 | v2 gen-query 224.0.0.1 |
LDP | TCP/646, UDP/646 hello | Targeted hellos for remote LDP |
GRE | IP proto 47 | Generic encap; classic tunnel |
IPsec ESP | IP proto 50 | AH IP proto 51, IKE UDP/500, NAT-T UDP/4500 |
VXLAN | UDP/4789 (RFC 7348) | Linux historically used 8472 (pre-IANA) |
Geneve | UDP/6081 | Variable-length TLV options |
WireGuard | UDP (configurable, 51820 default) | Single UDP port |
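NVMe-oF / TCP | TCP/4420 (I/O), TCP/8009 (discovery) | Defined by the NVM Express NVMe/TCP transport spec, not an RFC; NVMe/RDMA over RoCE v2 rides UDP/4791 |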
RoCE v2 | UDP/4791 | BTH header inside UDP |
RDMA CM | No fixed well-known port; the application picks a port in the RDMA_PS_TCP port space | CM messages are MADs on QP1 (inside UDP/4791 on RoCE v2); verbs allocates QP numbers |
NTP | UDP/123 | Stratum hierarchy, leap seconds |
Syslog | UDP/514, TCP/6514 TLS | RFC 5424 structured data |
SNMP | UDP/161 query, UDP/162 trap | v3 has authPriv security |
NetFlow / IPFIX | UDP/2055 / 4739 | Templated flow records |
BFD | UDP/3784 single-hop, UDP/4784 multi-hop, UDP/3785 echo | Sub-second liveness |
Appendix — TCP State Machine Quick Reference
| State | Side | Triggered by | Next on normal path |
|---|---|---|---|
CLOSED | both | Initial / after teardown | LISTEN (server) / SYN-SENT (client) |
LISTEN | server | listen() | SYN-RCVD on incoming SYN |
SYN-SENT | client | connect() sends SYN | ESTABLISHED on SYN-ACK |
SYN-RCVD | server | Got SYN, sent SYN-ACK | ESTABLISHED on ACK |
ESTABLISHED | both | Handshake complete | FIN-WAIT-1 (active close) / CLOSE-WAIT (passive close) |
FIN-WAIT-1 | active closer | close() sends FIN | FIN-WAIT-2 (ACK only) / CLOSING (FIN crosses) / TIME-WAIT (FIN+ACK) |
FIN-WAIT-2 | active closer | Peer ACKed our FIN | TIME-WAIT on peer's FIN |
CLOSE-WAIT | passive closer | Got peer's FIN | LAST-ACK after own close() |
LAST-ACK | passive closer | Sent FIN after peer's | CLOSED on peer's ACK |
CLOSING | both (rare) | Simultaneous close — FIN crossing | TIME-WAIT on ACK |
TIME-WAIT | active closer | Final ACK sent | CLOSED after 2*MSL |
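The table above maps directly onto a lookup keyed by (state, event). The sketch below encodes only the normal-path transitions listed in the table; the event labels (`active_open`, `rcv_fin`, and so on) and the `walk` helper are illustrative names, not identifiers from any kernel or RFC.

```python
# (current state, event) -> next state, taken from the table's normal paths.
TRANSITIONS = {
    ("CLOSED", "passive_open"): "LISTEN",
    ("CLOSED", "active_open"): "SYN-SENT",
    ("LISTEN", "rcv_syn"): "SYN-RCVD",
    ("SYN-SENT", "rcv_syn_ack"): "ESTABLISHED",
    ("SYN-RCVD", "rcv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN-WAIT-1",
    ("ESTABLISHED", "rcv_fin"): "CLOSE-WAIT",
    ("FIN-WAIT-1", "rcv_ack"): "FIN-WAIT-2",
    ("FIN-WAIT-1", "rcv_fin"): "CLOSING",
    ("FIN-WAIT-1", "rcv_fin_ack"): "TIME-WAIT",
    ("FIN-WAIT-2", "rcv_fin"): "TIME-WAIT",
    ("CLOSE-WAIT", "close"): "LAST-ACK",
    ("LAST-ACK", "rcv_ack"): "CLOSED",
    ("CLOSING", "rcv_ack"): "TIME-WAIT",
    ("TIME-WAIT", "timeout_2msl"): "CLOSED",
}


def walk(start, events):
    """Return the list of states visited while applying events in order."""
    state = start
    path = [state]
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition for {event!r} in {state}")
        state = TRANSITIONS[key]
        path.append(state)
    return path


if __name__ == "__main__":
    # Client that opens, then performs the active close.
    client = ["active_open", "rcv_syn_ack", "close",
              "rcv_ack", "rcv_fin", "timeout_2msl"]
    print(" -> ".join(walk("CLOSED", client)))
```

Walking the passive closer (`passive_open`, `rcv_syn`, `rcv_ack`, `rcv_fin`, `close`, `rcv_ack`) shows why only the active closer ever sits in TIME-WAIT.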
Appendix — Congestion Control Algorithms Cheat Sheet
| Algo | Signal | cwnd Behavior | Best For |
|---|---|---|---|
Tahoe | Loss (3 dup ACK or RTO) | cwnd = 1, slow start to ssthresh = cwnd/2 | Historical baseline |
Reno | Loss (3 dup ACK) | cwnd = ssthresh = cwnd/2 + fast recovery | Low-loss small-RTT links |
NewReno | Loss + partial ACK | Stay in fast recovery for multiple losses | Pre-SACK era; still default fallback |
CUBIC | Loss (cubic function of t since loss) | Cubic concave then convex around W_max | Long-haul high-BDP TCP — Linux default |
BBR v1 | Bandwidth × min-RTT (model-based) | Pace at estimated BtlBw × min-RTT, no slow-start collapse | Long-haul, lossy, video/CDN |
BBR v2 | BBR signal + ECN + loss | Adds ECN response and CUBIC-fairness | DC + WAN mixed traffic |
Vegas | RTT increase (delay-based) | Reduce on RTT growth, no loss needed | Low-loss links; loses to Reno in mixed |
Westwood | Loss + bandwidth estimate | ssthresh = bw * min-RTT after loss | Wireless / lossy links |
DCTCP | ECN-CE fraction | α-weighted multiplicative decrease per round | Data center fabrics with ECN-marking switches |
CTCP | Loss + delay | AIMD + delay-based component (Microsoft) | Windows long-haul |
HTCP | Loss, time-since-loss | Aggressive cwnd growth on long no-loss periods | Very high-BDP scientific links |
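The difference between the Reno and DCTCP rows is easiest to see as arithmetic. The sketch below assumes cwnd is measured in segments, uses the gain g = 1/16 recommended in RFC 8257, and applies a 2-segment floor as a simplification; the function names are illustrative.

```python
def reno_on_loss(cwnd):
    """Reno on 3 duplicate ACKs: halve the window (floor of 2 segments)."""
    return max(cwnd / 2.0, 2.0)


def dctcp_update_alpha(alpha, marked, acked, g=1.0 / 16):
    """Once per window: EWMA of the fraction of ECN-marked data."""
    fraction = marked / acked if acked else 0.0
    return (1.0 - g) * alpha + g * fraction


def dctcp_on_ecn(cwnd, alpha):
    """Cut in proportion to alpha: light marking gives a gentle decrease."""
    return max(cwnd * (1.0 - alpha / 2.0), 2.0)


if __name__ == "__main__":
    cwnd = 100.0
    print("Reno after one loss:", reno_on_loss(cwnd))
    alpha = 0.0
    for rnd in range(5):
        # Assume 10% of each window is CE-marked for five rounds.
        alpha = dctcp_update_alpha(alpha, marked=10, acked=100)
        cwnd = dctcp_on_ecn(cwnd, alpha)
        print(f"round {rnd}: alpha={alpha:.4f} cwnd={cwnd:.2f}")
```

With alpha pinned at 1.0 (every packet marked), the DCTCP cut equals Reno's halving; with sparse marking the window barely moves, which is what keeps switch queues short without collapsing throughput.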
Appendix — I/O Multiplexing API Comparison
| API | OS | Style | Complexity | Limit | Notes |
|---|---|---|---|---|---|
select | POSIX everywhere | Readiness | O(n) scan, O(n) copy | FD_SETSIZE = 1024 | Bitmap in/out, oldest, broken at scale |
poll | POSIX everywhere | Readiness | O(n) scan, O(n) copy | RLIMIT_NOFILE | Better than select; still O(n) |
epoll | Linux 2.6+ | Readiness | O(1) wait, O(log n) ctl | RLIMIT_NOFILE | ET / LT, EPOLLEXCLUSIVE, persistent kernel state |
kqueue | BSD / macOS | Readiness + filters | O(1) wait | kern.maxfilesperproc | Filters on fs, signals, timers, processes |
IOCP | Windows | Completion | O(1) | — | True async — kernel completes I/O, posts to queue |
io_uring | Linux 5.1+ | Completion (true async) | O(1) batched | RLIMIT_NOFILE | SQ/CQ shared rings, SQPOLL, multishot, registered FDs/buffers |
AIO (libaio) | Linux | Completion | O(1) batched | — | Only O_DIRECT; effectively replaced by io_uring |
POSIX AIO | POSIX | Completion | User-thread emulation | — | Slow — glibc emulates with threads |
Appendix — High Availability Mechanisms
| Mechanism | Layer | Failover Time | Common Use |
|---|---|---|---|
HSRP | L3 first-hop (Cisco) | ~3-10s default, sub-second tuned | Default-gateway redundancy on access network |
VRRP | L3 first-hop (RFC) | ~3s default, sub-second tuned | Open-standard FHRP, used by keepalived |
GLBP | L3 first-hop + LB (Cisco) | Like HSRP | Active-active gateway load balancing |
BFD | L3-agnostic liveness | <50ms typical | Speed up OSPF/BGP/static convergence |
MC-LAG / vPC | L2 | Sub-second | Server multi-homing without STP blocking |
StackWise / VSS | Chassis | SSO sub-second; RPR slower (standby must finish booting) | Two physical → one logical control plane |
Anycast (BGP) | L3 routed | BGP convergence (1-30s) | DNS, CDN, public services |
LFA / TI-LFA | IGP | <50ms | IGP-driven sub-50ms protection |
MPLS FRR | MPLS | <50ms | RSVP-TE backup tunnels |
Pacemaker / Corosync | Service | Seconds | Resource manager + STONITH |
Keepalived + IPVS | L4 LB | Sub-second | Linux virtual server with VRRP failover |
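The "~3s default" VRRP figure and the "<50ms" BFD figure in the table follow from simple timer arithmetic. A minimal sketch, assuming the VRRPv2 formulas from RFC 3768 and a simplified BFD detection time (detect multiplier times the negotiated interval); the function names are illustrative.

```python
def vrrp_master_down_interval(priority, advert_interval=1.0):
    """VRRPv2: Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time."""
    if not 1 <= priority <= 254:
        raise ValueError("backup router priority must be in 1..254")
    skew_time = (256 - priority) / 256.0    # seconds; higher priority = smaller skew
    return 3 * advert_interval + skew_time


def bfd_detection_time_ms(detect_mult, interval_ms):
    """Simplified: missed-packet budget times the negotiated interval."""
    return detect_mult * interval_ms


if __name__ == "__main__":
    print("VRRP, priority 100:", vrrp_master_down_interval(100), "s")
    print("VRRP, priority 254:", vrrp_master_down_interval(254), "s")
    print("BFD 3 x 15 ms:", bfd_detection_time_ms(3, 15), "ms")
```

The skew term is why the highest-priority backup times out first and wins the election without a tie-break exchange.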
Appendix — AI Collective Operations Cheat Sheet
| Op | What it does | Bandwidth Cost (N ranks, M bytes) | When used |
|---|---|---|---|
Broadcast | 1 → all (root sends to everyone) | M (per link in tree) | Initial weight distribution |
Reduce | all → 1 (sum/max/min at root) | M | Aggregating loss / metrics to rank 0 |
AllReduce | all → all of reduced value | 2M(N-1)/N (ring; bandwidth-optimal) | DDP / FSDP gradient sync — most common |
AllGather | concat tensors from all → all | M(N-1)/N | FSDP unshard, sequence parallelism gather |
ReduceScatter | elementwise reduce + scatter slices | M(N-1)/N | FSDP gradient pre-shard; half of AllReduce |
AllToAll | rank i sends slice j to rank j | M(N-1)/N — but every pair | MoE expert dispatch, sequence parallelism |
Scatter | 1 → all (each gets a slice) | M(N-1)/N | Initial data partitioning |
Gather | all → 1 (concatenate slices) | M(N-1)/N | Collect outputs to rank 0 |
Barrier | synchronize without data | log N rounds | Phase boundaries |
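The 2M(N-1)/N figure in the AllReduce row can be checked by simulating the ring schedule: a reduce-scatter phase of N-1 steps followed by an all-gather phase of N-1 steps. The sketch below is a pure-Python model that counts elements rather than bytes and uses sum as the reduction; `ring_allreduce` and its helpers are hypothetical names, not NCCL APIs.

```python
def ring_allreduce(buffers):
    """Simulate a ring AllReduce (sum); return (results, elements sent per rank)."""
    n = len(buffers)
    size = len(buffers[0])
    if size % n != 0:
        raise ValueError("buffer length must be divisible by the rank count")
    chunk = size // n
    data = [list(b) for b in buffers]       # work on copies
    sent = [0] * n

    def ring_step(chunk_of, combine):
        # Snapshot every outgoing chunk first so all sends in one step
        # behave as if they happened simultaneously.
        outgoing = []
        for rank in range(n):
            c = chunk_of(rank)
            outgoing.append((c, data[rank][c * chunk:(c + 1) * chunk]))
        for rank, (c, payload) in enumerate(outgoing):
            dst = (rank + 1) % n            # right-hand neighbor on the ring
            for offset, value in enumerate(payload):
                i = c * chunk + offset
                data[dst][i] = combine(data[dst][i], value)
            sent[rank] += chunk

    # Phase 1, reduce-scatter: the receiver accumulates the incoming chunk.
    for step in range(n - 1):
        ring_step(lambda r, s=step: (r - s) % n, lambda old, new: old + new)
    # Phase 2, all-gather: the receiver overwrites with the reduced chunk.
    for step in range(n - 1):
        ring_step(lambda r, s=step: (r + 1 - s) % n, lambda old, new: new)
    return data, sent


if __name__ == "__main__":
    ranks = [[r + 1] * 8 for r in range(4)]     # 4 ranks, M = 8 elements each
    result, sent = ring_allreduce(ranks)
    print(result[0], sent)
```

For N = 4 and M = 8 each rank sends 2 * 8 * 3 / 4 = 12 elements, and that count stays near 2M as N grows, which is why the ring is called bandwidth-optimal.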
Appendix — RDMA Verb Operations
| Operation | Sided | Receiver CPU? | Notes |
|---|---|---|---|
SEND / RECV | Two-sided | Yes (must post RECV) | Like sockets — needs matching RECV WR posted |
RDMA WRITE | One-sided | No | Initiator writes into peer's pre-registered MR using rkey |
RDMA WRITE with IMM | One-sided + signal | Yes (consumes RECV) | Write + 4-byte immediate value triggers receive completion |
RDMA READ | One-sided | No | Initiator reads from peer's MR; lower throughput than WRITE |
ATOMIC FETCH_ADD | One-sided RMW | No | 8-byte atomic, consistent across HCA & host CPU only on certain hw |
ATOMIC CMP_SWP | One-sided RMW | No | Compare-and-swap on remote 8 bytes |
SEND with INVALIDATE | Two-sided | Yes | Invalidates a receiver-side rkey atomically with delivery |
Appendix — libev Watcher Types Quick Reference
| Watcher | Triggered by | Mapped to |
|---|---|---|
ev_io | fd readable / writable | epoll_ctl ADD on backend |
ev_timer | Relative timeout (after X seconds, optional repeat) | Min-heap; loop computes nearest deadline for epoll_wait timeout |
ev_periodic | Absolute time / cron-like reschedule callback | Min-heap with reschedule cb |
ev_signal | POSIX signal received | signalfd or pipe + sigaction handler |
ev_child | SIGCHLD for a specific PID | Internal signal watcher + waitpid |
ev_stat | File stat changes (path-based) | inotify if available, else periodic stat() |
ev_idle | No other events pending | Run after all ready events processed |
ev_prepare | Before each loop iteration's poll | Hook used by glue layers (Perl, etc.) |
ev_check | After each poll, before invoke | Hook for glue layers |
ev_async | ev_async_send() from another thread | eventfd / pipe wakeup — ONLY safe cross-thread API |
ev_embed | Inner ev_loop made pollable as one fd | Run a kqueue inside an epoll loop, etc. |
ev_fork | After fork() in child | Cleanup on fork |
ev_cleanup | Loop destroyed | Final teardown hook |
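The ev_timer row says the loop derives its poll timeout from the nearest deadline in a min-heap. A minimal sketch of that idea follows, assuming Python's binary `heapq` in place of libev's 4-heap; the `TimerHeap` class, its method names, and the `max_block` cap are hypothetical and do not mirror libev's internals.

```python
import heapq
import itertools


class TimerHeap:
    """Min-heap of (deadline, sequence, callback) entries."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker: callbacks are never compared

    def add(self, deadline, callback):
        heapq.heappush(self._heap, (deadline, next(self._seq), callback))

    def next_timeout(self, now, max_block=60.0):
        """How long the backend poll may block before a timer is due."""
        if not self._heap:
            return max_block
        return min(max(self._heap[0][0] - now, 0.0), max_block)

    def expire(self, now):
        """Run every callback whose deadline has passed; return the count."""
        fired = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, callback = heapq.heappop(self._heap)
            callback()
            fired += 1
        return fired


if __name__ == "__main__":
    log = []
    timers = TimerHeap()
    timers.add(5.0, lambda: log.append("slow"))
    timers.add(2.0, lambda: log.append("fast"))
    print("block for", timers.next_timeout(now=0.0))
    print("fired", timers.expire(now=2.5), log)
```

An event loop built on this would pass `next_timeout(now)` as the timeout of its poll call, then call `expire(now)` after the poll returns, caching `now` once per iteration as §26.7 describes.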
Appendix — Cisco Certification Path Quick Reference
| Track | CCNA | CCNP (core + concentration) | CCIE (lab) |
|---|---|---|---|
| Enterprise | 200-301 CCNA | ENCOR 350-401 + ENARSI / ENSDWI / ENSLD / ENWLSD / ENWLSI / etc. | CCIE Enterprise Infrastructure (Lab v1.x) |
| Data Center | 200-301 CCNA | DCCOR 350-601 + DCID / DCACI / DCACIA / DCAUI | CCIE Data Center |
| Service Provider | 200-301 CCNA | SPCOR 350-501 + SPRI / SPVI / SPCNI / SPAUI | CCIE Service Provider |
| Security | 200-301 CCNA | SCOR 350-701 + SISE / SNCF / SVPN / SWSA / SAUTO | CCIE Security |
| Collaboration | 200-301 CCNA | CLCOR 350-801 + CLICA / CLACCM / CLCEI / CLAUTO | CCIE Collaboration |
| DevNet | DEVASC | DEVCOR 350-901 + concentration | CCDE / DevNet Expert |