Part XXI - High Availability

27. Server / Service HA

This chapter treats anycast, LVS/IPVS, Maglev, Katran, HAProxy, Envoy, conntrackd, Pacemaker, and DNS failover as layered tools for service availability.

1. 27.1 - Anycast with BGP /32

Anycast publishes the same service IP from multiple locations. For a DNS resolver, CDN edge, or DDoS scrubber, the application sees one address while BGP policy decides which PoP receives each client's packets.

Failover is a routing event: a health agent withdraws the host route, upstream routers run normal best-path selection, and new packets move to another PoP after convergence.

Minimal C Demo - Anycast Route Selection

2. 27.2 - LVS / IPVS Modes

LVS is a kernel load balancer. The Director owns the VIP and IPVS picks a real server. The key design choice is where the return packet goes and whether the real server can share L2, hold the VIP locally, or live behind a tunnel.

| Mode | Header action | Return path | Main constraint |
|---|---|---|---|
| NAT | DNAT request, SNAT response. | Through Director. | Director handles both directions. |
| DR | Destination MAC rewrite; IP destination remains VIP. | Direct from RS to client. | Same L2 and ARP suppression on RS. |
| TUN | IP-in-IP encapsulation to RS. | Direct from RS to client. | RS must terminate tunnel and hold VIP on loopback. |
| FULLNAT | Source and destination NAT. | Often through LB tier. | Original client IP needs TOA or proxy metadata. |

Minimal C Demo - LVS Header Transform

Minimal C Demo - LVS Scheduling

3. 27.3 and 27.4 - L4/L7 Load Balancers

Maglev and Katran solve high-scale L4 distribution with stable flow hashing. Maglev uses a large lookup table so every LB instance makes the same backend choice; Katran pushes the fast path into XDP and BPF maps.

Minimal C Demo - Maglev Slot Stability

HAProxy and Envoy operate higher in the stack. HAProxy is direct and operationally compact: frontends, backends, ACLs, health checks, stick tables, and a runtime socket. Envoy is a programmable proxy with xDS, filter chains, clusters, retries, outlier detection, circuit breakers, and rich telemetry.

Minimal C Demo - HAProxy Stick Table

Minimal C Demo - Envoy Circuit Breaker

4. 27.5 - conntrackd

Active/backup firewalls and NAT load balancers need more than VIP movement. Existing TCP sessions also depend on Linux conntrack tuples, sequence state, and NAT mappings. conntrackd replicates that state to the standby continuously, so it is already there when the active node fails.

Minimal C Demo - Conntrack State Sync

5. 27.6 - Pacemaker / Corosync

Corosync supplies cluster messaging, membership, and quorum. Pacemaker decides where resources should run, calls resource agents to start or monitor them, and requires fencing before a survivor takes over shared state.

STONITH is not ceremony. If a two-node cluster loses communication, both nodes can believe they are the survivor. Fencing proves the old owner is dead before promoting a database, VIP, or shared filesystem elsewhere.

Minimal C Demo - Split-Brain Prevention

6. 27.7 - DNS Failover

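DNS failover is useful for regional or provider-level steering, but it is not instant. Recovery time is the sum of health-check detection, the authoritative record change, recursive resolver TTL expiry, and the application's next retry. No client moves before the first two finish, and a resolver that cached the old answer just before the change keeps serving it for a full TTL.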

| Technique | Best at | Weakness |
|---|---|---|
| DNS failover | Coarse regional steering and primary/secondary service records. | Bounded by caches and client retry behavior. |
| BGP anycast | Infrastructure-level nearest PoP selection and DDoS absorption. | Needs routing control and careful state handling. |
| L4 LB failover | Fast local service failover with health checks. | Stateful sessions need sync or reconnect logic. |

Minimal C Demo - DNS Recovery Timer

7. Core Mechanism Walkthrough

Background: A public HTTPS service uses a BGP anycast VIP. Inside each PoP, two keepalived LVS Directors front a pool of real servers in DR mode. Stateful NAT is not on the hot path, but firewall state exists on the edge pair.

Plan: Fail small first. Remove bad real servers with local health checks, promote the standby Director with conntrack state if the active node dies, and withdraw the BGP /32 only when the whole PoP is unhealthy.

| Failure | Detector | Recovery action | Blast radius |
|---|---|---|---|
| One real server fails | LVS or HAProxy health check. | Remove RS from scheduler. | Only flows to that RS reconnect or drain. |
| Active Director fails | VRRP or BFD via keepalived. | Backup owns VIP; conntrackd commits state if needed. | Local PoP only. |
| PoP service fails | External health checker. | Withdraw anycast /32 or change DNS answer. | Regional clients reroute. |
| Cluster split-brain risk | Corosync membership and quorum. | Fence old owner before promotion. | Protects shared state from dual ownership. |

8. Source and Tooling Pointers

  • ipvsadm -Ln --stats shows VIPs, real servers, schedulers, and counters.
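  • conntrack -L lists the local kernel connection table; conntrackd -s prints replication statistics, and conntrackd -i and conntrackd -e dump the internal and external caches.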
  • crm_mon -1 summarizes Pacemaker resource placement, quorum, and failed actions.
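  • show ip bgp 203.0.113.53/32 on an upstream router, or BMP telemetry, shows whether the anycast /32 is currently advertised.
  • show table sent to HAProxy's stats socket lists stick tables and their entries; show stat and show servers state on the same socket report server health.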

9. Interview Prep

Questions and concise answers

| Question | Answer |
|---|---|
| Why is LVS DR faster than NAT mode? | The Director handles only inbound traffic; real servers reply directly to clients. |
| Why does DR mode require ARP tuning? | Real servers hold the VIP on loopback but must not answer LAN ARP for it, or they bypass the Director. |
| How does Maglev reduce disruption? | Surviving backends keep most of their lookup-table slots; the slots owned by the removed backend are reassigned, and only a small fraction of the others move. |
| Why sync conntrack state? | Without replicated tuples and NAT state, standby takeover breaks existing stateful TCP flows. |
| Why is STONITH mandatory for serious Pacemaker designs? | It prevents two nodes from owning the same writable resource after a membership split. |
| Why is DNS failover slow? | Health checks, authoritative changes, resolver caches, client caches, and retry intervals all add delay. |