27. Server / Service HA
Anycast, LVS/IPVS, Maglev, Katran, HAProxy, Envoy, conntrackd, Pacemaker, and DNS failover as layered service availability tools.
1. 27.1 - Anycast with BGP /32
Anycast publishes the same service IP from multiple locations. For a DNS resolver, CDN edge, or DDoS scrubber, the application sees one address while BGP policy decides which PoP receives each client's packets.
Failover is a routing event: a health agent withdraws the host route, upstream routers run normal best-path selection, and new packets move to another PoP after convergence.
Minimal C Demo - Anycast Route Selection
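A minimal sketch of the failover sequence described above, as seen from a single upstream router. The names `struct pop`, `select_pop`, and `withdraw_route` are hypothetical, and AS-path length stands in for the full BGP best-path comparison, which is an assumption made to keep the model small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One PoP as seen from a single upstream router (hypothetical model). */
struct pop {
    const char *name;
    int as_path_len;   /* assumption: stand-in for full best-path selection */
    bool advertised;   /* false once the health agent withdraws the /32 */
};

/* Return the index of the preferred PoP, or -1 if no route remains. */
int select_pop(const struct pop *pops, size_t n)
{
    int best = -1;

    assert(pops != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!pops[i].advertised)
            continue;                      /* withdrawn routes are invisible */
        if (best < 0 || pops[i].as_path_len < pops[best].as_path_len)
            best = (int)i;
    }
    return best;
}

/* Health agent action: withdraw the host route for an unhealthy PoP. */
void withdraw_route(struct pop *p)
{
    assert(p != NULL);
    p->advertised = false;
}
```

Calling `withdraw_route` on the current best PoP and then `select_pop` again models convergence: new packets follow the next-best advertisement, while flows already pinned to the failed PoP are not modeled here.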
2. 27.2 - LVS / IPVS Modes
LVS is a kernel load balancer. The Director owns the VIP and IPVS picks a real server. The key design choice is where the return packet goes and whether the real server can share L2, hold the VIP locally, or live behind a tunnel.
| Mode | Header action | Return path | Main constraint |
|---|---|---|---|
| NAT | DNAT request, SNAT response. | Through Director. | Director handles both directions. |
| DR | Destination MAC rewrite; IP destination remains VIP. | Direct from RS to client. | Same L2 and ARP suppression on RS. |
| TUN | IP-in-IP encapsulation to RS. | Direct from RS to client. | RS must terminate tunnel and hold VIP on loopback. |
| FULLNAT | Source and destination NAT. | Through the LB tier, because the RS replies to an LB-local source address. | Original client IP needs TOA or proxy metadata. |
Minimal C Demo - LVS Header Transform
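The table above can be sketched as a single transform function. This is an illustrative model, not IPVS code: `struct pkt`, `struct real_server`, and `lvs_forward` are hypothetical names, and the packet carries only the fields the table mentions.

```c
#include <assert.h>
#include <stdint.h>

enum lvs_mode { LVS_NAT, LVS_DR, LVS_TUN, LVS_FULLNAT };

/* Only the header fields that the forwarding modes touch. */
struct pkt {
    uint64_t dst_mac;
    uint32_t src_ip, dst_ip;
    uint32_t outer_dst_ip;   /* 0 means no IP-in-IP outer header */
};

struct real_server {
    uint64_t mac;
    uint32_t ip;
};

/* Apply the Director's request-side rewrite for the given mode. */
struct pkt lvs_forward(struct pkt p, enum lvs_mode mode,
                       struct real_server rs, uint32_t lb_local_ip)
{
    assert(rs.ip != 0);
    switch (mode) {
    case LVS_NAT:                 /* DNAT: VIP becomes the RS address */
        p.dst_ip = rs.ip;
        break;
    case LVS_DR:                  /* L2 rewrite only; dst_ip stays the VIP */
        p.dst_mac = rs.mac;
        break;
    case LVS_TUN:                 /* encapsulate; inner packet is untouched */
        p.outer_dst_ip = rs.ip;
        break;
    case LVS_FULLNAT:             /* rewrite both ends, hiding the client IP */
        p.src_ip = lb_local_ip;
        p.dst_ip = rs.ip;
        break;
    }
    return p;
}
```

Comparing the input and output for each mode shows why DR and TUN real servers must own the VIP (the destination they see is still the VIP) and why FULLNAT needs a side channel for the client address.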
Minimal C Demo - LVS Scheduling
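A minimal sketch of two common scheduling policies, round robin and weighted least-connection. The names `struct rs_entry`, `pick_rr`, and `pick_wlc` are hypothetical, and the least-connection cost uses only active connections, which simplifies what a production scheduler would track.

```c
#include <assert.h>
#include <stddef.h>

struct rs_entry {
    int weight;         /* 0 means drained: never selected */
    int active_conns;
};

/* Round robin over servers with non-zero weight; -1 if none is usable. */
int pick_rr(const struct rs_entry *s, size_t n, size_t *cursor)
{
    assert(cursor != NULL && (s != NULL || n == 0));
    for (size_t tried = 0; tried < n; tried++) {
        size_t i = (*cursor + tried) % n;
        if (s[i].weight > 0) {
            *cursor = (i + 1) % n;   /* next call starts after this server */
            return (int)i;
        }
    }
    return -1;
}

/* Weighted least-connection: minimize active_conns / weight.
 * Cross-multiplying avoids floating point. */
int pick_wlc(const struct rs_entry *s, size_t n)
{
    int best = -1;

    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (s[i].weight <= 0)
            continue;
        if (best < 0 ||
            (long)s[i].active_conns * s[best].weight <
            (long)s[best].active_conns * s[i].weight)
            best = (int)i;
    }
    return best;
}
```

Setting a server's weight to 0 is how this sketch models draining: existing connections are left alone, but the scheduler stops handing it new ones.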
3. 27.3 and 27.4 - L4/L7 Load Balancers
Maglev and Katran solve high-scale L4 distribution with stable flow hashing. Maglev uses a large lookup table so every LB instance makes the same backend choice; Katran pushes the fast path into XDP and BPF maps.
Minimal C Demo - Maglev Slot Stability
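A minimal sketch of Maglev-style table population, assuming a small prime table size and a hypothetical integer mixer (`mix`) in place of hashing backend names. `MAGLEV_M`, `maglev_populate`, and `moved_between_survivors` are illustrative names. Each backend walks its own permutation of slots and claims the next free one in turn, so every LB instance that runs the same code over the same backend list builds the same table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAGLEV_M 251        /* table size; must be prime */
#define MAGLEV_MAX_BE 16

/* Hypothetical integer mixer standing in for a hash of the backend name. */
static uint32_t mix(uint32_t x, uint32_t seed)
{
    x ^= seed;
    x *= 0x9E3779B1u;
    x ^= x >> 15;
    x *= 0x85EBCA77u;
    x ^= x >> 13;
    return x;
}

/* Fill table[] with backend indices; dead backends receive no slots. */
void maglev_populate(int table[MAGLEV_M], const bool alive[], int n)
{
    uint32_t offset[MAGLEV_MAX_BE], skip[MAGLEV_MAX_BE], next[MAGLEV_MAX_BE];
    int filled = 0, live = 0;

    assert(n > 0 && n <= MAGLEV_MAX_BE);
    for (int i = 0; i < n; i++) {
        offset[i] = mix((uint32_t)i, 0xA5A5u) % MAGLEV_M;
        skip[i] = mix((uint32_t)i, 0x5A5Au) % (MAGLEV_M - 1) + 1;
        next[i] = 0;
        live += alive[i];
    }
    for (int s = 0; s < MAGLEV_M; s++)
        table[s] = -1;
    if (live == 0)
        return;

    while (filled < MAGLEV_M) {
        for (int i = 0; i < n && filled < MAGLEV_M; i++) {
            uint32_t slot;
            if (!alive[i])
                continue;
            do {   /* advance along this backend's permutation to a free slot */
                slot = (offset[i] + next[i]++ * skip[i]) % MAGLEV_M;
            } while (table[slot] >= 0);
            table[slot] = i;
            filled++;
        }
    }
}

/* Slots that changed owner even though their old owner is still alive. */
int moved_between_survivors(const int before[MAGLEV_M],
                            const int after[MAGLEV_M], int removed)
{
    int moved = 0;
    for (int s = 0; s < MAGLEV_M; s++)
        if (before[s] != removed && before[s] != after[s])
            moved++;
    return moved;
}
```

Building the table before and after marking one backend dead, then calling `moved_between_survivors`, measures disruption directly. Slots of the removed backend must move; a small number of other slots may move as well, which is the trade-off this scheme accepts for an even load split.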
HAProxy and Envoy operate higher in the stack. HAProxy is direct and operationally compact: frontends, backends, ACLs, health checks, stick tables, and a runtime socket. Envoy is a programmable proxy with xDS, filter chains, clusters, retries, outlier detection, circuit breakers, and rich telemetry.
Minimal C Demo - HAProxy Stick Table
Minimal C Demo - Envoy Circuit Breaker
4. 27.5 - conntrackd
Active/backup firewalls and NAT load balancers need more than VIP movement. Existing TCP sessions also depend on Linux conntrack tuples, sequence state, and NAT mappings. conntrackd replicates that state before failure.
Minimal C Demo - Conntrack State Sync
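A minimal sketch of why replication has to happen before the failure. The primary pushes each flow into a cache held by the backup, and on takeover the backup commits that cache into its own live table. All names here (`struct ct_flow`, `struct ct_table`, `ct_upsert`, `ct_lookup`, `ct_commit`) are hypothetical, and the flow record is far smaller than a real conntrack entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CT_MAX 32

/* Simplified flow record: tuple, TCP state, and NAT mapping. */
struct ct_flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint32_t nat_src_ip;    /* address the flow is translated to */
    int tcp_established;    /* 1 once the handshake has completed */
};

struct ct_table {
    struct ct_flow flows[CT_MAX];
    size_t n;
};

static bool same_tuple(const struct ct_flow *a, const struct ct_flow *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port;
}

/* Insert or update a flow; returns false when the table is full. */
bool ct_upsert(struct ct_table *t, const struct ct_flow *f)
{
    assert(t != NULL && f != NULL);
    for (size_t i = 0; i < t->n; i++) {
        if (same_tuple(&t->flows[i], f)) {
            t->flows[i] = *f;
            return true;
        }
    }
    if (t->n == CT_MAX)
        return false;
    t->flows[t->n++] = *f;
    return true;
}

/* Find a flow by tuple, or return NULL. */
const struct ct_flow *ct_lookup(const struct ct_table *t,
                                const struct ct_flow *key)
{
    for (size_t i = 0; i < t->n; i++)
        if (same_tuple(&t->flows[i], key))
            return &t->flows[i];
    return NULL;
}

/* Takeover step: copy every cached flow into the backup's live table. */
size_t ct_commit(const struct ct_table *cache, struct ct_table *live)
{
    size_t committed = 0;
    for (size_t i = 0; i < cache->n; i++)
        committed += ct_upsert(live, &cache->flows[i]);
    return committed;
}
```

A flow that reached the cache survives takeover with its NAT mapping intact; a flow created just before the crash and never replicated is simply unknown to the backup, so that connection has to be re-established.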
5. 27.6 - Pacemaker / Corosync
Corosync supplies cluster messaging, membership, and quorum. Pacemaker decides where resources should run, calls resource agents to start or monitor them, and requires fencing before a survivor takes over shared state.
STONITH is not ceremony. If a two-node cluster loses communication, both nodes can believe they are the survivor. Fencing proves the old owner is dead before promoting a database, VIP, or shared filesystem elsewhere.
Minimal C Demo - Split-Brain Prevention
6. 27.7 - DNS Failover
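DNS failover is useful for regional or provider-level steering, but it is not instant. Recovery time is the sum of four delays: health-check detection, the authoritative record change, recursive resolver TTL expiry, and the application's next retry. With a 60-second TTL, for example, a client can keep using the failed address for about a minute after the record has already changed.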
| Technique | Best at | Weakness |
|---|---|---|
| DNS failover | Coarse regional steering and primary/secondary service records. | Bounded by caches and client retry behavior. |
| BGP anycast | Infrastructure-level nearest PoP selection and DDoS absorption. | Needs routing control and careful state handling. |
| L4 LB failover | Fast local service failover with health checks. | Stateful sessions need sync or reconnect logic. |
Minimal C Demo - DNS Recovery Timer
7. Core Mechanism Walkthrough
Background: A public HTTPS service uses a BGP anycast VIP. Inside each PoP, two keepalived LVS Directors front a pool of real servers in DR mode. Stateful NAT is not on the hot path, but firewall state exists on the edge pair.
Plan: Fail small first. Remove bad real servers with local health checks, promote the standby Director with conntrack state if the active node dies, and withdraw the BGP /32 only when the whole PoP is unhealthy.
| Failure | Detector | Recovery action | Blast radius |
|---|---|---|---|
| One real server fails | LVS or HAProxy health check. | Remove RS from scheduler. | Only flows to that RS reconnect or drain. |
| Active Director fails | VRRP or BFD via keepalived. | Backup owns VIP; conntrackd commits state if needed. | Local PoP only. |
| PoP service fails | External health checker. | Withdraw anycast /32 or change DNS answer. | Regional clients reroute. |
| Cluster split-brain risk | Corosync membership and quorum. | Fence old owner before promotion. | Protects shared state from dual ownership. |
8. Source and Tooling Pointers
- `ipvsadm -Ln --stats` shows VIPs, real servers, schedulers, and counters.
- `conntrack -L` and `conntrackd -s` expose replicated connection state.
- `crm_mon -1` summarizes Pacemaker resource placement, quorum, and failed actions.
- `show ip bgp 203.0.113.53/32` or BMP telemetry proves anycast advertisement state.
- `echo "show table"` through HAProxy's stats socket inspects stick tables and server health.
9. Interview Prep
Questions and concise answers
| Question | Answer |
|---|---|
| Why is LVS DR faster than NAT mode? | The Director handles only inbound traffic; real servers reply directly to clients. |
| Why does DR mode require ARP tuning? | Real servers hold the VIP on loopback but must not answer LAN ARP for it; otherwise neighbors may send VIP traffic straight to a real server and bypass the Director. |
| How does Maglev reduce disruption? | Surviving backends keep most of their lookup-table slots; the removed backend's slots are redistributed, and only a small fraction of the other slots change owner. |
| Why sync conntrack state? | Without replicated tuples and NAT state, standby takeover breaks existing stateful TCP flows. |
| Why is STONITH mandatory for serious Pacemaker designs? | It prevents two nodes from owning the same writable resource after a membership split. |
| Why is DNS failover slow? | Health checks, authoritative changes, resolver caches, client caches, and retry intervals all add delay. |