§ 16 Load Balancing, NAT Gateway, and Session Sync
Flow-stable backend selection, NAT tuple allocation, and HA state replication for fast-path network systems.
1. Overview
A production load balancer is a packet classifier, a flow-to-backend mapper, and often a NAT gateway with state that must survive failover. The hard part is keeping each flow stable while still distributing load, avoiding SNAT collisions, and replicating just enough session state for a backup node to continue forwarding.
2. Key Data Structures
Consistent hashing stores each backend as multiple virtual nodes (vnodes) in one sorted ring; a key walks clockwise to the first vnode whose hash point is greater than or equal to the key hash, and wraps to the lowest point when no such vnode exists.
| Field | Type / Size | Purpose |
|---|---|---|
| point | uint32_t / 4 bytes | Hash coordinate for this virtual node in the ring. |
| server_id | uint32_t / 4 bytes | Backend selected when a key lands on this vnode. |
| weight | uint16_t / 2 bytes | Controls how many vnodes are created for weighted capacity. |
| generation | uint32_t / 4 bytes | Lets workers swap to a new ring without mutating the old one in place. |
ECMP does not keep a central session table; each switch or router independently hashes the 5-tuple and picks one of the equal-cost next hops, so packets of one flow stay on one path as long as the next-hop set is unchanged.
| Algorithm | Pros | Cons | Best Use |
|---|---|---|---|
| Round robin | Simple and cheap. | Can split one client session across backends unless flow affinity is added. | Stateless request-level proxies. |
| Weighted round robin | Accounts for backend capacity. | Still needs health and session-affinity handling. | Mixed-size backend pools. |
| ECMP | Distributed and hardware-friendly. | Path changes can remap flows unless resilient hashing is used. | Leaf-spine fabrics and gateway clusters. |
| Consistent hash | Minimal remapping when a backend changes. | Needs virtual nodes and careful rebalancing for hot keys. | Stateful services, caches, and VPC load balancers. |
Conntrack stores two related tuples: the original direction observed before translation and the reply direction after SNAT.
| Field | Type / Size | Purpose |
|---|---|---|
| src_addr / dst_addr | __be32 / 4 bytes each for IPv4 | Endpoint addresses used for hash lookup and packet rewrite. |
| src_port / dst_port | __be16 / 2 bytes each | Transport ports; SNAT changes the source port to make the public tuple unique. |
| proto | u8 / 1 byte | TCP, UDP, ICMP, or another L4 protocol discriminator. |
| timeout | unsigned long / machine word | Expiration deadline used by garbage collection and HA sync. |
3. Core Mechanism
Consistent Hash Membership Change
Background: A VPC load balancer needs to add or remove backends without sending most existing flows to different machines.
Plan: 1. Represent each backend as many virtual nodes. 2. Sort vnodes by hash point. 3. Map each flow key to the first vnode clockwise. 4. On membership change, build a new ring and publish it by generation.
Example: If server D inserts a vnode at point 45, only keys in the arc after the previous vnode at 30, up to and including 45, move to D. They come from whichever backend owned the next vnode clockwise of 45 (C or B in this example); keys at 7, 56, 80, and 93 keep their old destination.
NAT Gateway Port Selection
Background: Thousands of private hosts can connect to the same remote address through one public IP, so the NAT must allocate a unique public source port for each live tuple.
Plan: 1. Hash the original 5-tuple into the ephemeral range. 2. Test whether the translated reply tuple already exists. 3. If it collides, scan candidate ports. 4. Insert conntrack before forwarding the packet.
Example: Host 10.0.1.7:51515 wants 8.8.8.8:53. The hash chooses 40002, but 40002 is in use, so the gateway tries 40003, then 40004, inserts that tuple, and rewrites the source to 203.0.113.5:40004.
NAT Gateway HA
A primary NAT gateway owns the virtual route and public address while a backup watches liveness and receives state deltas; failover works only if route ownership and conntrack state move together.
Session Synchronization
Background: A virtual switch or gateway loses active flows after failover unless the backup already has tuple, action, timeout, and forwarding metadata.
Plan: 1. Emit ADD messages before forwarding new sessions. 2. Batch UPDATE messages for counters and timeouts. 3. Send DEL when a session closes. 4. Promote the backup only after heartbeat timeout and route convergence.
Example: A new TCP flow is assigned to backend 10.2.0.9. The primary emits an ADD delta and forwards the flow, then later sends batched timeout updates. If the primary dies, the backup promotes with the same mapping and avoids resetting the flow.
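Delta sync is the normal steady-state mode because most entries in a large session table are cold, meaning they do not change between sync intervals, so sending only the changes keeps replication traffic small; full sync is useful after a backup restarts or detects a generation mismatch.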
Failover combines failure detection, state promotion, and fabric routing; any one of those steps can become the outage boundary.
4. Minimal C Demo
5. Kernel Source Pointers
| File | Symbols / Topic |
|---|---|
| net/netfilter/nf_nat_core.c | nf_nat_setup_info(), get_unique_tuple(), find_best_ips_proto() |
| net/netfilter/nf_conntrack_core.c | nf_conntrack_in(), __nf_conntrack_confirm() |
| include/net/netfilter/nf_conntrack_tuple.h | struct nf_conntrack_tuple |
| net/ipv4/fib_semantics.c and net/ipv4/route.c | Multipath route selection and flow hashing. |
| net/sched/act_ct.c | Traffic-control conntrack action used by virtual switching paths. |
6. Interview Prep
| Question | Concise Answer |
|---|---|
| What is a virtual node in consistent hashing? | A virtual node is one hash-ring point representing a real backend; many vnodes smooth distribution and encode backend weight. |
| Why does consistent hashing minimize rehashing? | Adding or removing a vnode changes only the adjacent ring arc, so most keys keep the same backend. |
| What is ECMP? | Equal-cost multipath routing chooses one of several equivalent next hops, usually by hashing the packet 5-tuple so packets in one flow stay ordered. |
| How does SNAT avoid port collisions? | It hashes to a preferred ephemeral port, checks whether the translated tuple exists in conntrack, scans alternatives on collision, then confirms the entry. |
| What is the conntrack tuple pair? | Conntrack stores the original direction and the reply direction; NAT changes the reply tuple so inbound packets map back to the private endpoint. |
| How does NAT gateway HA work? | The primary forwards traffic and syncs conntrack deltas; the backup detects failure, promotes, takes over the route or VIP, and uses the synced table. |
| Delta sync or full sync? | Use delta sync for normal operation because it sends only changes; use full sync after restart, drift detection, or generation mismatch. |