§ 8 Data Center and Cloud Fabrics
Spine-leaf Clos, BGP-only underlays, EVPN-VXLAN overlays, ACI, NSX-T, OVN, CNI networking, hyperscale fabrics, AWS VPCs, RoCE, and DCI.
1. Overview
Modern data centers keep the physical fabric simple and routed, then put tenant mobility, segmentation, policy, and service insertion in overlays. The leaf is where servers attach and overlays begin; the spine is only a fast ECMP transit layer.
| Component | Job | Design notes |
|---|---|---|
| Leaf | Server-facing gateway and VTEP | Anycast gateway, EVPN routes, BGP sessions to all spines |
| Spine | Transit-only ECMP layer | No host ports, no stretched VLANs, forwards underlay IP |
| VTEP | Overlay edge | Encapsulates tenant frames into VXLAN or Geneve over the routed fabric |
| EVPN Type 2 | MAC/IP reachability | Enables control-plane learning and ARP suppression |
| EVPN Type 3 | BUM membership | Tells VTEPs how to replicate broadcast, unknown unicast, and multicast |
| EVPN Type 5 | Prefix reachability | Carries routed subnet routes for L3VNI symmetric IRB |
| T0/T1 | NSX routing tiers | T0 connects physical network; T1 connects segments and services |
| ENI | AWS virtual NIC | Security group attachment point and private IP identity |
2. Clos and BGP Underlay
A two-tier Clos gives every leaf the same number of equal-cost paths to every other leaf. Oversubscription is calculated at the leaf: for example, 48 x 25G downlinks and 4 x 100G uplinks create a 3:1 ratio.
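The ratio above is plain arithmetic, and a small Python sketch makes it easy to re-check for other port counts. The function name and parameters are illustrative, not taken from any vendor tool.

```python
def oversubscription_ratio(downlinks, downlink_gbps, uplinks, uplink_gbps):
    """Server-facing capacity divided by spine-facing capacity for one leaf."""
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("a leaf needs at least one uplink with positive speed")
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


# 48 x 25G = 1200G toward servers, 4 x 100G = 400G toward spines.
ratio = oversubscription_ratio(48, 25, 4, 100)
print(f"{ratio:g}:1")  # prints "3:1"
```

Tripling the uplinks to 12 x 100G in the same function gives 1:1, a non-blocking leaf.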
RFC 7938-style eBGP fits this topology because each link is an explicit policy boundary, ECMP is native, and failures are withdrawn by BGP rather than flooded through an IGP area. eBGP unnumbered removes the operational cost of numbering every point-to-point link.
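To make the ECMP and withdrawal behavior concrete, here is a minimal Python sketch of how a leaf might map a flow onto one of its spine next hops, and what changes when BGP withdraws a spine. The CRC32 hash, the spine names, and the flow tuple are assumptions for illustration; real switches hash in hardware with vendor-specific fields and often use resilient hashing to limit flow remapping.

```python
import zlib


def pick_spine(flow, live_spines):
    """Hash a 5-tuple onto one of the next hops currently in the ECMP group."""
    if not live_spines:
        raise RuntimeError("no path: every spine next hop was withdrawn")
    key = "|".join(map(str, flow)).encode()
    return live_spines[zlib.crc32(key) % len(live_spines)]


spines = ["spine1", "spine2", "spine3", "spine4"]   # hypothetical names
flow = ("10.10.0.5", "10.20.0.9", 6, 49152, 443)    # src, dst, proto, sport, dport

before = pick_spine(flow, spines)

# BGP withdraws the routes learned via spine2, so the leaf removes it
# from the ECMP group and hashes over the remaining three next hops.
after = pick_spine(flow, [s for s in spines if s != "spine2"])
print(before, after)
```

The same flow always hashes to the same spine while the group is stable, which keeps packets in order. With simple modulo hashing, a change in group size can move flows that were not on the failed spine, which is the problem resilient hashing addresses.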
3. EVPN-VXLAN Overlay
EVPN is the BGP control plane and VXLAN is the data plane. Type 2 carries MAC/IP bindings, Type 3 advertises BUM replication membership, and Type 5 carries routed prefixes for L3VNI symmetric IRB.
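On the data-plane side, the VXLAN header defined in RFC 7348 is only 8 bytes: a flags byte with the I bit set, 24 reserved bits, a 24-bit VNI, and 8 more reserved bits, carried over UDP destination port 4789. The Python sketch below packs and parses that header; the function names and the VNI value are illustrative.

```python
import struct

VXLAN_UDP_PORT = 4789          # IANA-assigned destination port (RFC 7348)
VXLAN_FLAG_VNI_VALID = 0x08    # the "I" flag


def pack_vxlan_header(vni):
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, then 24 reserved bits.
    # Word 2: VNI in the top 24 bits, then 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header):
    """Return the VNI from a VXLAN header, checking the I flag first."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("I flag not set, so the VNI field is not valid")
    return word2 >> 8


hdr = pack_vxlan_header(50010)   # hypothetical VNI
print(hdr.hex())                 # prints "0800000000c35a00"
print(unpack_vni(hdr))           # prints 50010
```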
Symmetric IRB keeps tenant routing distributed. The ingress leaf routes into an L3VNI, the underlay only forwards an outer IP packet, and the egress leaf routes back out to its local VNI.
4. ACI and NSX-T
Cisco ACI exposes a policy model: tenants own VRFs, bridge domains, EPGs, and contracts. APIC programs the Nexus fabric so operators describe who may talk to whom instead of hand-writing ACLs on every interface.
The physical ACI fabric is still a spine-leaf VXLAN fabric. APIC clusters push policy southbound, leaf switches enforce endpoint policy, and L3Out connects the fabric to traditional routing domains.
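The "who may talk to whom" idea can be sketched as a lookup over contracts rather than per-interface ACLs. This Python model is a deliberate simplification: the EPG names, contract names, and ports are hypothetical, only consumer-initiated TCP is modeled, and it assumes the usual defaults of permit within an EPG and deny between EPGs unless a contract exists. It is not the APIC object model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    name: str
    consumer_epg: str      # side that initiates connections
    provider_epg: str      # side that offers the service
    tcp_ports: frozenset


def allowed(contracts, src_epg, dst_epg, tcp_port):
    """Return True if policy permits src_epg to open tcp_port on dst_epg."""
    if src_epg == dst_epg:
        return True        # assumed default: intra-EPG traffic is permitted
    return any(
        c.consumer_epg == src_epg
        and c.provider_epg == dst_epg
        and tcp_port in c.tcp_ports
        for c in contracts
    )


contracts = [
    Contract("web-to-app", "web", "app", frozenset({8080})),
    Contract("app-to-db", "app", "db", frozenset({3306})),
]

print(allowed(contracts, "web", "app", 8080))  # True: covered by web-to-app
print(allowed(contracts, "web", "db", 3306))   # False: no contract links web and db
```

The point of the model is that adding a server to the `web` EPG changes no rules; the endpoint inherits whatever the EPG's contracts already allow.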
NSX-T moves the overlay and firewall into the hypervisor. T0 gateways peer northbound with the physical network, T1 gateways connect segments and services, and the distributed firewall enforces east-west policy at the VM vNIC, before traffic leaves the host, so flows do not have to hairpin through a central firewall.
5. OVS, OVN, and CNI
OVS is the programmable switch on the host; OVN is the control plane that compiles logical switches, routers, ACLs, and port bindings into OpenFlow tables on each node.
Kubernetes CNI choices decide whether pod traffic is routed directly, tunneled, or accelerated through eBPF. Calico can advertise pod CIDRs to the leaf fabric via BGP, while Cilium attaches policy, load balancing, and observability at TC/XDP hooks.
Cilium replaces large iptables chains with BPF maps and programs. The same hook can decide policy, translate a Service VIP, and emit Hubble flow events for debugging.
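The scaling difference between a rule chain and a map can be shown with a toy Python comparison. This is only an analogy: the addresses are made up, a Python dict stands in for a hash-type BPF map, and nothing here reflects Cilium's actual map layout or datapath code.

```python
def chain_lookup(rules, vip, port):
    """Walk rules in order until one matches; cost grows with the rule count."""
    for checked, (rule_vip, rule_port, backends) in enumerate(rules, start=1):
        if (rule_vip, rule_port) == (vip, port):
            return backends, checked
    return None, len(rules)


def map_lookup(service_map, vip, port):
    """One keyed lookup, independent of how many services exist."""
    return service_map.get((vip, port))


# 1000 hypothetical Service VIPs, each with two pod backends.
rules = [
    (f"10.96.{i // 250}.{i % 250 + 1}", 443,
     [f"10.244.{i % 200}.10", f"10.244.{i % 200}.11"])
    for i in range(1000)
]
service_map = {(vip, port): backends for vip, port, backends in rules}

backends, checked = chain_lookup(rules, "10.96.3.250", 443)
print(checked)                                            # prints 1000
print(map_lookup(service_map, "10.96.3.250", 443) == backends)  # prints True
```

The last service in the chain costs 1000 comparisons, while the keyed lookup costs the same for any service.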
6. Hyperscale and AWS VPC
Hyperscalers converged on merchant silicon, disaggregated network operating systems, BGP-only underlays, automation-first control planes, and massive ECMP fabrics. Facebook's F16, Google's Jupiter, and Microsoft's SONiC-based Azure fabrics differ in implementation, but the design direction is the same.
AWS VPC hides the physical Clos behind a regional virtual L3 network. ENIs are the identity and policy attachment point, the mapping service locates private IPs on physical hosts, Hyperplane scales load-balanced services, GWLB inserts appliances with GENEVE, and TGW provides transitive routing.
7. RoCE and DCI
RoCE fabrics need loss avoidance because the fast hardware path is not designed around normal TCP-style retransmission. PFC pauses a single priority class before a queue overflows, but headroom must absorb every bit already in flight after the pause is sent.
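A rough sense of how much headroom "every bit already in flight" implies comes from a back-of-the-envelope calculation. The Python sketch below is a simplified model under stated assumptions: about 5 ns/m propagation delay, a jumbo MTU of 9216 bytes, one maximum-size frame in progress at each end, and an optional lump-sum response delay. Vendor sizing formulas include further terms, so treat the result as an order-of-magnitude estimate, not a configuration value.

```python
def pfc_headroom_bytes(link_gbps, cable_m, mtu_bytes=9216,
                       response_ns=0.0, ns_per_m=5.0):
    """Estimate per-priority buffer needed after a PFC pause is sent."""
    bits_per_ns = link_gbps                  # 1 Gb/s is exactly 1 bit per ns
    round_trip_ns = 2 * cable_m * ns_per_m   # pause travels out, data keeps arriving
    in_flight_bits = bits_per_ns * (round_trip_ns + response_ns)
    # One frame may have just started at the sender when the pause arrives,
    # and one frame may delay the pause frame itself at the receiver.
    return in_flight_bits / 8 + 2 * mtu_bytes


# 100G link over a 100 m cable: 12,500 bytes in flight plus two 9216-byte frames.
print(pfc_headroom_bytes(100, 100))  # prints 30932.0
```

Headroom grows linearly with both link speed and cable length, which is why long links and higher speeds consume disproportionate buffer per lossless priority.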
ECN and DCQCN keep PFC from becoming the normal congestion-control tool. Switches mark packets before hard loss, receivers send CNPs, and senders cut rate before probing back upward.
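The sender side of that loop can be sketched in a few lines of Python. The update rules follow the published DCQCN description (cut the current rate by a factor of alpha/2 on a CNP, then recover halfway toward the remembered target rate), but this is a simplification: the alpha-decay timer and the rate-increase timer are merged into one call, the later additive and hyper increase stages are omitted, and the class and method names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    line_rate_gbps: float
    g: float = 1 / 256      # gain used to update alpha
    alpha: float = 1.0      # congestion estimate, starts pessimistic
    current: float = 0.0    # current sending rate
    target: float = 0.0     # rate to recover toward

    def __post_init__(self):
        self.current = self.target = self.line_rate_gbps

    def on_cnp(self):
        """Receiver reported ECN marks: remember the rate, then cut it."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_timer(self):
        """No CNP during the timer period: decay alpha and probe upward."""
        self.alpha *= 1 - self.g
        self.current = (self.target + self.current) / 2


sender = DcqcnSender(100.0)
sender.on_cnp()
print(sender.current)        # prints 50.0
sender.on_quiet_timer()
print(sender.current)        # prints 75.0
```

Because recovery halves the gap to the target each period, the rate approaches the pre-congestion value quickly without overshooting it.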
DCI connects fabrics across sites. EVPN-VXLAN over a routed or optical WAN is the modern approach; stretch L2 only when a workload truly requires it, because it enlarges broadcast and failure domains.
8. Core Mechanism Walkthrough
Background: Host B in subnet 10.20.0.0/24 sends to Host A in subnet 10.10.0.0/24 on another leaf. Traditional L2 flooding would scale poorly, so the fabric uses EVPN control-plane learning and symmetric IRB.
Plan: learn Host A as a Type 2 route, answer ARP locally from the ingress leaf's EVPN cache instead of flooding, select the destination prefix through Type 5, route into the L3VNI, then decapsulate and route out at the egress leaf.
| Step | Control or data plane | State change |
|---|---|---|
| 1 | EVPN Type 2 | Remote leaves learn Host A MAC/IP and VTEP without flooding. |
| 2 | ARP suppression | Ingress leaf answers Host B's ARP locally (its anycast gateway address, or a known target from the EVPN cache) instead of flooding. |
| 3 | EVPN Type 5 | Ingress leaf selects the egress VTEP for 10.10.0.0/24. |
| 4 | VXLAN L3VNI | Packet crosses the spine as an outer routed IP packet. |
| 5 | Egress IRB | Egress leaf routes from L3VNI to local VNI and forwards to Host A. |
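The five steps in the table can be traced with a small Python model of two leaves. Everything concrete here is an assumption for illustration: the VTEP addresses, the host address and MAC (taken from documentation ranges), the VNI numbers, and the `Leaf` class itself. It models table state only, not BGP sessions or real encapsulation, and it is separate from the C demos described in the next section.

```python
import ipaddress

L3VNI = 50001          # hypothetical VRF VNI
HOST_A_VNI = 10010     # hypothetical L2 VNI for 10.10.0.0/24


class Leaf:
    def __init__(self, vtep_ip):
        self.vtep_ip = vtep_ip
        self.mac_ip = {}     # Type 2 state: host IP -> (MAC, owning VTEP)
        self.prefixes = {}   # Type 5 state: network -> (egress VTEP, L3VNI)
        self.local = {}      # attached hosts: IP -> (MAC, local VNI)

    def learn_type2(self, ip, mac, vtep_ip):
        self.mac_ip[ip] = (mac, vtep_ip)

    def learn_type5(self, prefix, vtep_ip, l3vni):
        self.prefixes[ipaddress.ip_network(prefix)] = (vtep_ip, l3vni)

    def arp_reply(self, target_ip):
        """Answer from control-plane state; None means the leaf would flood."""
        entry = self.mac_ip.get(target_ip)
        return entry[0] if entry else None

    def route_and_encap(self, dst_ip):
        """Longest-prefix match in the VRF, then wrap in an outer IP header."""
        addr = ipaddress.ip_address(dst_ip)
        candidates = [net for net in self.prefixes if addr in net]
        if not candidates:
            return None
        best = max(candidates, key=lambda net: net.prefixlen)
        vtep_ip, l3vni = self.prefixes[best]
        return {"outer_src": self.vtep_ip, "outer_dst": vtep_ip,
                "vni": l3vni, "inner_dst": dst_ip}

    def decap_and_deliver(self, packet):
        """Egress IRB: strip the outer header and route to the local VNI."""
        if packet["outer_dst"] != self.vtep_ip:
            return None
        return self.local.get(packet["inner_dst"])


ingress = Leaf("192.0.2.12")
egress = Leaf("192.0.2.11")
egress.local["10.10.0.10"] = ("00:00:5e:00:53:0a", HOST_A_VNI)

ingress.learn_type2("10.10.0.10", "00:00:5e:00:53:0a", egress.vtep_ip)  # step 1
mac = ingress.arp_reply("10.10.0.10")                                    # step 2
ingress.learn_type5("10.10.0.0/24", egress.vtep_ip, L3VNI)              # step 3
packet = ingress.route_and_encap("10.10.0.10")                           # step 4
delivered = egress.decap_and_deliver(packet)                             # step 5
print(mac, packet, delivered)
```

Note that the spine never appears in the model: it only sees `outer_src` and `outer_dst`, which is what keeps the underlay free of tenant state.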
9. Minimal C Demo
The ECMP demo models a 4-leaf, 2-spine fabric. Try 0 0 6 for normal hashing, 1 0 6 for a failed spine, or 0 2 6 for a single failed leaf-to-spine link.
The EVPN demo traces the route type behind each overlay event. Try 1 for host learning, 2 for ARP suppression, and 3 for symmetric IRB.
10. Source Pointers
- RFC 7938: BGP routing in large-scale data centers.
- RFC 7432: BGP MPLS-Based Ethernet VPN, the foundation for EVPN route types.
- RFC 7348: VXLAN network virtualization overlay format.
- Open vSwitch and OVN manuals: OVSDB, ovn-northd, southbound flows, and logical pipelines.
- AWS VPC, Transit Gateway, Gateway Load Balancer, and Hyperplane architecture papers and docs.
11. Interview Prep
- Why use eBGP-only in a DC underlay? Every link is its own policy boundary, ECMP is native, and there is no IGP flooding or area design to manage.
- Which EVPN route suppresses ARP? Type 2 MAC/IP advertisement lets a leaf answer ARP locally from control-plane state.
- What does symmetric IRB change? Both ingress and egress leaves route through an L3VNI, so each leaf needs only its local VLANs plus the VRF VNI.
- How does ACI differ from VLAN plus ACL operations? ACI models tenants, EPGs, and contracts, then programs the fabric from policy intent.
- Why does RoCE need PFC and DCQCN? PFC prevents loss at queue overflow, while ECN/DCQCN controls congestion before pause frames dominate.