§ 3 StackWise, VSS, vPC, MLAG, and L3 ECMP
How two or more switches become one operational unit, why split-brain is the central failure mode, and why modern fabrics increasingly de-stack into routed ECMP.
1. § 3.1 — StackWise Ring
StackWise turns several Catalyst switches into one logical switch through a proprietary ring backplane. The active master owns configuration, protocols, and management; the standby is ready to take over, while member switches contribute ports and forwarding ASICs.
| Role | Selection | Operational Meaning |
|---|---|---|
| Active master | Highest priority, then tie-breakers such as lowest MAC address | Owns CLI, STP, routing protocols, and stack configuration |
| Standby | Next best candidate | Keeps enough state to take over during master failure |
| Member | Remaining switches | Provides local ports and hardware forwarding under master control |
2. § 3.2 — StackWise Virtual
StackWise Virtual keeps the single-switch management model but stretches it across two chassis with an SVL bundle. A separate dual-active detection path matters because the worst failure is not a dead chassis; it is two live chassis advertising the same identity.
- `stackwise-virtual domain` binds the pair into one logical system.
- SVL carries control synchronization and traffic that must cross between chassis.
- DAD should use an independent path so SVL failure can be distinguished from peer death.
3. § 3.3 — VSS
VSS on Catalyst 6500/6800 merges two chassis into one switch through the Virtual Switch Link. The active supervisor handles protocols; the standby is synchronized with SSO/NSF, while both chassis forward data and support multichassis EtherChannel downstream.
Dual-active detection can use enhanced PAgP, BFD, or a dedicated fast-hello link. Once dual-active is confirmed, the losing chassis (the previously active one) enters recovery mode and shuts down its non-VSL interfaces instead of letting duplicate gateway MAC/IP state corrupt the network.
4. § 3.4 — vPC
vPC lets two Nexus switches present one LACP system to a downstream server or access switch without merging both control planes. The peer-link carries VLAN traffic plus MAC/ARP synchronization; the keepalive is only a separate heartbeat used to avoid split-brain decisions.
The important interview case is peer-link failure. If the keepalive remains up, the primary keeps forwarding while the secondary suspends its vPC member ports, along with any of its orphan ports that are explicitly configured to be suspended with them. This prevents both peers from forwarding as independent owners of the same LAG.
Minimal C Demo — vPC Failure Walk-through
5. § 3.5 — MLAG and MC-LAG
MLAG is the same operational idea as vPC in vendor-specific form: two switches synchronize enough state to appear as one LAG endpoint. Peer-gateway avoids needless peer-link hairpinning by letting each switch locally route frames addressed to its peer's router MAC, not only the shared gateway MAC.
- Arista MLAG uses a peer-link and a local VLAN interface for peer adjacency.
- Juniper MC-LAG uses ICCP over TCP for inter-chassis coordination.
- ARP/MAC sync is what keeps failover from forcing every host to rediscover the gateway.
6. § 3.6 — De-stack into Pure L3 ECMP
The de-stack trend removes the peer-link as a shared fate point. Servers or ToR leaves use routed links and BGP ECMP, often with host routing through BIRD or GoBGP, so every uplink is active and a single leaf failure stays local to directly attached hosts.
Minimal C Demo — ECMP Path Selection
7. § 3.7 — Failure Domain Comparison
| Architecture | Failure Domain | Complexity | Bandwidth Efficiency | Convergence |
|---|---|---|---|---|
| StackWise | Ring/control failure affects the stack as one logical switch | Medium | Good through stack backplane | Fast with standby state |
| VSS / SVL | VSL/SVL failure can become dual-active | High | Good with local switching | Fast with SSO and DAD |
| vPC / MLAG | Peer-link failure suspends the secondary's member ports; the pair still shares L2 state | Medium | Good; both access links active | Fast with keepalive/BFD |
| Pure L3 ECMP | One leaf failure affects only attached hosts or routes | Low at L2, higher host/routing requirement | Excellent; all routed paths active | Fast with BGP and BFD |
8. § 3.8 — ISSU and GIR
ISSU (In-Service Software Upgrade) depends on redundant control planes and synchronized forwarding state: upgrade the standby side, switch over with SSO, then upgrade the old active side. GIR (Graceful Insertion and Removal) is the broader maintenance pattern: drain the node by gracefully withdrawing routes before touching it.
9. § 3.9 — Dual-Active Detection
Split-brain means the inter-switch control link failed while both halves are still alive. Detection methods such as enhanced PAgP, BFD, or a dedicated fast-hello link prove the peer still exists; recovery then isolates one side by shutting down its non-management ports.
Minimal C Demo — Dual-Active Recovery Choice
10. § 3.10 — Migration Stories
A clean migration keeps old and new designs side by side long enough to move one failure domain at a time. StackWise to SVL mainly changes cabling and chassis identity; SVL/vPC to L3 leaf-spine changes the operating model by replacing L2 adjacency with routed reachability.
11. Interview Prep
- What is the difference between vPC peer-link and keepalive? The peer-link carries VLAN traffic and state sync; keepalive is a separate heartbeat for peer-liveness decisions.
- What happens when a vPC peer-link fails but keepalive stays up? Primary keeps forwarding; secondary suspends its vPC member ports, plus any orphan ports explicitly configured to be suspended with them.
- Why is dual-active dangerous? Both halves advertise the same logical switch identity, creating duplicate MAC/IP ownership and blackholes.
- Why move to pure L3 ECMP? It removes STP/MLAG split-brain domains and uses all links, at the cost of host or application L3 awareness.
- What does ISSU require? SSO/NSF-capable redundancy, compatible images, synchronized forwarding state, and strict pre/post checks.