Part XV — Raft

§15.1–15.3 Raft Consensus, Snapshots, and 100K IOPS

Leader election, log replication, safety rules, snapshot recovery, and the batching and pipelining patterns behind practical Raft throughput.

1. Overview

Raft turns a cluster into one ordered state machine: the leader assigns every client command a log index, replicates that entry to a majority, and only then applies it. The hard parts are keeping elections safe, repairing divergent logs, bounding replay time with snapshots, and making the hot path batch enough work to reach high IOPS.

2. Key Data Structures

Raft is built from three wire/data records: log entries carry commands, AppendEntries both heartbeats and replicates data, and RequestVote enforces the up-to-date election rule.

RecordFieldTypePurpose
LogEntrytermuint64Leader epoch that created the entry; lets followers detect stale/conflicting history.
LogEntryindexuint64Monotonic position in the replicated log; used for commit and replay ordering.
LogEntrycommandbytesOpaque state-machine operation, such as a key-value write or session update.
AppendEntriesprevLogIndexuint64Index immediately before the new entries; follower must already have it.
AppendEntriesprevLogTermuint64Term at prevLogIndex; rejects divergent suffixes before overwriting them.
AppendEntriesentries[]LogEntry[]New entries; empty array is a heartbeat.
AppendEntriesleaderCommituint64Leader commit index so followers can apply durable entries.
RequestVotelastLogIndexuint64Candidate log length for the up-to-date check.
RequestVotelastLogTermuint64Candidate last term; compared before index for election safety.
InstallSnapshotlastIncludedIndexuint64Highest log entry represented by the snapshot.
InstallSnapshotoffset/data/doneuint64/bytes/boolChunked transfer fields for large snapshots.

3. Core Mechanism

§15.1 — Leader Election

A node is normally a follower; if it stops receiving valid leader heartbeats before its randomized election timeout, it becomes a candidate and asks the cluster for votes.

The election RPC is deliberately small: the candidate includes its term and last log position, and each voter grants at most one vote per term.

Log Replication and Commit

Background: a client write must survive leader failure without letting two different commands occupy the same committed log slot.

Plan: the leader appends locally, sends AppendEntries with the previous slot proof, waits for a majority, advances commitIndex, applies the command, and replies to the client.

Example: assume the leader has committed through index 41 and receives set x=7. It appends entry 42 term 9, followers accept only if their index 41 term matches, and the leader commits 42 after any two followers in a five-node cluster acknowledge it.

Election Restriction

A voter grants a vote only when the candidate log is at least as up-to-date as its own log; term wins first, then index breaks ties. This prevents a leader that lacks a committed entry from being elected.

§15.2 — Snapshots, Freeze, and Replay

Background: an always-growing log makes restart and follower catch-up proportional to the lifetime write volume, so production Raft periodically snapshots the state machine and truncates old entries.

Plan: freeze the applied boundary, serialize state, persist snapshot metadata, truncate covered log entries, then unfreeze writes and send InstallSnapshot to followers that are too far behind.

Crash recovery restores the latest snapshot first, then replays only the suffix after lastIncludedIndex, which turns replay from millions of entries into the recent tail.

Single-server membership changes keep quorum math simple: add one node, catch it up, promote it, and only then remove another node if needed.

InstallSnapshot is chunked so a leader can transfer a large state image without requiring one enormous RPC buffer.

§15.3 — Performance Optimizations

Without pipelining, a leader pays one network RTT per AppendEntries batch and leaves the follower and transport idle between acknowledgements.

With pipelining, the leader tracks nextIndex and matchIndex per follower and sends later batches before earlier ACKs return.

Batching accumulates several client commands into one log append and one AppendEntries round, amortizing serialization, TCP packetization, and fsync overhead.

OptimizationPseudocode shapeWhy it helps
Pipeliningwhile inflight < window: send AppendEntries(nextIndex[f])Follower work, disk flushes, and network RTT overlap instead of serializing per request.
Batchingbuffer append; flush when size == N or timer expiresOne quorum round commits multiple commands, which is essential for 100K IOPS class throughput.

4. Minimal C Demo

The first demo models the core AppendEntries consistency check: a follower accepts a leader batch only when the previous index and term match, then deletes any conflicting suffix.

Raft AppendEntries Conflict Repair — C Demo
stdin (optional)

The second demo shows why batching and pipelining move throughput: fewer quorum rounds and overlapping waits reduce time per command.

Raft Batching and Pipelining Cost — C Demo
stdin (optional)

5. Kernel Source Pointers

Raft is not a Linux kernel subsystem, so the useful source pointers are production consensus libraries that mirror the mechanisms above.

ProjectFiles / functionsWhat to inspect
etcd/raftraft.go: campaign, Step, becomeLeader, handleAppendEntriesCompact reference implementation of leader election, log replication, and membership changes.
etcd/raftlog.go and log_unstable.goShows committed/applied indexes, unstable entries, snapshots, and conflict repair.
HashiCorp Raftraft.go and replication.goProduction-oriented replication loops, batching, snapshot install, and transport abstractions.
TiKV raft-rssrc/raft.rs and src/raft_log.rsRust implementation used in a high-throughput storage system.

6. Interview Prep

QuestionConcise answer
Walk through leader election.A follower times out, increments term, votes for itself, sends RequestVote, and becomes leader only after a majority grants votes. Randomized timeouts reduce split votes.
When is an entry committed?A leader commits an entry from its current term after that entry is stored on a majority. It then advances commitIndex and applies entries in order.
Why does the election restriction matter?A voter rejects candidates whose logs are less up-to-date, so a newly elected leader cannot be missing an entry that was already committed by a previous majority.
What is log compaction?The state machine is snapshotted at an applied index, older log entries covered by the snapshot are truncated, and lagging followers receive InstallSnapshot.
How does crash recovery work?Reload durable term and vote, load the latest snapshot, replay the remaining log suffix, then rejoin as a follower until a valid leader contacts it.
Why do pipelining and batching help 100K IOPS?Batching amortizes RPC and fsync cost across many commands, while pipelining keeps multiple AppendEntries requests in flight so throughput is limited by bandwidth and disk, not one RTT per command.