Tech Notes

1. Overview

Raft turns a cluster into one ordered state machine: the leader assigns every client command a log index, replicates that entry to a majority, and only then applies it. The hard parts are keeping elections safe, repairing divergent logs, bounding replay time with snapshots, and making the hot path batch enough work to reach high IOPS.

2. Key Data Structures

Raft is built from four wire/data records: log entries carry commands, AppendEntries serves as both heartbeat and replication message, RequestVote enforces the up-to-date election rule, and InstallSnapshot ships compacted state to followers that have fallen too far behind.

Record	Field	Type	Purpose
`LogEntry`	`term`	uint64	Leader epoch that created the entry; lets followers detect stale/conflicting history.
`LogEntry`	`index`	uint64	Monotonic position in the replicated log; used for commit and replay ordering.
`LogEntry`	`command`	bytes	Opaque state-machine operation, such as a key-value write or session update.
`AppendEntries`	`prevLogIndex`	uint64	Index immediately before the new entries; follower must already have it.
`AppendEntries`	`prevLogTerm`	uint64	Term of the entry at prevLogIndex; on a mismatch the follower rejects the request, so divergent history is detected before anything is overwritten.
`AppendEntries`	`entries[]`	LogEntry[]	New entries; empty array is a heartbeat.
`AppendEntries`	`leaderCommit`	uint64	Leader's commit index, so followers know which entries are committed and safe to apply.
`RequestVote`	`lastLogIndex`	uint64	Index of the candidate's last log entry, used in the up-to-date check.
`RequestVote`	`lastLogTerm`	uint64	Term of the candidate's last log entry; compared before index for election safety.
`InstallSnapshot`	`lastIncludedIndex`	uint64	Highest log entry represented by the snapshot.
`InstallSnapshot`	`offset/data/done`	uint64/bytes/bool	Chunked transfer fields for large snapshots.
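
As a rough illustration, the records in the table could be declared in C as shown here. The struct layouts, the explicit length fields, the `term` field on each RPC, and the helper `is_heartbeat` are assumptions made for this sketch and are not a wire format defined by these notes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t term;            /* leader epoch that created the entry */
    uint64_t index;           /* position in the replicated log */
    const uint8_t *command;   /* opaque state-machine operation */
    size_t command_len;       /* assumption: explicit length for the bytes */
} LogEntry;

typedef struct {
    uint64_t term;            /* assumption: sender's current term */
    uint64_t prevLogIndex;    /* index immediately before the new entries */
    uint64_t prevLogTerm;     /* term of the entry at prevLogIndex */
    const LogEntry *entries;  /* new entries; none for a heartbeat */
    size_t n_entries;         /* assumption: explicit array length */
    uint64_t leaderCommit;    /* leader's commit index */
} AppendEntries;

typedef struct {
    uint64_t term;            /* candidate's term */
    uint64_t lastLogIndex;    /* index of the candidate's last entry */
    uint64_t lastLogTerm;     /* term of the candidate's last entry */
} RequestVote;

typedef struct {
    uint64_t lastIncludedIndex; /* highest entry covered by the snapshot */
    uint64_t offset;            /* byte offset of this chunk */
    const uint8_t *data;        /* chunk payload */
    size_t data_len;            /* assumption: explicit chunk length */
    bool done;                  /* true on the final chunk */
} InstallSnapshot;

/* An AppendEntries message with no entries acts as a heartbeat. */
static bool is_heartbeat(const AppendEntries *ae) {
    return ae->n_entries == 0;
}
```

A real implementation would also need a serialization layer; these structs only describe the in-memory shape.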

3. Core Mechanism

§15.1 — Leader Election

A node is normally a follower; if it stops receiving valid leader heartbeats before its randomized election timeout, it becomes a candidate and asks the cluster for votes.
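
The timeout check can be sketched in a few lines of C. The function names and the millisecond clock are hypothetical, and the timeout range is a tuning parameter the caller supplies; `rand()` is used only to keep the sketch short.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Pick a timeout uniformly-ish in [min_ms, max_ms]; requires min_ms <= max_ms.
 * Randomizing per node makes simultaneous candidacies (split votes) unlikely. */
static uint32_t random_election_timeout(uint32_t min_ms, uint32_t max_ms) {
    return min_ms + (uint32_t)rand() % (max_ms - min_ms + 1);
}

/* A follower should become a candidate once no valid heartbeat has arrived
 * for at least timeout_ms. Assumes now_ms >= last_heartbeat_ms. */
static bool election_due(uint64_t now_ms, uint64_t last_heartbeat_ms,
                         uint32_t timeout_ms) {
    return now_ms - last_heartbeat_ms >= timeout_ms;
}
```

A follower would reset `last_heartbeat_ms` on every valid AppendEntries and pick a fresh timeout each time it starts waiting.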

The election RPC is deliberately small: the candidate includes its term and last log position, and each voter grants at most one vote per term.

Log Replication and Commit

Background: a client write must survive leader failure without letting two different commands occupy the same committed log slot.

Plan: the leader appends locally, sends AppendEntries with the previous slot proof, waits for a majority, advances commitIndex, applies the command, and replies to the client.

Example: assume the leader has committed through index 41 and receives set x=7. It appends entry 42 in term 9, followers accept it only if the term of their entry at index 41 matches the leader's, and the leader commits 42 once any two followers in a five-node cluster acknowledge it, because the leader plus two followers form a majority of three.
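
The commit step in this example can be sketched as follows, assuming the leader keeps one replicated index per node with its own last index included. The names `majority_match_index` and `advance_commit` and the `MAX_NODES` bound are hypothetical. The current-term check comes from the Raft commit rule and is not stated in the text above: a leader only counts replicas for entries from its own term.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 16  /* assumption: small, fixed upper bound on cluster size */

/* qsort comparator: descending order of uint64_t values. */
static int cmp_desc(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x < y) - (x > y);
}

/* Highest index stored on a majority of nodes. match[] holds one value per
 * node, including the leader's own last index. Returns 0 on bad input. */
static uint64_t majority_match_index(const uint64_t *match, size_t n) {
    uint64_t sorted[MAX_NODES];
    if (n == 0 || n > MAX_NODES) return 0;
    memcpy(sorted, match, n * sizeof sorted[0]);
    qsort(sorted, n, sizeof sorted[0], cmp_desc);
    return sorted[n / 2];  /* the (n/2 + 1)-th largest value */
}

/* Advance commitIndex only for an entry created in the leader's current term. */
static uint64_t advance_commit(uint64_t commitIndex, uint64_t majorityIndex,
                               uint64_t termAtMajorityIndex, uint64_t currentTerm) {
    if (majorityIndex > commitIndex && termAtMajorityIndex == currentTerm)
        return majorityIndex;
    return commitIndex;
}
```

With the leader and two followers at index 42 and two followers still at 41, `majority_match_index` yields 42, and `advance_commit(41, 42, 9, 9)` moves the commit index to 42.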

Election Restriction

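A voter grants a vote only when the candidate's log is at least as up-to-date as its own: the log with the higher last term wins, and if the last terms are equal, the log with the higher last index wins. Because every committed entry is stored on a majority, and a candidate needs votes from a majority, this prevents a candidate that lacks a committed entry from being elected.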
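
A minimal sketch of the voter side, combining the one-vote-per-term rule with the up-to-date check. The `Voter` struct, the integer candidate ids, and the function names are assumptions for illustration; persisting `currentTerm` and `votedFor` before replying is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_VOTE (-1)

typedef struct {
    uint64_t currentTerm;
    int votedFor;           /* candidate id voted for in currentTerm, or NO_VOTE */
    uint64_t lastLogIndex;  /* index of this voter's last log entry */
    uint64_t lastLogTerm;   /* term of this voter's last log entry */
} Voter;

/* Up-to-date rule: compare last terms first, then last indexes. */
static bool candidate_up_to_date(const Voter *v, uint64_t candLastTerm,
                                 uint64_t candLastIndex) {
    if (candLastTerm != v->lastLogTerm)
        return candLastTerm > v->lastLogTerm;
    return candLastIndex >= v->lastLogIndex;
}

/* Returns true when the vote is granted. */
static bool handle_request_vote(Voter *v, uint64_t term, int candidateId,
                                uint64_t candLastIndex, uint64_t candLastTerm) {
    if (term < v->currentTerm) return false;        /* stale candidate */
    if (term > v->currentTerm) {                    /* newer term: forget old vote */
        v->currentTerm = term;
        v->votedFor = NO_VOTE;
    }
    if (v->votedFor != NO_VOTE && v->votedFor != candidateId)
        return false;                               /* already voted this term */
    if (!candidate_up_to_date(v, candLastTerm, candLastIndex))
        return false;                               /* candidate's log is behind */
    v->votedFor = candidateId;
    return true;
}
```

Repeating a request from the candidate that already holds the vote returns true again, which keeps retried RPCs harmless.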

§15.2 — Snapshots, Freeze, and Replay

Background: an always-growing log makes restart and follower catch-up proportional to the lifetime write volume, so production Raft periodically snapshots the state machine and truncates old entries.

Plan: freeze the applied boundary, serialize state, persist snapshot metadata, truncate covered log entries, then unfreeze writes and send InstallSnapshot to followers that are too far behind.

Crash recovery restores the latest snapshot first, then replays only the suffix after lastIncludedIndex, which turns replay from millions of entries into the recent tail.
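
The recovery order can be illustrated with a toy state machine whose state is a single integer and whose commands add a delta. The `Entry` and `Snapshot` types and the `recover` function are hypothetical, and the sketch assumes the commit index is already known to the recovering node and that the log is sorted by index.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t index; int64_t delta; } Entry;  /* toy command: add delta */
typedef struct { uint64_t lastIncludedIndex; int64_t value; } Snapshot;

/* Restore the snapshot, then replay only committed entries after it.
 * Stores the number of replayed entries in *replayed. */
static int64_t recover(const Snapshot *snap, const Entry *log, size_t n,
                       uint64_t commitIndex, size_t *replayed) {
    int64_t state = snap->value;                 /* step 1: load the snapshot */
    *replayed = 0;
    for (size_t i = 0; i < n; i++) {
        if (log[i].index <= snap->lastIncludedIndex)
            continue;                            /* already covered by the snapshot */
        if (log[i].index > commitIndex)
            break;                               /* not committed yet: do not apply */
        state += log[i].delta;                   /* step 2: replay the suffix */
        (*replayed)++;
    }
    return state;
}
```

With a snapshot at index 40 and a commit index of 42, only entries 41 and 42 are replayed, regardless of how many entries the snapshot replaced.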

Single-server membership changes keep quorum math simple: add one node, catch it up, promote it, and only then remove another node if needed.

InstallSnapshot is chunked so a leader can transfer a large state image without requiring one enormous RPC buffer.
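
One way a follower might reassemble the chunks is sketched here, using the `offset`, `data`, and `done` fields from the table. The `SnapshotReceiver` type, the fixed `SNAP_CAP` buffer, and the policy of restarting on offset 0 are assumptions; a production receiver would stream to disk instead of a memory buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SNAP_CAP 1024  /* assumption: tiny in-memory buffer for the sketch */

typedef struct {
    uint8_t buf[SNAP_CAP];
    size_t len;       /* bytes received so far */
    bool complete;    /* set once the chunk with done == true arrives */
} SnapshotReceiver;

/* Accept a chunk only if it continues exactly where the last one ended.
 * A chunk at offset 0 restarts the transfer. Returns false on rejection. */
static bool receive_chunk(SnapshotReceiver *r, uint64_t offset,
                          const uint8_t *data, size_t n, bool done) {
    if (offset == 0) {            /* new transfer: discard partial state */
        r->len = 0;
        r->complete = false;
    }
    if (r->complete) return false;            /* transfer already finished */
    if (offset != r->len) return false;       /* gap or duplicate chunk */
    if (n > SNAP_CAP - r->len) return false;  /* would overflow the buffer */
    if (n > 0) memcpy(r->buf + r->len, data, n);
    r->len += n;
    r->complete = done;
    return true;
}
```

Rejecting out-of-order offsets lets the leader simply resend from the last acknowledged offset after a dropped chunk.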

§15.3 — Performance Optimizations

Without pipelining, a leader pays one network RTT per AppendEntries batch and leaves the follower and transport idle between acknowledgements.

With pipelining, the leader tracks nextIndex and matchIndex per follower and sends later batches before earlier ACKs return.

Batching accumulates several client commands into one log append and one AppendEntries round, amortizing serialization, TCP packetization, and fsync overhead.

Optimization	Pseudocode shape	Why it helps
Pipelining	`while inflight < window: send AppendEntries(nextIndex[f])`	Follower work, disk flushes, and network RTT overlap instead of serializing per request.
Batching	`buffer append; flush when size >= N or timer expires`	One quorum round commits multiple commands, which is essential for 100K-IOPS-class throughput.
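
The batching row can be fleshed out into a small flush policy. This is a sketch under assumptions: the `Batch` type and function names are hypothetical, only the command count is tracked (not the commands themselves), and the size and wait limits are caller-chosen tuning knobs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t count;        /* commands buffered since the last flush */
    uint64_t oldest_ms;  /* arrival time of the oldest buffered command */
} Batch;

/* Buffer one command; remember when the batch started filling. */
static void batch_add(Batch *b, uint64_t now_ms) {
    if (b->count == 0) b->oldest_ms = now_ms;
    b->count++;
}

/* Flush when the batch is full or the oldest command has waited long enough. */
static bool batch_should_flush(const Batch *b, uint64_t now_ms,
                               size_t max_cmds, uint64_t max_wait_ms) {
    if (b->count == 0) return false;
    return b->count >= max_cmds || now_ms - b->oldest_ms >= max_wait_ms;
}

/* Hand the batch to the log append / AppendEntries path and reset it.
 * Returns how many commands the flushed batch contained. */
static size_t batch_flush(Batch *b) {
    size_t n = b->count;
    b->count = 0;
    return n;
}
```

The timer bound caps the latency a lone command pays under light load, while the size bound keeps batches from growing without limit under heavy load.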

4. Minimal C Demo

The first demo models the core AppendEntries consistency check: a follower accepts a leader batch only when the previous index and term match, then deletes any conflicting suffix.

Raft AppendEntries Conflict Repair — C Demo
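
A minimal sketch of that consistency check, assuming a fixed-capacity in-memory log that stores only the term of each entry. `FollowerLog`, `LOG_CAP`, and `append_entries` are illustrative names; command payloads, persistence, and commit handling are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAP 64

typedef struct {
    uint64_t term[LOG_CAP + 1];  /* term[i] = term of entry i; slot 0 is a sentinel (term 0) */
    uint64_t last;               /* index of the last entry, 0 when the log is empty */
} FollowerLog;

/* Follower-side AppendEntries: reject unless the entry at prevLogIndex exists
 * and has term prevLogTerm; otherwise merge the batch, truncating the suffix
 * at the first conflicting entry. terms[i] is the term of entry prevLogIndex+1+i. */
static bool append_entries(FollowerLog *log, uint64_t prevLogIndex,
                           uint64_t prevLogTerm, const uint64_t *terms, size_t n) {
    if (prevLogIndex > log->last || log->term[prevLogIndex] != prevLogTerm)
        return false;                          /* missing or divergent history */
    if (n > LOG_CAP - prevLogIndex)
        return false;                          /* batch would exceed capacity */
    for (size_t i = 0; i < n; i++) {
        uint64_t idx = prevLogIndex + 1 + i;
        if (idx <= log->last && log->term[idx] == terms[i])
            continue;                          /* already stored: keep it */
        if (idx <= log->last)
            log->last = idx - 1;               /* conflict: drop idx and everything after */
        log->term[idx] = terms[i];             /* append the leader's entry */
        log->last = idx;
    }
    return true;
}
```

For a follower whose entries 1 to 4 have terms 1, 1, 2, 2, a request with `prevLogIndex = 2`, `prevLogTerm = 1`, and three term-3 entries replaces entries 3 and 4 and extends the log to index 5. A retried request whose entries already match does not truncate anything.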

The second demo shows why batching and pipelining move throughput: fewer quorum rounds and overlapping waits reduce time per command.

Raft Batching and Pipelining Cost — C Demo
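
A toy cost model for that comparison. It assumes every AppendEntries round costs exactly one RTT, that up to `window` rounds overlap perfectly, and that disk and CPU time are zero, so the numbers only show the shape of the effect. The function names are hypothetical; `batch`, `window`, and the returned time must be at least 1 where they are used as divisors.

```c
#include <stdint.h>

/* Integer division rounded up; b must be non-zero. */
static uint64_t ceil_div(uint64_t a, uint64_t b) {
    return (a + b - 1) / b;
}

/* Time to replicate `commands` commands when each round carries `batch`
 * commands and up to `window` rounds are in flight at once. */
static uint64_t replication_time_us(uint64_t commands, uint64_t batch,
                                    uint64_t window, uint64_t rtt_us) {
    uint64_t rounds = ceil_div(commands, batch);  /* quorum rounds needed */
    uint64_t waves = ceil_div(rounds, window);    /* groups of overlapping rounds */
    return waves * rtt_us;
}

/* Throughput implied by the model, in commands per second; time_us must be non-zero. */
static uint64_t commands_per_second(uint64_t commands, uint64_t time_us) {
    return commands * 1000000u / time_us;
}
```

For 1000 commands and a 500 microsecond RTT, the model gives 500000 microseconds with no batching or pipelining, 50000 with a batch size of 10, and 12500 with a batch size of 10 and a window of 4.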