18. Leader Election & Consensus
In distributed systems, often a single node must be responsible for a specific task — this node is the leader. Leader election is the process by which distributed nodes agree on which one is the leader. Consensus is the broader problem of getting distributed nodes to agree on a value.
Why Leader Election?
| Use Case | Why a Leader Is Needed |
|---|---|
| Database primary | Only one node should accept writes |
| Distributed locks | A coordinator manages lock ownership |
| Task scheduling | One node assigns work to avoid duplicates |
| Configuration management | One source of truth for cluster state |
| Log compaction | One node triggers cleanup operations |
| Resource allocation | One node makes allocation decisions |
Without a leader: Split-brain, duplicate processing, data conflicts, chaos.
Leader Election Challenges
| Challenge | Description |
|---|---|
| Network partitions | Some nodes can't communicate — both sides may elect a leader → split-brain |
| Node failures | Leader crashes — how quickly can a new leader be elected? |
| Byzantine faults | Nodes may behave maliciously or send conflicting messages |
| Clock skew | Nodes have different views of time |
| Liveness | The system must eventually elect a leader (not be stuck) |
| Safety | At most one leader at any time |
Leader Election Approaches
1. Bully Algorithm
The node with the highest ID becomes the leader.
Nodes: A(1), B(2), C(3), D(4), E(5)
E is leader. E crashes.
D detects E is down → D sends ELECTION to all higher IDs (E)
→ No response from E
→ D declares itself leader
If C initiated: C sends ELECTION to D, E
→ D responds "I'm higher, I'll take over"
→ D sends ELECTION to E
→ No response → D becomes leader
Pros: Simple.
Cons: Not partition-tolerant; not suitable for production distributed systems.
2. Ring Algorithm
Nodes form a logical ring. Election messages travel around the ring until the highest-priority node is determined.
Pros: Simple, deterministic.
Cons: Slow (must traverse entire ring); not fault-tolerant.
3. Consensus-Based (Production Standard)
Use a consensus algorithm to agree on a leader. This is the standard approach for production systems.
Consensus Algorithms
Paxos
The original consensus algorithm by Leslie Lamport (1989). Mathematically proven correct but notoriously difficult to implement and understand.
Roles:
| Role | Description |
|------|-------------|
| Proposer | Proposes a value |
| Acceptor | Votes on proposals |
| Learner | Learns the decided value |
Two phases:
Phase 1 (Prepare):
Proposer → [Prepare(n)] → Acceptors
Acceptors → [Promise(n, last_accepted)] → Proposer
Phase 2 (Accept):
Proposer → [Accept(n, value)] → Acceptors
Acceptors → [Accepted(n, value)] → Learners
Guarantee: If a majority of acceptors agree, the value is chosen. No two different values can be chosen.
Variants:
- Multi-Paxos: Optimized for multiple rounds with a stable leader.
- Fast Paxos: Reduces latency by allowing direct proposer-to-acceptor communication.
- Cheap Paxos: Uses fewer acceptors with auxiliary nodes.
Raft
Designed as an understandable alternative to Paxos by Diego Ongaro and John Ousterhout (2014). Equivalent to Multi-Paxos in guarantees.
States:
Every node is in one of three states:
FOLLOWER ──(timeout, no heartbeat)──→ CANDIDATE ──(wins election)──→ LEADER
↑ │ │
│ │ │
└──────────(discovers new leader)──────┘ │
└──────────(receives heartbeat)────────────────────────────────────┘
Leader Election in Raft:
1. Follower's election timeout expires (no heartbeat from leader)
2. Follower becomes CANDIDATE
3. Candidate increments its TERM and votes for itself
4. Candidate sends RequestVote to all other nodes
5. Each node votes for at most one candidate per term
6. If candidate receives majority votes → becomes LEADER
7. Leader sends periodic heartbeats to maintain authority
Term 1: Node A is leader
Term 1: Node A crashes
Term 2: Node C starts election
→ Node B votes for C ✓
→ Node D votes for C ✓
→ Node E votes for C ✓
→ C wins (3/5 majority) → C becomes leader for Term 2
Log Replication in Raft:
Client → Leader (append entry to log)
Leader → Followers (AppendEntries RPC)
Followers → Leader (ACK)
When majority ACK → Leader commits entry
Leader → Followers (commit notification)
Raft Guarantees:
| Property | Description |
|----------|-------------|
| Election safety | At most one leader per term |
| Leader append-only | Leader never overwrites or deletes its log entries |
| Log matching | If two logs have an entry with the same index and term, all preceding entries are identical |
| Leader completeness | If an entry is committed, it will be present in all future leaders' logs |
Raft is used by: etcd, CockroachDB, TiKV, Consul, RethinkDB.
Zab (ZooKeeper Atomic Broadcast)
The consensus protocol used by Apache ZooKeeper.
1. All writes go to the LEADER
2. Leader broadcasts proposal to FOLLOWERS
3. Followers ACK
4. If majority ACK → Leader sends COMMIT
5. Followers apply the committed change
Similar to Raft/Multi-Paxos but with atomic broadcast semantics.
Leader Election Services
Instead of implementing consensus yourself, use a distributed coordination service:
Apache ZooKeeper
[ZK Ensemble: ZK1, ZK2, ZK3 (quorum)]
Service A → Create ephemeral node /election/leader → Success → I'm the leader!
Service B → Create ephemeral node /election/leader → Fails → I'm a follower
Service C → Create ephemeral node /election/leader → Fails → I'm a follower
Service A crashes → Ephemeral node disappears → ZooKeeper notifies watchers
Service B → Creates /election/leader → Success → New leader!
etcd
etcd uses Raft consensus internally.
Service A → Create key with lease: /leader = "service-a" (lease TTL: 10s)
→ Refresh lease every 5s (heartbeat)
Service A crashes → Lease expires after 10s → Key deleted
Service B → Creates /leader = "service-b" → New leader
Consul
consul lock leader-election/my-service <command>
Uses session-based locking with TTL.
Split-Brain Prevention
Split-brain: Two nodes both believe they are the leader.
Network partition:
[Node A (leader)] ←─ PARTITION ─→ [Node B, Node C]
Node A: "I'm still leader" (stale)
Node B+C: Elect Node B as new leader
→ Two leaders! → Data inconsistency!
Solutions:
| Solution | Description |
|---|---|
| Quorum | Require majority agreement; minority partition can't elect a leader |
| Fencing tokens | Each leader gets an incrementing token; old token is rejected |
| Lease-based | Leadership expires if not renewed via heartbeat |
| STONITH | "Shoot The Other Node In The Head" — force-kill the old leader |
Fencing Tokens
Leader A gets token: 33
Leader A crashes, Leader B elected with token: 34
Leader A recovers, tries to write with token 33
Storage: "Token 33 < 34 → REJECTED" (fenced off)
Consensus in Practice
| System | Protocol | Use Case |
|---|---|---|
| etcd | Raft | Kubernetes control plane, service discovery |
| ZooKeeper | Zab | Kafka controller, HBase master election |
| Consul | Raft | Service discovery, distributed locks |
| CockroachDB | Raft | Distributed SQL consensus |
| TiKV | Raft | Distributed KV storage |
| Google Spanner | Paxos | Globally distributed database |
FLP Impossibility
Fischer, Lynch, Paterson (1985):
In an asynchronous distributed system, it is impossible to guarantee consensus if even one process can fail.
Practical implication: Consensus algorithms work by using timeouts (partial synchrony) to make progress, even though they can't guarantee termination in a purely asynchronous system.
Summary
| Concept | Key Point |
|---|---|
| Leader election | One node coordinates — prevents conflicts |
| Raft | Understandable consensus; used by etcd, CockroachDB |
| Paxos | Original consensus; mathematically proven, hard to implement |
| Quorum | Majority agreement prevents split-brain |
| Fencing tokens | Prevent stale leaders from writing |
| ZooKeeper/etcd/Consul | Use these instead of building your own consensus |
| FLP | Perfect async consensus is impossible; use timeouts |
Rule of thumb: Never implement your own consensus algorithm. Use etcd, ZooKeeper, or Consul for leader election and distributed coordination.