18. Leader Election & Consensus

In distributed systems, often a single node must be responsible for a specific task — this node is the leader. Leader election is the process by which distributed nodes agree on which one is the leader. Consensus is the broader problem of getting distributed nodes to agree on a value.


Why Leader Election?

Use Case Why a Leader Is Needed
Database primary Only one node should accept writes
Distributed locks A coordinator manages lock ownership
Task scheduling One node assigns work to avoid duplicates
Configuration management One source of truth for cluster state
Log compaction One node triggers cleanup operations
Resource allocation One node makes allocation decisions

Without a leader: Split-brain, duplicate processing, data conflicts, chaos.


Leader Election Challenges

Challenge Description
Network partitions Some nodes can't communicate — both sides may elect a leader → split-brain
Node failures Leader crashes — how quickly can a new leader be elected?
Byzantine faults Nodes may behave maliciously or send conflicting messages
Clock skew Nodes have different views of time
Liveness The system must eventually elect a leader (not be stuck)
Safety At most one leader at any time

Leader Election Approaches

1. Bully Algorithm

The node with the highest ID becomes the leader.

Nodes: A(1), B(2), C(3), D(4), E(5)

E is leader. E crashes.
D detects E is down → D sends ELECTION to all higher IDs (E)
  → No response from E
  → D declares itself leader

If C initiated: C sends ELECTION to D, E
  → D responds "I'm higher, I'll take over"
  → D sends ELECTION to E
  → No response → D becomes leader

Pros: Simple.
Cons: Not partition-tolerant; not suitable for production distributed systems.

2. Ring Algorithm

Nodes form a logical ring. Election messages travel around the ring until the highest-priority node is determined.

Pros: Simple, deterministic.
Cons: Slow (must traverse entire ring); not fault-tolerant.

3. Consensus-Based (Production Standard)

Use a consensus algorithm to agree on a leader. This is the standard approach for production systems.


Consensus Algorithms

Paxos

The original consensus algorithm by Leslie Lamport (1989). Mathematically proven correct but notoriously difficult to implement and understand.

Roles:
| Role | Description |
|------|-------------|
| Proposer | Proposes a value |
| Acceptor | Votes on proposals |
| Learner | Learns the decided value |

Two phases:

Phase 1 (Prepare):
Proposer → [Prepare(n)] → Acceptors
Acceptors → [Promise(n, last_accepted)] → Proposer

Phase 2 (Accept):
Proposer → [Accept(n, value)] → Acceptors
Acceptors → [Accepted(n, value)] → Learners

Guarantee: If a majority of acceptors agree, the value is chosen. No two different values can be chosen.

Variants:

  • Multi-Paxos: Optimized for multiple rounds with a stable leader.
  • Fast Paxos: Reduces latency by allowing direct proposer-to-acceptor communication.
  • Cheap Paxos: Uses fewer acceptors with auxiliary nodes.

Raft

Designed as an understandable alternative to Paxos by Diego Ongaro and John Ousterhout (2014). Equivalent to Multi-Paxos in guarantees.

States:

Every node is in one of three states:

 FOLLOWER ──(timeout, no heartbeat)──→ CANDIDATE ──(wins election)──→ LEADER
    ↑                                      │                           │
    │                                      │                           │
    └──────────(discovers new leader)──────┘                           │
    └──────────(receives heartbeat)────────────────────────────────────┘

Leader Election in Raft:

1. Follower's election timeout expires (no heartbeat from leader)
2. Follower becomes CANDIDATE
3. Candidate increments its TERM and votes for itself
4. Candidate sends RequestVote to all other nodes
5. Each node votes for at most one candidate per term
6. If candidate receives majority votes → becomes LEADER
7. Leader sends periodic heartbeats to maintain authority
Term 1: Node A is leader
Term 1: Node A crashes
Term 2: Node C starts election
         → Node B votes for C ✓
         → Node D votes for C ✓
         → Node E votes for C ✓
         → C wins (3/5 majority) → C becomes leader for Term 2

Log Replication in Raft:

Client → Leader (append entry to log)
Leader → Followers (AppendEntries RPC)
Followers → Leader (ACK)
When majority ACK → Leader commits entry
Leader → Followers (commit notification)

Raft Guarantees:
| Property | Description |
|----------|-------------|
| Election safety | At most one leader per term |
| Leader append-only | Leader never overwrites or deletes its log entries |
| Log matching | If two logs have an entry with the same index and term, all preceding entries are identical |
| Leader completeness | If an entry is committed, it will be present in all future leaders' logs |

Raft is used by: etcd, CockroachDB, TiKV, Consul, RethinkDB.


Zab (ZooKeeper Atomic Broadcast)

The consensus protocol used by Apache ZooKeeper.

1. All writes go to the LEADER
2. Leader broadcasts proposal to FOLLOWERS
3. Followers ACK
4. If majority ACK → Leader sends COMMIT
5. Followers apply the committed change

Similar to Raft/Multi-Paxos but with atomic broadcast semantics.


Leader Election Services

Instead of implementing consensus yourself, use a distributed coordination service:

Apache ZooKeeper

[ZK Ensemble: ZK1, ZK2, ZK3 (quorum)]

Service A → Create ephemeral node /election/leader → Success → I'm the leader!
Service B → Create ephemeral node /election/leader → Fails → I'm a follower
Service C → Create ephemeral node /election/leader → Fails → I'm a follower

Service A crashes → Ephemeral node disappears → ZooKeeper notifies watchers
Service B → Creates /election/leader → Success → New leader!

etcd

etcd uses Raft consensus internally.

Service A → Create key with lease: /leader = "service-a" (lease TTL: 10s)
          → Refresh lease every 5s (heartbeat)

Service A crashes → Lease expires after 10s → Key deleted
Service B → Creates /leader = "service-b" → New leader

Consul

consul lock leader-election/my-service <command>

Uses session-based locking with TTL.


Split-Brain Prevention

Split-brain: Two nodes both believe they are the leader.

Network partition:
[Node A (leader)] ←─ PARTITION ─→ [Node B, Node C]

Node A: "I'm still leader" (stale)
Node B+C: Elect Node B as new leader
→ Two leaders! → Data inconsistency!

Solutions:

Solution Description
Quorum Require majority agreement; minority partition can't elect a leader
Fencing tokens Each leader gets an incrementing token; old token is rejected
Lease-based Leadership expires if not renewed via heartbeat
STONITH "Shoot The Other Node In The Head" — force-kill the old leader

Fencing Tokens

Leader A gets token: 33
Leader A crashes, Leader B elected with token: 34

Leader A recovers, tries to write with token 33
Storage: "Token 33 < 34 → REJECTED" (fenced off)

Consensus in Practice

System Protocol Use Case
etcd Raft Kubernetes control plane, service discovery
ZooKeeper Zab Kafka controller, HBase master election
Consul Raft Service discovery, distributed locks
CockroachDB Raft Distributed SQL consensus
TiKV Raft Distributed KV storage
Google Spanner Paxos Globally distributed database

FLP Impossibility

Fischer, Lynch, Paterson (1985):

In an asynchronous distributed system, it is impossible to guarantee consensus if even one process can fail.

Practical implication: Consensus algorithms work by using timeouts (partial synchrony) to make progress, even though they can't guarantee termination in a purely asynchronous system.


Summary

Concept Key Point
Leader election One node coordinates — prevents conflicts
Raft Understandable consensus; used by etcd, CockroachDB
Paxos Original consensus; mathematically proven, hard to implement
Quorum Majority agreement prevents split-brain
Fencing tokens Prevent stale leaders from writing
ZooKeeper/etcd/Consul Use these instead of building your own consensus
FLP Perfect async consensus is impossible; use timeouts

Rule of thumb: Never implement your own consensus algorithm. Use etcd, ZooKeeper, or Consul for leader election and distributed coordination.