13. Replication & Redundancy

Replication is the process of maintaining multiple copies of data across different machines or locations. Redundancy means having backup systems ready to take over if a primary system fails. Both are essential for high availability, fault tolerance, and performance.

Why Replicate?

| Goal | How Replication Helps |
|------|-----------------------|
| High availability | If one node fails, others serve traffic |
| Fault tolerance | Data survives hardware failures |
| Read scalability | Multiple replicas handle read queries in parallel |
| Geographic performance | Replicas close to users reduce latency |
| Disaster recovery | Data accessible even if an entire datacenter goes down |

Replication Topologies

1. Single Leader (Primary-Replica / Master-Slave)

One node (leader/primary) handles all writes. Replicas (followers) receive copies of the data and serve reads.

Writes ──→ [Primary/Leader] ──replication──→ [Replica 1] ←── Reads
                             ──replication──→ [Replica 2] ←── Reads
                             ──replication──→ [Replica 3] ←── Reads

Advantages:

Simple mental model — one source of truth.
No write conflicts.
Read scalability by adding replicas.

Disadvantages:

Primary is a write bottleneck.
Primary failure requires failover (leader election).
Replication lag → eventual consistency for reads.

Failover process:

1. Detect primary failure (timeout / health check)
2. Choose a replica to promote (most up-to-date)
3. Reconfigure other replicas to follow new primary
4. Update clients/load balancers to point to new primary
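
The promotion step can be sketched in a few lines. This is a minimal illustration, not how any particular database implements failover: the `Replica` class, its `last_applied_lsn` field, and the function names are hypothetical, and failure detection (step 1) is assumed to have already happened.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    healthy: bool
    last_applied_lsn: int  # hypothetical: how far this replica has applied the log


def choose_new_primary(replicas):
    """Step 2: pick the healthy replica that has applied the most of the log."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available for promotion")
    return max(candidates, key=lambda r: r.last_applied_lsn)


def failover(replicas):
    """Steps 2-3: return the new primary and the followers to re-point at it."""
    new_primary = choose_new_primary(replicas)
    followers = [r for r in replicas if r.healthy and r is not new_primary]
    return new_primary, followers


replicas = [
    Replica("replica-1", healthy=True, last_applied_lsn=1040),
    Replica("replica-2", healthy=True, last_applied_lsn=1075),
    Replica("replica-3", healthy=False, last_applied_lsn=1090),  # unreachable
]
new_primary, followers = failover(replicas)
print(new_primary.name, [f.name for f in followers])
```

Note that `replica-3` had applied more of the log but is unreachable, so any writes only it received are lost unless it is later reconciled with the new primary.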

Risks during failover:

Split-brain: Two nodes think they're the primary.
Data loss: Unreplicated writes on the old primary.
Downtime: Brief unavailability during transition.

Examples: PostgreSQL streaming replication, MySQL replication, MongoDB replica sets, Redis Sentinel.

2. Multi-Leader (Master-Master)

Multiple nodes accept writes simultaneously. Each leader replicates to others.

[Leader 1] ←──writes/reads
     ↕ replication
[Leader 2] ←──writes/reads
     ↕ replication
[Leader 3] ←──writes/reads

Advantages:

Write availability — no single write bottleneck.
Works well for multi-datacenter deployments (one leader per DC).
Better write latency (write to local leader).

Disadvantages:

Write conflicts: Same data modified on different leaders simultaneously.
Complex conflict resolution.
More complex to operate and debug.

Conflict resolution strategies:
| Strategy | Description |
|----------|-------------|
| Last-writer-wins (LWW) | Timestamp-based; simple but can lose data |
| Custom resolution | Application-specific merge logic |
| CRDTs | Data structures that automatically merge |
| Conflict avoidance | Route all writes for a key to the same leader |
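
To make the trade-off between the first and third strategies concrete, here is a small sketch contrasting last-writer-wins with a grow-only counter (G-Counter), one of the simplest CRDTs. The data shapes and function names are assumptions chosen for illustration.

```python
def lww_merge(a, b):
    """Last-writer-wins: keep the (timestamp, value) pair with the larger timestamp."""
    return max(a, b, key=lambda pair: pair[0])


def gcounter_merge(a, b):
    """G-Counter CRDT: take the per-leader maximum of each counter slot."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


# LWW: two leaders accept concurrent writes to the same key.
write_on_leader1 = (100, "cart=[book]")
write_on_leader2 = (101, "cart=[pen]")
lww_result = lww_merge(write_on_leader1, write_on_leader2)
print(lww_result)  # the write from leader 1 is silently discarded

# G-Counter: each leader increments only its own slot, so merging loses nothing.
counter_on_leader1 = {"leader1": 3, "leader2": 0}
counter_on_leader2 = {"leader1": 1, "leader2": 2}
merged = gcounter_merge(counter_on_leader1, counter_on_leader2)
total = sum(merged.values())
print(merged, total)
```

The G-Counter merge is commutative, associative, and idempotent, which is why replicas can exchange state in any order and still converge.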

Examples: CouchDB, MySQL Group Replication (in multi-primary mode), Active Directory.

3. Leaderless (Peer-to-Peer)

No single leader — any node can accept reads and writes. Uses quorum-based consistency.

Client ──write──→ [Node 1] ✓
       ──write──→ [Node 2] ✓   (W=2 of N=3 must ACK)
       ──write──→ [Node 3] ✗   (this failure is tolerated)

Client ──read──→ [Node 1] → v2
       ──read──→ [Node 2] → v2   (R=2 of N=3 must respond)
       ──read──→ [Node 3] → v1   (stale; v2 wins because it has the higher version)

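Quorum condition:

$$W + R > N$$

Where $W$ = write quorum, $R$ = read quorum, $N$ = total replicas. When the condition holds, every read quorum shares at least one node with every write quorum, so at least one response carries the latest acknowledged write.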
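
The condition W + R > N and the version-based read can be sketched as follows. This is a simplified model, assuming each replica returns a `(version, value)` pair; the function names are hypothetical.

```python
def quorum_overlaps(n, w, r):
    """True when any read quorum must share at least one node with any write quorum."""
    return w + r > n


def quorum_read(responses, r):
    """Return the newest (version, value) among the replicas that answered.

    responses: list of (version, value) pairs from replicas that responded.
    """
    if len(responses) < r:
        raise RuntimeError("read quorum not reached")
    return max(responses, key=lambda resp: resp[0])


print(quorum_overlaps(n=3, w=2, r=2))  # overlap guaranteed
print(quorum_overlaps(n=3, w=1, r=1))  # a read may miss the latest write

# Two of three nodes answered; one of them is stale.
latest = quorum_read([(2, "v2"), (1, "v1")], r=2)
print(latest)
```

Lowering W or R trades consistency for latency and availability; raising them does the opposite.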

Advantages:

No single point of failure — any node can serve any request.
Tunable consistency (adjust W and R).
High availability for writes.

Disadvantages:

Read repair and anti-entropy needed to fix stale replicas.
More complex client logic.
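Weaker consistency guarantees: even with overlapping quorums, concurrent writes and partial failures can still produce stale or conflicting reads.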

Anti-entropy mechanisms:
| Mechanism | Description |
|-----------|-------------|
| Read repair | On read, if a stale replica is detected, update it |
| Anti-entropy process | Background process compares replicas and syncs differences |
| Merkle trees | Efficient data structure to detect and sync differences |
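
The Merkle tree idea can be illustrated with a small sketch. It assumes each replica's data is split into the same four key ranges and that the number of ranges is a power of two; the helper names are hypothetical, and real systems build these trees over much larger ranges.

```python
import hashlib


def h(text):
    return hashlib.sha256(text.encode()).hexdigest()


def build_tree(key_ranges):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    levels = [[h(data) for data in key_ranges]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def differing_ranges(tree_a, tree_b):
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    differing = []
    stack = [(len(tree_a) - 1, 0)]  # (level, index), starting at the root
    while stack:
        level, idx = stack.pop()
        if tree_a[level][idx] == tree_b[level][idx]:
            continue  # identical subtree: nothing below it needs syncing
        if level == 0:
            differing.append(idx)
        else:
            stack.append((level - 1, 2 * idx))
            stack.append((level - 1, 2 * idx + 1))
    return sorted(differing)


replica_a = ["a=1", "b=2", "c=3", "d=4"]
replica_b = ["a=1", "b=2", "c=9", "d=4"]  # third key range has diverged
out_of_sync = differing_ranges(build_tree(replica_a), build_tree(replica_b))
print(out_of_sync)
```

Only the hashes along the path to the differing range are compared, so two mostly identical replicas exchange very little data to find what needs syncing.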

Examples: Cassandra, Riak, Voldemort, and Amazon's original Dynamo design, which inspired the others.

Synchronous vs Asynchronous Replication

Synchronous Replication

The primary waits for replicas to confirm before acknowledging the write.

Client → Primary → Write to Replica 1 → ACK
                 → Write to Replica 2 → ACK
         Primary → ACK to Client (after ALL/SOME replicas confirm)

| Pros | Cons |
|------|------|
| Strong consistency | Higher write latency |
| No data loss on primary failure | Write availability depends on replicas |
| Guaranteed durable | Slower overall throughput |

Asynchronous Replication

The primary acknowledges the write immediately and replicates in the background.

Client → Primary → ACK to Client (immediate)
         Primary → Replicate to Replica 1 (background)
                 → Replicate to Replica 2 (background)

| Pros | Cons |
|------|------|
| Low write latency | Replication lag (stale reads) |
| High write throughput | Data loss possible if primary fails before replication |
| Write availability independent of replicas | Eventual consistency |

Semi-Synchronous

A compromise: wait for at least one replica before ACK.

Client → Primary → Write to Replica 1 → ACK   ← Wait for this one
                 → Write to Replica 2          ← Background (best effort)
         Primary → ACK to Client

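Common in production setups, though usually something you enable rather than a default: for example, MySQL semi-synchronous replication (installed as a plugin), or PostgreSQL with synchronous_commit on and a standby listed in synchronous_standby_names.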
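
The three modes differ only in how many replica acknowledgements the primary waits for. The sketch below models that as a single parameter. It assumes the primary sends to all replicas in parallel and that the latencies are known up front, which is a simplification; `replication_wait_ms` is a hypothetical helper.

```python
def replication_wait_ms(replica_latencies_ms, required_acks):
    """Time the primary waits on replicas before acknowledging the client.

    required_acks = 0              -> asynchronous
    required_acks = 1              -> semi-synchronous
    required_acks = len(replicas)  -> fully synchronous
    """
    if not 0 <= required_acks <= len(replica_latencies_ms):
        raise ValueError("required_acks must be between 0 and the replica count")
    if required_acks == 0:
        return 0
    # Replicas are written in parallel, so the wait is the k-th fastest ACK.
    return sorted(replica_latencies_ms)[required_acks - 1]


latencies = [40, 5, 120]  # e.g. same rack, same region, remote region
async_wait = replication_wait_ms(latencies, required_acks=0)
semi_sync_wait = replication_wait_ms(latencies, required_acks=1)
sync_wait = replication_wait_ms(latencies, required_acks=3)
print(async_wait, semi_sync_wait, sync_wait)
```

With these numbers, semi-synchronous replication pays only for the fastest replica, while fully synchronous replication is held back by the slowest one.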

Replication Lag

The delay between a write to the primary and its appearance on replicas.

Time →
Primary:  WRITE(x=5) ─────────────────────────────
Replica:             ──── lag ────── WRITE(x=5) ──
                                     ↑
                          Replication lag (e.g., 100ms)

Problems Caused by Lag

| Problem | Description | Solution |
|---------|-------------|----------|
| Stale reads | User writes then reads, but reads from stale replica | Read-your-writes: route reads to primary after writes |
| Monotonic read violations | User sees newer data then older data on next read | Monotonic reads: route user to same replica |
| Causal ordering violations | Reply appears before the original post | Causal consistency: track dependencies |
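
One simple way to get monotonic reads is to pin each user to a replica by hashing the user ID. The sketch below assumes a fixed list of replica names; `replica_for_user` is a hypothetical helper, and a real deployment would also need to handle replicas joining or leaving.

```python
import hashlib


def replica_for_user(user_id, replicas):
    """Map a user to the same replica on every request (stable across processes)."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]


replicas = ["replica-1", "replica-2", "replica-3"]
first = replica_for_user("user-42", replicas)
second = replica_for_user("user-42", replicas)
print(first == second)
```

A cryptographic hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would break stickiness across application servers.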

Reading Your Own Writes

User writes profile → Primary DB
User reads profile → Replica (stale!) → Sees old profile 😤

Fix: After writing, read from primary for a short window (e.g., 10 seconds)
     OR: Read from replica only if replica_lag < threshold
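
The first fix can be sketched as a small router that remembers when each user last wrote. The class and method names are hypothetical, and the 10-second window is the example value from above; in practice the window should exceed the typical replication lag.

```python
import time


class ReadRouter:
    """Send a user's reads to the primary for a short window after they write."""

    def __init__(self, window_seconds=10.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.last_write_at = {}

    def record_write(self, user_id):
        self.last_write_at[user_id] = self.clock()

    def target_for_read(self, user_id):
        last = self.last_write_at.get(user_id)
        if last is not None and self.clock() - last < self.window_seconds:
            return "primary"
        return "replica"


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
router = ReadRouter(window_seconds=10.0, clock=lambda: now[0])
router.record_write("alice")
now[0] = 3.0
print(router.target_for_read("alice"))  # within the window
now[0] = 12.0
print(router.target_for_read("alice"))  # window has passed
```

Only the user who wrote is routed to the primary; other users keep reading from replicas, so read scalability is largely preserved.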

Replication Methods

Statement-Based Replication

Replicate the SQL statements themselves.

Compact and simple.
Problem: Non-deterministic functions (NOW(), RAND()) return different values.

Write-Ahead Log (WAL) Shipping

Ship the database's WAL to replicas.

Byte-level accurate.
Problem: Tightly coupled to the storage engine's on-disk format, so the primary and replicas usually have to run the same database version.

Logical (Row-Based) Replication

Replicate the logical changes (insert/update/delete rows).

Engine-independent.
Can replicate between different database versions.
Used by: MySQL binlog (row-based format), PostgreSQL logical replication.
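
The difference between statement-based and row-based replication shows up as soon as a statement is non-deterministic. The sketch below simulates a statement like `UPDATE t SET token = RAND()` using in-memory dictionaries; the function names and the change-record shape are assumptions for illustration only.

```python
import random


def apply_statement(db, rng):
    """Statement-based: every node re-executes the statement and calls RAND() itself."""
    db["token"] = rng.random()


def apply_row_change(db, change):
    """Row-based: the replica applies the value the primary already computed."""
    db[change["column"]] = change["new_value"]


primary, statement_replica, row_replica = {}, {}, {}

# Each node has its own random state, as separate database servers would.
apply_statement(primary, random.Random(1))
apply_statement(statement_replica, random.Random(2))
apply_row_change(row_replica, {"column": "token", "new_value": primary["token"]})

print(statement_replica == primary)  # diverged
print(row_replica == primary)        # identical
```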

Change Data Capture (CDC)

Capture changes from the database log and publish to a stream (e.g., Kafka).

Database → CDC (Debezium) → Kafka → Search Index
                                   → Cache
                                   → Analytics DB
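
The fan-out in the diagram can be sketched with an in-memory stand-in for the stream. Nothing here uses the real Debezium or Kafka APIs: `ChangeStream`, the handlers, and the event shape (`op`, `key`, `after`) are all hypothetical and only illustrate how independent consumers react to the same change events.

```python
class ChangeStream:
    """In-memory stand-in for a change stream with multiple consumers."""

    def __init__(self):
        self.handlers = []

    def subscribe(self, handler):
        self.handlers.append(handler)

    def publish(self, event):
        for handler in self.handlers:
            handler(event)


search_index = {}
cache = {"user:1": {"name": "stale value"}}


def update_search_index(event):
    if event["op"] == "delete":
        search_index.pop(event["key"], None)
    else:
        search_index[event["key"]] = event["after"]["name"].lower()


def invalidate_cache(event):
    # Any change to a row makes the cached copy untrustworthy.
    cache.pop(event["key"], None)


stream = ChangeStream()
stream.subscribe(update_search_index)
stream.subscribe(invalidate_cache)

stream.publish({"op": "update", "key": "user:1", "after": {"name": "Ada Lovelace"}})
stream.publish({"op": "insert", "key": "user:2", "after": {"name": "Alan Turing"}})
stream.publish({"op": "delete", "key": "user:2", "after": None})
print(search_index, cache)
```

Because consumers read from the stream rather than from the application, the database remains the single source of truth and derived systems cannot drift because of a missed dual write.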

Redundancy Patterns

Active-Passive (Hot Standby)

[Active Server] ←── All traffic
       ↕ heartbeat / replication
[Passive Server]  ← Takes over on failure

The passive server sits idle during normal operation (unused capacity) but is ready to take over.
Failover time: seconds to minutes.

Active-Active

[Server 1] ←── Traffic (50%)
[Server 2] ←── Traffic (50%)

Both servers handle traffic simultaneously.
Better resource utilization.
Need conflict resolution for writes.

N+1 Redundancy

Run N+1 instances when you need N to handle load. One instance can fail without impact.

Need 3 servers for peak load → Run 4 servers
If one fails → 3 healthy servers still handle peak load

N+2 Redundancy

Run N+2 instances to tolerate one failure AND one maintenance window simultaneously.
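
The sizing arithmetic behind N+1 and N+2 can be written down directly. The peak load and per-instance capacity below are made-up numbers, and `instances_to_run` is a hypothetical helper.

```python
import math


def instances_to_run(peak_load, capacity_per_instance, spare=1):
    """N is the minimum count that covers peak load; spare is the extra headroom."""
    n = math.ceil(peak_load / capacity_per_instance)
    return n + spare


# Assume a peak of 3000 requests/s and 1000 requests/s per instance, so N = 3.
n_plus_1 = instances_to_run(3000, 1000, spare=1)
n_plus_2 = instances_to_run(3000, 1000, spare=2)
print(n_plus_1, n_plus_2)
```

Rounding up matters: a peak of 3001 requests/s would already need N = 4 before any spare capacity is added.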

Multi-Datacenter Replication

┌──── US-East DC ─────┐      ┌──── EU-West DC ─────┐
│ [Leader]             │←────→│ [Leader/Replica]     │
│ [Replica] [Replica]  │      │ [Replica] [Replica]  │
└──────────────────────┘      └──────────────────────┘
              ↕                            ↕
┌──── US-West DC ─────┐      ┌──── APAC DC ─────────┐
│ [Replica]            │      │ [Replica]             │
│ [Replica] [Replica]  │      │ [Replica] [Replica]   │
└──────────────────────┘      └──────────────────────┘

Considerations:

Cross-DC latency: 50-200ms per replication message.
Asynchronous replication preferred (synchronous too slow across DCs).
Conflict resolution needed for multi-leader across DCs.
Data sovereignty: Some data must stay in specific regions.

Summary

| Concept | Key Point |
|---------|-----------|
| Single leader | Simple, no conflicts, but write bottleneck |
| Multi-leader | Multi-DC writes, but conflict resolution needed |
| Leaderless | Highly available, quorum-based consistency |
| Sync replication | Strong consistency, higher latency |
| Async replication | Low latency, eventual consistency, risk of data loss |
| Replication lag | Causes stale reads; mitigate with read-your-writes |
| Active-passive | Standby takes over on failure |
| Active-active | Both serve traffic; better utilization |

Rule of thumb: Use single-leader replication with asynchronous replication for most applications. Use multi-leader for multi-datacenter setups. Use leaderless when high write availability is critical.