4. Databases
Databases are the backbone of almost every system. Choosing the right database and understanding its trade-offs is critical for building scalable, reliable systems.
SQL (Relational) Databases
Relational databases store data in tables with predefined schemas. They use Structured Query Language (SQL) for querying and enforce ACID properties.
ACID Properties
| Property | Description | Example |
|---|---|---|
| Atomicity | A transaction is all-or-nothing | Transfer $100: debit AND credit both succeed or both fail |
| Consistency | Database moves from one valid state to another | Foreign key constraints are never violated |
| Isolation | Concurrent transactions don't interfere | Two users buying the last item — only one succeeds |
| Durability | Committed data survives crashes | Data written to disk/WAL before acknowledging commit |
Isolation Levels
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Performance |
|---|---|---|---|---|
| Read Uncommitted | Yes | Yes | Yes | Fastest |
| Read Committed | No | Yes | Yes | Fast |
| Repeatable Read | No | No | Yes | Moderate |
| Serializable | No | No | No | Slowest |
Normalization
Organizing data to reduce redundancy:
| Normal Form | Rule | Example |
|---|---|---|
| 1NF | Atomic values, no repeating groups | Each cell has a single value |
| 2NF | 1NF + no partial dependencies | Non-key columns depend on the entire primary key |
| 3NF | 2NF + no transitive dependencies | Non-key columns depend only on the primary key |
| BCNF | Every determinant is a candidate key | Stricter version of 3NF |
Denormalization
Intentionally adding redundancy to improve read performance at the cost of write complexity.
Normalized:
orders(id, user_id, total)
users(id, name, email)
→ JOIN to get user name with order
Denormalized:
orders(id, user_id, user_name, user_email, total)
→ No JOIN needed, but must update user_name in multiple places
Indexing
An index creates a data structure (usually a B-tree or B+ tree) that allows
Types of Indexes:
| Type | Description | Use Case |
|---|---|---|
| Primary Index | Automatically created on primary key | Unique row lookups |
| Secondary Index | Created on non-primary columns | Query optimization |
| Composite Index | Index on multiple columns | Multi-column WHERE clauses |
| Covering Index | Includes all columns needed by a query | Avoids table lookup |
| Unique Index | Enforces uniqueness | Email, username fields |
| Full-Text Index | For text search | Search functionality |
| Partial Index | Index on a subset of rows | WHERE status = 'active' |
Trade-offs:
- Indexes speed up reads but slow down writes (index must be updated).
- Indexes consume additional storage.
- Too many indexes can hurt overall performance.
Popular SQL Databases
| Database | Key Features |
|---|---|
| PostgreSQL | Advanced features, JSONB, full-text search, extensions |
| MySQL | Widely used, good performance, InnoDB engine |
| SQL Server | Enterprise, Windows ecosystem, T-SQL |
| Oracle | Enterprise, partitioning, RAC (clustering) |
| SQLite | Embedded, serverless, single-file database |
| CockroachDB | Distributed SQL, Spanner-inspired, strong consistency |
| TiDB | MySQL-compatible distributed SQL |
NoSQL Databases
NoSQL ("Not Only SQL") databases provide flexible schemas and are designed for specific access patterns that relational databases handle poorly at scale.
Types of NoSQL Databases
1. Key-Value Stores
Store data as simple key-value pairs. Fastest for simple lookups.
Key: "user:123" → Value: {"name": "Alice", "age": 30}
Key: "session:abc" → Value: "eyJhbGciOi..."
| Database | Features |
|---|---|
| Redis | In-memory, data structures, persistence options |
| DynamoDB | Managed, auto-scaling, single-digit ms latency |
| Memcached | In-memory, simple, multi-threaded |
| etcd | Distributed, consistent, used for config/service discovery |
Best for: Session management, caching, real-time leaderboards, feature flags.
2. Document Stores
Store data as semi-structured documents (JSON, BSON). Each document can have a different structure.
{
"_id": "user123",
"name": "Alice",
"addresses": [
{"city": "NYC", "zip": "10001"},
{"city": "LA", "zip": "90001"}
],
"preferences": {
"theme": "dark",
"notifications": true
}
}
| Database | Features |
|---|---|
| MongoDB | Flexible schema, aggregation pipeline, sharding |
| Couchbase | JSON documents, built-in caching, SQL-like query (N1QL) |
| Firestore | Managed, real-time sync, offline support |
Best for: Content management, user profiles, catalogs, event logging.
3. Wide-Column Stores
Data is stored in tables with rows and dynamic columns. Each row can have different columns.
Row Key: "user:123"
Column Family "profile": { name: "Alice", age: 30 }
Column Family "activity": { last_login: "2025-01-01", posts: 42 }
Row Key: "user:456"
Column Family "profile": { name: "Bob", email: "bob@example.com" }
(no "activity" column family)
| Database | Features |
|---|---|
| Cassandra | Highly available, linear scalability, tunable consistency |
| HBase | Built on HDFS, strong consistency, Hadoop ecosystem |
| ScyllaDB | C++ rewrite of Cassandra, better performance |
Best for: Time-series data, IoT, messaging, recommendation engines.
4. Graph Databases
Store data as nodes (entities) and edges (relationships). Optimized for traversing relationships.
(Alice)--[FRIENDS_WITH]-->(Bob)--[WORKS_AT]-->(Google)
| |
+--[LIKES]-->(Post:123) +--[FRIENDS_WITH]-->(Charlie)
| Database | Features |
|---|---|
| Neo4j | Most popular, Cypher query language, ACID |
| Amazon Neptune | Managed, supports Gremlin and SPARQL |
| JanusGraph | Distributed, scales horizontally |
Best for: Social networks, fraud detection, recommendation engines, knowledge graphs.
SQL vs NoSQL Comparison
| Aspect | SQL | NoSQL |
|---|---|---|
| Schema | Fixed, predefined | Flexible, dynamic |
| Query Language | SQL (standardized) | Varies by database |
| Scaling | Primarily vertical | Primarily horizontal |
| ACID | Full support | Varies (often eventual consistency) |
| Joins | Native support | Limited or application-level |
| Data Model | Tables with rows/columns | Documents, KV, columns, graphs |
| Best For | Complex queries, transactions | High throughput, flexible data |
| Examples | PostgreSQL, MySQL | MongoDB, Cassandra, Redis |
When to Choose SQL
- Complex relationships between entities.
- Need for transactions (banking, e-commerce).
- Data structure is well-defined and stable.
- Complex queries with joins, aggregations, subqueries.
- Regulatory requirements for data integrity.
When to Choose NoSQL
- Rapidly evolving schema.
- Need for horizontal scalability.
- High read/write throughput.
- Hierarchical or nested data (JSON documents).
- Specific access patterns (key lookups, time-series, graphs).
BASE Properties (NoSQL)
An alternative to ACID for distributed systems:
| Property | Description |
|---|---|
| Basically Available | System guarantees availability (may return stale data) |
| Soft State | System state may change over time even without input |
| Eventually Consistent | Data will become consistent after some time |
Database Scaling Techniques
Read Replicas
Writes ──→ Primary DB
Reads ──→ Replica 1 / Replica 2 / Replica 3
- Primary handles all writes.
- Replicas asynchronously replicate from primary.
- Read-heavy workloads benefit enormously.
- Trade-off: Replication lag → eventual consistency for reads.
Sharding (Horizontal Partitioning)
Split data across multiple database instances by a shard key.
Shard 1: Users A-H
Shard 2: Users I-P
Shard 3: Users Q-Z
(See: 12-data-partitioning.md for detailed coverage.)
Connection Pooling
Maintain a pool of reusable database connections to avoid the overhead of creating new connections.
App Servers → Connection Pool (PgBouncer, HikariCP) → Database
Materialized Views
Pre-computed query results stored as a table. Refreshed periodically or on demand.
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT product_id, SUM(amount) as total
FROM orders
WHERE order_date >= NOW() - INTERVAL '30 days'
GROUP BY product_id;
Database Transactions
Transaction Lifecycle
BEGIN → READ/WRITE → COMMIT (or ROLLBACK)
Distributed Transactions
Two-Phase Commit (2PC)
Phase 1 (Prepare): Coordinator asks all participants: "Can you commit?"
Phase 2 (Commit): If all say YES → Coordinator says "Commit"
If any says NO → Coordinator says "Rollback"
Problem: Coordinator is a single point of failure. If it crashes between phases, participants are blocked.
Saga Pattern
A sequence of local transactions where each step has a compensating action.
Order Saga:
1. Create Order → (compensate: Cancel Order)
2. Reserve Inventory → (compensate: Release Inventory)
3. Process Payment → (compensate: Refund Payment)
4. Ship Order → (compensate: Cancel Shipment)
If step 3 fails: Run compensations for steps 2, then 1.
Types:
- Choreography: Each service listens for events and acts.
- Orchestration: A central coordinator directs the saga.
NewSQL Databases
Combine the scalability of NoSQL with the ACID guarantees of SQL.
| Database | Key Features |
|---|---|
| Google Spanner | Globally distributed, strongly consistent, TrueTime |
| CockroachDB | PostgreSQL-compatible, geo-distributed, serializable |
| TiDB | MySQL-compatible, horizontal scaling, HTAP |
| YugabyteDB | PostgreSQL-compatible, distributed, cloud-native |
| VoltDB | In-memory, ACID, high throughput |
Choosing a Database — Decision Framework
Do you need ACID transactions?
├── Yes → Is your data relational?
│ ├── Yes → PostgreSQL / MySQL
│ └── No → MongoDB (with transactions) / CockroachDB
└── No → What's your primary access pattern?
├── Key-Value lookups → Redis / DynamoDB
├── Document storage → MongoDB / Couchbase
├── Time-series → Cassandra / InfluxDB / TimescaleDB
├── Graph traversal → Neo4j / Neptune
├── Full-text search → Elasticsearch / OpenSearch
└── Analytics/OLAP → ClickHouse / BigQuery / Redshift
Summary
| Concept | Key Point |
|---|---|
| SQL | Structured, ACID, strong consistency, vertical scaling |
| NoSQL | Flexible schema, horizontal scaling, eventual consistency |
| ACID vs BASE | ACID = correctness; BASE = availability |
| Indexing | Critical for read performance; costs write performance |
| Scaling | Read replicas for reads; sharding for writes |
| Transactions | 2PC for strong consistency; Saga for eventual consistency |
| NewSQL | Best of both worlds — distributed + ACID |