4. Databases

Databases are the backbone of almost every system. Choosing the right database and understanding its trade-offs is critical for building scalable, reliable systems.


SQL (Relational) Databases

Relational databases store data in tables with predefined schemas. They use Structured Query Language (SQL) for querying and enforce ACID properties.

ACID Properties

Property Description Example
Atomicity A transaction is all-or-nothing Transfer $100: debit AND credit both succeed or both fail
Consistency Database moves from one valid state to another Foreign key constraints are never violated
Isolation Concurrent transactions don't interfere Two users buying the last item — only one succeeds
Durability Committed data survives crashes Data written to disk/WAL before acknowledging commit

Isolation Levels

Level Dirty Read Non-Repeatable Read Phantom Read Performance
Read Uncommitted Yes Yes Yes Fastest
Read Committed No Yes Yes Fast
Repeatable Read No No Yes Moderate
Serializable No No No Slowest

Normalization

Organizing data to reduce redundancy:

Normal Form Rule Example
1NF Atomic values, no repeating groups Each cell has a single value
2NF 1NF + no partial dependencies Non-key columns depend on the entire primary key
3NF 2NF + no transitive dependencies Non-key columns depend only on the primary key
BCNF Every determinant is a candidate key Stricter version of 3NF

Denormalization

Intentionally adding redundancy to improve read performance at the cost of write complexity.

Normalized:
  orders(id, user_id, total)
  users(id, name, email)
  → JOIN to get user name with order

Denormalized:
  orders(id, user_id, user_name, user_email, total)
  → No JOIN needed, but must update user_name in multiple places

Indexing

An index creates a data structure (usually a B-tree or B+ tree) that allows lookups instead of full table scans.

Types of Indexes:

Type Description Use Case
Primary Index Automatically created on primary key Unique row lookups
Secondary Index Created on non-primary columns Query optimization
Composite Index Index on multiple columns Multi-column WHERE clauses
Covering Index Includes all columns needed by a query Avoids table lookup
Unique Index Enforces uniqueness Email, username fields
Full-Text Index For text search Search functionality
Partial Index Index on a subset of rows WHERE status = 'active'

Trade-offs:

  • Indexes speed up reads but slow down writes (index must be updated).
  • Indexes consume additional storage.
  • Too many indexes can hurt overall performance.
Database Key Features
PostgreSQL Advanced features, JSONB, full-text search, extensions
MySQL Widely used, good performance, InnoDB engine
SQL Server Enterprise, Windows ecosystem, T-SQL
Oracle Enterprise, partitioning, RAC (clustering)
SQLite Embedded, serverless, single-file database
CockroachDB Distributed SQL, Spanner-inspired, strong consistency
TiDB MySQL-compatible distributed SQL

NoSQL Databases

NoSQL ("Not Only SQL") databases provide flexible schemas and are designed for specific access patterns that relational databases handle poorly at scale.

Types of NoSQL Databases

1. Key-Value Stores

Store data as simple key-value pairs. Fastest for simple lookups.

Key: "user:123"  →  Value: {"name": "Alice", "age": 30}
Key: "session:abc"  →  Value: "eyJhbGciOi..."
Database Features
Redis In-memory, data structures, persistence options
DynamoDB Managed, auto-scaling, single-digit ms latency
Memcached In-memory, simple, multi-threaded
etcd Distributed, consistent, used for config/service discovery

Best for: Session management, caching, real-time leaderboards, feature flags.

2. Document Stores

Store data as semi-structured documents (JSON, BSON). Each document can have a different structure.

{
  "_id": "user123",
  "name": "Alice",
  "addresses": [
    {"city": "NYC", "zip": "10001"},
    {"city": "LA", "zip": "90001"}
  ],
  "preferences": {
    "theme": "dark",
    "notifications": true
  }
}
Database Features
MongoDB Flexible schema, aggregation pipeline, sharding
Couchbase JSON documents, built-in caching, SQL-like query (N1QL)
Firestore Managed, real-time sync, offline support

Best for: Content management, user profiles, catalogs, event logging.

3. Wide-Column Stores

Data is stored in tables with rows and dynamic columns. Each row can have different columns.

Row Key: "user:123"
  Column Family "profile":  { name: "Alice", age: 30 }
  Column Family "activity": { last_login: "2025-01-01", posts: 42 }

Row Key: "user:456"
  Column Family "profile":  { name: "Bob", email: "bob@example.com" }
  (no "activity" column family)
Database Features
Cassandra Highly available, linear scalability, tunable consistency
HBase Built on HDFS, strong consistency, Hadoop ecosystem
ScyllaDB C++ rewrite of Cassandra, better performance

Best for: Time-series data, IoT, messaging, recommendation engines.

4. Graph Databases

Store data as nodes (entities) and edges (relationships). Optimized for traversing relationships.

(Alice)--[FRIENDS_WITH]-->(Bob)--[WORKS_AT]-->(Google)
   |                         |
   +--[LIKES]-->(Post:123)   +--[FRIENDS_WITH]-->(Charlie)
Database Features
Neo4j Most popular, Cypher query language, ACID
Amazon Neptune Managed, supports Gremlin and SPARQL
JanusGraph Distributed, scales horizontally

Best for: Social networks, fraud detection, recommendation engines, knowledge graphs.


SQL vs NoSQL Comparison

Aspect SQL NoSQL
Schema Fixed, predefined Flexible, dynamic
Query Language SQL (standardized) Varies by database
Scaling Primarily vertical Primarily horizontal
ACID Full support Varies (often eventual consistency)
Joins Native support Limited or application-level
Data Model Tables with rows/columns Documents, KV, columns, graphs
Best For Complex queries, transactions High throughput, flexible data
Examples PostgreSQL, MySQL MongoDB, Cassandra, Redis

When to Choose SQL

  • Complex relationships between entities.
  • Need for transactions (banking, e-commerce).
  • Data structure is well-defined and stable.
  • Complex queries with joins, aggregations, subqueries.
  • Regulatory requirements for data integrity.

When to Choose NoSQL

  • Rapidly evolving schema.
  • Need for horizontal scalability.
  • High read/write throughput.
  • Hierarchical or nested data (JSON documents).
  • Specific access patterns (key lookups, time-series, graphs).

BASE Properties (NoSQL)

An alternative to ACID for distributed systems:

Property Description
Basically Available System guarantees availability (may return stale data)
Soft State System state may change over time even without input
Eventually Consistent Data will become consistent after some time

Database Scaling Techniques

Read Replicas

Writes ──→ Primary DB
Reads  ──→ Replica 1 / Replica 2 / Replica 3
  • Primary handles all writes.
  • Replicas asynchronously replicate from primary.
  • Read-heavy workloads benefit enormously.
  • Trade-off: Replication lag → eventual consistency for reads.

Sharding (Horizontal Partitioning)

Split data across multiple database instances by a shard key.

Shard 1: Users A-H
Shard 2: Users I-P
Shard 3: Users Q-Z

(See: 12-data-partitioning.md for detailed coverage.)

Connection Pooling

Maintain a pool of reusable database connections to avoid the overhead of creating new connections.

App Servers → Connection Pool (PgBouncer, HikariCP) → Database

Materialized Views

Pre-computed query results stored as a table. Refreshed periodically or on demand.

CREATE MATERIALIZED VIEW monthly_sales AS
SELECT product_id, SUM(amount) as total
FROM orders
WHERE order_date >= NOW() - INTERVAL '30 days'
GROUP BY product_id;

Database Transactions

Transaction Lifecycle

BEGIN → READ/WRITE → COMMIT (or ROLLBACK)

Distributed Transactions

Two-Phase Commit (2PC)

Phase 1 (Prepare): Coordinator asks all participants: "Can you commit?"
Phase 2 (Commit):  If all say YES → Coordinator says "Commit"
                   If any says NO → Coordinator says "Rollback"

Problem: Coordinator is a single point of failure. If it crashes between phases, participants are blocked.

Saga Pattern

A sequence of local transactions where each step has a compensating action.

Order Saga:
1. Create Order → (compensate: Cancel Order)
2. Reserve Inventory → (compensate: Release Inventory)
3. Process Payment → (compensate: Refund Payment)
4. Ship Order → (compensate: Cancel Shipment)

If step 3 fails: Run compensations for steps 2, then 1.

Types:

  • Choreography: Each service listens for events and acts.
  • Orchestration: A central coordinator directs the saga.

NewSQL Databases

Combine the scalability of NoSQL with the ACID guarantees of SQL.

Database Key Features
Google Spanner Globally distributed, strongly consistent, TrueTime
CockroachDB PostgreSQL-compatible, geo-distributed, serializable
TiDB MySQL-compatible, horizontal scaling, HTAP
YugabyteDB PostgreSQL-compatible, distributed, cloud-native
VoltDB In-memory, ACID, high throughput

Choosing a Database — Decision Framework

Do you need ACID transactions?
├── Yes → Is your data relational?
│         ├── Yes → PostgreSQL / MySQL
│         └── No → MongoDB (with transactions) / CockroachDB
└── No → What's your primary access pattern?
          ├── Key-Value lookups → Redis / DynamoDB
          ├── Document storage → MongoDB / Couchbase
          ├── Time-series → Cassandra / InfluxDB / TimescaleDB
          ├── Graph traversal → Neo4j / Neptune
          ├── Full-text search → Elasticsearch / OpenSearch
          └── Analytics/OLAP → ClickHouse / BigQuery / Redshift

Summary

Concept Key Point
SQL Structured, ACID, strong consistency, vertical scaling
NoSQL Flexible schema, horizontal scaling, eventual consistency
ACID vs BASE ACID = correctness; BASE = availability
Indexing Critical for read performance; costs write performance
Scaling Read replicas for reads; sharding for writes
Transactions 2PC for strong consistency; Saga for eventual consistency
NewSQL Best of both worlds — distributed + ACID