04. Databases

title: 04. Databases

Databases are the backbone of almost every system. Choosing the right database and understanding its trade-offs is critical for building scalable, reliable systems.

SQL (Relational) Databases

Relational databases store data in tables with predefined schemas. They use Structured Query Language (SQL) for querying and enforce ACID properties.

ACID Properties

Property	Description	Example
Atomicity	A transaction is all-or-nothing	Transfer $100: debit AND credit both succeed or both fail
Consistency	Database moves from one valid state to another	Foreign key constraints are never violated
Isolation	Concurrent transactions don't interfere	Two users buying the last item — only one succeeds
Durability	Committed data survives crashes	Data written to disk/WAL before acknowledging commit

Isolation Levels

Level	Dirty Read	Non-Repeatable Read	Phantom Read	Performance
Read Uncommitted	Yes	Yes	Yes	Fastest
Read Committed	No	Yes	Yes	Fast
Repeatable Read	No	No	Yes	Moderate
Serializable	No	No	No	Slowest

Normalization

Organizing data to reduce redundancy:

Normal Form	Rule	Example
1NF	Atomic values, no repeating groups	Each cell has a single value
2NF	1NF + no partial dependencies	Non-key columns depend on the entire primary key
3NF	2NF + no transitive dependencies	Non-key columns depend only on the primary key
BCNF	Every determinant is a candidate key	Stricter version of 3NF

Denormalization

Intentionally adding redundancy to improve read performance at the cost of write complexity.

Normalized:
  orders(id, user_id, total)
  users(id, name, email)
  → JOIN to get user name with order

Denormalized:
  orders(id, user_id, user_name, user_email, total)
  → No JOIN needed, but must update user_name in multiple places

Indexing

An index creates a data structure (usually a B-tree or B+ tree) that allows lookups instead of full table scans.

Types of Indexes:

Type	Description	Use Case
Primary Index	Automatically created on primary key	Unique row lookups
Secondary Index	Created on non-primary columns	Query optimization
Composite Index	Index on multiple columns	Multi-column WHERE clauses
Covering Index	Includes all columns needed by a query	Avoids table lookup
Unique Index	Enforces uniqueness	Email, username fields
Full-Text Index	For text search	Search functionality
Partial Index	Index on a subset of rows	`WHERE status = 'active'`

Trade-offs:

Indexes speed up reads but slow down writes (index must be updated).
Indexes consume additional storage.
Too many indexes can hurt overall performance.

Popular SQL Databases

Database	Key Features
PostgreSQL	Advanced features, JSONB, full-text search, extensions
MySQL	Widely used, good performance, InnoDB engine
SQL Server	Enterprise, Windows ecosystem, T-SQL
Oracle	Enterprise, partitioning, RAC (clustering)
SQLite	Embedded, serverless, single-file database
CockroachDB	Distributed SQL, Spanner-inspired, strong consistency
TiDB	MySQL-compatible distributed SQL

NoSQL Databases

NoSQL ("Not Only SQL") databases provide flexible schemas and are designed for specific access patterns that relational databases handle poorly at scale.

Types of NoSQL Databases

1. Key-Value Stores

Store data as simple key-value pairs. Fastest for simple lookups.

Key: "user:123"  →  Value: {"name": "Alice", "age": 30}
Key: "session:abc"  →  Value: "eyJhbGciOi..."

Database	Features
Redis	In-memory, data structures, persistence options
DynamoDB	Managed, auto-scaling, single-digit ms latency
Memcached	In-memory, simple, multi-threaded
etcd	Distributed, consistent, used for config/service discovery

Best for: Session management, caching, real-time leaderboards, feature flags.

2. Document Stores

Store data as semi-structured documents (JSON, BSON). Each document can have a different structure.

{
  "_id": "user123",
  "name": "Alice",
  "addresses": [
    {"city": "NYC", "zip": "10001"},
    {"city": "LA", "zip": "90001"}
  ],
  "preferences": {
    "theme": "dark",
    "notifications": true
  }
}

Database	Features
MongoDB	Flexible schema, aggregation pipeline, sharding
Couchbase	JSON documents, built-in caching, SQL-like query (N1QL)
Firestore	Managed, real-time sync, offline support

Best for: Content management, user profiles, catalogs, event logging.

3. Wide-Column Stores

Data is stored in tables with rows and dynamic columns. Each row can have different columns.

Row Key: "user:123"
  Column Family "profile":  { name: "Alice", age: 30 }
  Column Family "activity": { last_login: "2025-01-01", posts: 42 }

Row Key: "user:456"
  Column Family "profile":  { name: "Bob", email: "bob@example.com" }
  (no "activity" column family)

Database	Features
Cassandra	Highly available, linear scalability, tunable consistency
HBase	Built on HDFS, strong consistency, Hadoop ecosystem
ScyllaDB	C++ rewrite of Cassandra, better performance

Best for: Time-series data, IoT, messaging, recommendation engines.

4. Graph Databases

Store data as nodes (entities) and edges (relationships). Optimized for traversing relationships.

(Alice)--[FRIENDS_WITH]-->(Bob)--[WORKS_AT]-->(Google)
   |                         |
   +--[LIKES]-->(Post:123)   +--[FRIENDS_WITH]-->(Charlie)

Database	Features
Neo4j	Most popular, Cypher query language, ACID
Amazon Neptune	Managed, supports Gremlin and SPARQL
JanusGraph	Distributed, scales horizontally

Best for: Social networks, fraud detection, recommendation engines, knowledge graphs.

SQL vs NoSQL Comparison

Aspect	SQL	NoSQL
Schema	Fixed, predefined	Flexible, dynamic
Query Language	SQL (standardized)	Varies by database
Scaling	Primarily vertical	Primarily horizontal
ACID	Full support	Varies (often eventual consistency)
Joins	Native support	Limited or application-level
Data Model	Tables with rows/columns	Documents, KV, columns, graphs
Best For	Complex queries, transactions	High throughput, flexible data
Examples	PostgreSQL, MySQL	MongoDB, Cassandra, Redis

When to Choose SQL

Complex relationships between entities.
Need for transactions (banking, e-commerce).
Data structure is well-defined and stable.
Complex queries with joins, aggregations, subqueries.
Regulatory requirements for data integrity.

When to Choose NoSQL

Rapidly evolving schema.
Need for horizontal scalability.
High read/write throughput.
Hierarchical or nested data (JSON documents).
Specific access patterns (key lookups, time-series, graphs).

BASE Properties (NoSQL)

An alternative to ACID for distributed systems:

Property	Description
Basically Available	System guarantees availability (may return stale data)
Soft State	System state may change over time even without input
Eventually Consistent	Data will become consistent after some time

Database Scaling Techniques

Read Replicas

Writes ──→ Primary DB
Reads  ──→ Replica 1 / Replica 2 / Replica 3

Primary handles all writes.
Replicas asynchronously replicate from primary.
Read-heavy workloads benefit enormously.
Trade-off: Replication lag → eventual consistency for reads.

Sharding (Horizontal Partitioning)

Split data across multiple database instances by a shard key.

Shard 1: Users A-H
Shard 2: Users I-P
Shard 3: Users Q-Z

(See: 12-data-partitioning.md for detailed coverage.)

Connection Pooling

Maintain a pool of reusable database connections to avoid the overhead of creating new connections.

App Servers → Connection Pool (PgBouncer, HikariCP) → Database

Materialized Views

Pre-computed query results stored as a table. Refreshed periodically or on demand.

CREATE MATERIALIZED VIEW monthly_sales AS
SELECT product_id, SUM(amount) as total
FROM orders
WHERE order_date >= NOW() - INTERVAL '30 days'
GROUP BY product_id;

Database Transactions

Transaction Lifecycle

BEGIN → READ/WRITE → COMMIT (or ROLLBACK)

Distributed Transactions

Two-Phase Commit (2PC)

Phase 1 (Prepare): Coordinator asks all participants: "Can you commit?"
Phase 2 (Commit):  If all say YES → Coordinator says "Commit"
                   If any says NO → Coordinator says "Rollback"

Problem: Coordinator is a single point of failure. If it crashes between phases, participants are blocked.

Saga Pattern

A sequence of local transactions where each step has a compensating action.

Order Saga:
1. Create Order → (compensate: Cancel Order)
2. Reserve Inventory → (compensate: Release Inventory)
3. Process Payment → (compensate: Refund Payment)
4. Ship Order → (compensate: Cancel Shipment)

If step 3 fails: Run compensations for steps 2, then 1.

Types:

Choreography: Each service listens for events and acts.
Orchestration: A central coordinator directs the saga.

NewSQL Databases

Combine the scalability of NoSQL with the ACID guarantees of SQL.

Database	Key Features
Google Spanner	Globally distributed, strongly consistent, TrueTime
CockroachDB	PostgreSQL-compatible, geo-distributed, serializable
TiDB	MySQL-compatible, horizontal scaling, HTAP
YugabyteDB	PostgreSQL-compatible, distributed, cloud-native
VoltDB	In-memory, ACID, high throughput

Choosing a Database — Decision Framework

Do you need ACID transactions?
├── Yes → Is your data relational?
│         ├── Yes → PostgreSQL / MySQL
│         └── No → MongoDB (with transactions) / CockroachDB
└── No → What's your primary access pattern?
          ├── Key-Value lookups → Redis / DynamoDB
          ├── Document storage → MongoDB / Couchbase
          ├── Time-series → Cassandra / InfluxDB / TimescaleDB
          ├── Graph traversal → Neo4j / Neptune
          ├── Full-text search → Elasticsearch / OpenSearch
          └── Analytics/OLAP → ClickHouse / BigQuery / Redshift

Summary

Concept	Key Point
SQL	Structured, ACID, strong consistency, vertical scaling
NoSQL	Flexible schema, horizontal scaling, eventual consistency
ACID vs BASE	ACID = correctness; BASE = availability
Indexing	Critical for read performance; costs write performance
Scaling	Read replicas for reads; sharding for writes
Transactions	2PC for strong consistency; Saga for eventual consistency
NewSQL	Best of both worlds — distributed + ACID