11. Storage Systems

Understanding different storage types and their trade-offs is critical for system design. The choice of storage affects performance, durability, cost, and scalability.

Storage Hierarchy

                Speed        Capacity      Cost/GB
  CPU Registers  ████████████  tiny         $$$$$$$
  L1 Cache       ███████████   KB           $$$$$$
  L2 Cache       ██████████    MB           $$$$$
  L3 Cache       █████████     MB           $$$$
  RAM            ███████       GB           $$$
  SSD            █████         TB           $$
  HDD            ███           TB           $
  Network Storage █            PB           ¢
  Tape/Glacier   █            PB           fraction of ¢

Types of Storage

1. Block Storage

Data is stored in fixed-size blocks (typically 512 B to 4 KB), and each block has an address. The storage layer itself provides no file system; the OS or application decides how blocks are organized.

┌────┐┌────┐┌────┐┌────┐┌────┐
│BLK1││BLK2││BLK3││BLK4││BLK5│  ← Raw blocks on disk
└────┘└────┘└────┘└────┘└────┘

Characteristics:

Lowest level of storage abstraction.
High performance, low latency.
Each block is independently addressed.
No metadata, no hierarchy — just raw blocks.
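
To make block addressing concrete, here is a minimal sketch that uses a temporary file as a stand-in for a raw volume. The names `BLOCK_SIZE`, `write_block`, `read_block`, and `demo` are illustrative assumptions, not part of any real driver or cloud API; the only idea taken from the text is that a block's byte offset is its address times the block size.

```python
import tempfile
from typing import BinaryIO

BLOCK_SIZE = 4096  # assumed block size in bytes (4 KB)


def write_block(dev: BinaryIO, block_number: int, data: bytes) -> None:
    """Write exactly one fixed-size block at the given block address."""
    if len(data) > BLOCK_SIZE:
        raise ValueError("data does not fit in one block")
    # Pad short payloads with zero bytes so every block stays fixed-size.
    padded = data.ljust(BLOCK_SIZE, b"\x00")
    dev.seek(block_number * BLOCK_SIZE)
    dev.write(padded)


def read_block(dev: BinaryIO, block_number: int) -> bytes:
    """Read one block; its byte offset is block_number * BLOCK_SIZE."""
    dev.seek(block_number * BLOCK_SIZE)
    return dev.read(BLOCK_SIZE)


def demo() -> bytes:
    # A temporary file plays the role of the raw device in this sketch.
    with tempfile.TemporaryFile() as dev:
        write_block(dev, 3, b"hello")  # blocks 0-2 are left as zero bytes
        return read_block(dev, 3)
```

Note that nothing here knows about file names or directories: the caller has to remember that block 3 holds the payload, which is exactly the bookkeeping a file system or database adds on top.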

Use cases:

Operating system boot volumes.
Databases (need direct disk access for performance).
Virtual machine disks.

Cloud examples:
| Service | Provider |
|---------|----------|
| EBS (Elastic Block Store) | AWS |
| Persistent Disk | Google Cloud |
| Azure Managed Disks | Azure |

EBS volume types (AWS):
| Type | IOPS | Throughput | Use Case |
|------|------|-----------|----------|
| SSD (gp3) | 3,000-16,000 | 125-1,000 MB/s | General purpose |
| Provisioned IOPS SSD (io2) | Up to 256,000 | Up to 4,000 MB/s | High-performance databases |
| HDD (st1) | 500 | 500 MB/s | Big data, sequential reads |
| Cold HDD (sc1) | 250 | 250 MB/s | Infrequent access, archival |

2. File Storage

Data is stored as files organized in a hierarchical directory structure. Accessed via file paths.

/
├── home/
│   └── user/
│       ├── documents/
│       │   └── report.pdf
│       └── photos/
│           └── vacation.jpg
└── var/
    └── log/
        └── app.log

Characteristics:

Familiar hierarchy (directories and files).
Supports file-level operations (read, write, lock, permissions).
Can be shared across multiple servers (network file systems).
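
As a small illustration of path-based access, the sketch below builds part of the tree shown above inside a temporary directory, appends to a file in place, and then walks the hierarchy. The function name `build_and_list` and the file contents are assumptions made for the example; it uses only the standard-library `pathlib` and `tempfile` modules.

```python
import tempfile
from pathlib import Path


def build_and_list() -> list[str]:
    """Create a small directory tree and return the relative file paths."""
    with tempfile.TemporaryDirectory() as root_dir:
        root = Path(root_dir)

        # Files are addressed by path; parent directories must exist first.
        report = root / "home" / "user" / "documents" / "report.pdf"
        report.parent.mkdir(parents=True)
        report.write_bytes(b"placeholder report contents")

        log = root / "var" / "log" / "app.log"
        log.parent.mkdir(parents=True)
        log.write_text("started\n")

        # In-place modification: append to the file without rewriting it.
        with log.open("a") as f:
            f.write("ready\n")

        # Walk the hierarchy and collect every regular file.
        return sorted(
            p.relative_to(root).as_posix()
            for p in root.rglob("*")
            if p.is_file()
        )
```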

Protocols:
| Protocol | Description |
|----------|-------------|
| NFS (Network File System) | UNIX/Linux standard for network file sharing |
| SMB/CIFS | Windows standard for file sharing |
| HDFS | Hadoop Distributed File System for big data |

Cloud examples:
| Service | Provider |
|---------|----------|
| EFS (Elastic File System) | AWS |
| Filestore | Google Cloud |
| Azure Files | Azure |

Use cases:

Shared file systems across multiple servers.
Content management systems.
Machine learning training data.
Legacy applications needing a file system interface.

3. Object Storage

Data is stored as objects in a flat address space. Each object contains data, metadata, and a unique identifier.

┌──────────────────────────────────────┐
│ Bucket: "my-app-images"             │
│                                      │
│ ┌──────────────────────────────────┐│
│ │ Object: "users/123/avatar.jpg"  ││
│ │ Data: [binary image data]        ││
│ │ Metadata: {                      ││
│ │   content-type: "image/jpeg",    ││
│ │   size: 245760,                  ││
│ │   created: "2025-01-15",         ││
│ │   custom-tags: {"user": "123"}   ││
│ │ }                                ││
│ └──────────────────────────────────┘│
│                                      │
│ ┌──────────────────────────────────┐│
│ │ Object: "logs/2025/01/app.log"  ││
│ │ Data: [log text data]            ││
│ │ Metadata: { ... }               ││
│ └──────────────────────────────────┘│
└──────────────────────────────────────┘

Characteristics:

Flat namespace (no directories — "/" in key name is just convention).
Accessed via HTTP/REST API (PUT, GET, DELETE).
Rich metadata per object.
Virtually unlimited storage.
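Consistency varies by service: object stores were historically eventually consistent, but S3 (since December 2020), Google Cloud Storage, and Azure Blob Storage now offer strong read-after-write consistency.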
Not suitable for frequent small modifications (objects are immutable; must replace whole object).
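
The flat namespace and whole-object replacement can be sketched with an in-memory toy. `ToyBucket` and `StoredObject` are hypothetical names invented for this example and do not mirror any provider's SDK; a real service exposes the same operations over HTTP.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class ToyBucket:
    """In-memory stand-in for a bucket: one flat dict of key -> object."""

    def __init__(self) -> None:
        self._objects: dict[str, StoredObject] = {}

    def put(self, key: str, data: bytes, **metadata) -> None:
        # PUT replaces the whole object; there is no partial update.
        self._objects[key] = StoredObject(data, {**metadata, "size": len(data)})

    def get(self, key: str) -> StoredObject:
        return self._objects[key]

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)

    def list_keys(self, prefix: str = "") -> list[str]:
        # "Directories" are emulated by filtering keys on a string prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))
```

Appending one line to a log object means calling `put` again with the full new contents, which is why object storage is a poor fit for frequent small modifications.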

Cloud examples:
| Service | Provider |
|---------|----------|
| S3 (Simple Storage Service) | AWS |
| Cloud Storage | Google Cloud |
| Azure Blob Storage | Azure |

S3 Storage Classes:
| Class | Availability | Use Case | Cost |
|-------|-------------|----------|------|
| Standard | 99.99% | Frequently accessed | $$$ |
| Standard-IA | 99.9% | Infrequent access | $$ |
| Glacier Instant | 99.9% | Archive with millisecond retrieval | $ |
| Glacier Flexible | 99.99% | Archive (minutes to hours retrieval) | ¢ |
| Glacier Deep Archive | 99.99% | Long-term archive (12-48 hour retrieval) | fraction of ¢ |

Use cases:

Static assets (images, videos, CSS, JS).
Backups and archives.
Data lake storage for analytics.
Log storage.
Content distribution (origin for CDN).

Storage Comparison

| Aspect | Block | File | Object |
|--------|-------|------|--------|
| Access | Block address | File path | HTTP API (key) |
| Structure | Flat blocks | Hierarchical | Flat namespace |
| Metadata | None | Basic (permissions, dates) | Rich custom metadata |
| Performance | Highest | Moderate | Lower (HTTP overhead) |
| Scalability | Limited (single volume) | Moderate | Virtually unlimited |
| Modification | In-place | In-place | Replace entire object |
| Sharing | Single attachment | Multi-server (NFS) | Universal (HTTP) |
| Cost | $$$ | $$ | $ |
| Best for | Databases, OS | Shared files, CMS | Static assets, backups |

Distributed File Systems

HDFS (Hadoop Distributed File System)

Designed for storing very large files across clusters of commodity hardware.

Client → NameNode (metadata: which blocks, which DataNodes)
                ↓
         [DataNode 1] [DataNode 2] [DataNode 3]
         Block A1      Block A1     Block A2
         Block B1      Block A2     Block B1
                       Block B2     Block B2

Features:

Files split into large blocks (128 MB default).
Each block replicated (default 3x).
NameNode manages metadata; DataNodes store blocks.
Optimized for sequential reads of large files.
Write-once, read-many pattern.
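
The block splitting and replication described above can be sketched as a small planning function. This is only an illustration: `plan_blocks` is a hypothetical helper, and the round-robin placement is a simplifying assumption (real HDFS uses a rack-aware placement policy decided by the NameNode).

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size
REPLICATION = 3                 # default replication factor


def plan_blocks(file_size: int, datanodes: list[str]) -> list[dict]:
    """Return one entry per block with its length and replica locations."""
    if len(datanodes) < REPLICATION:
        raise ValueError("need at least REPLICATION DataNodes")
    n_blocks = math.ceil(file_size / BLOCK_SIZE)
    plan = []
    for i in range(n_blocks):
        # Every block is full-size except possibly the last one.
        length = min(BLOCK_SIZE, file_size - i * BLOCK_SIZE)
        # Assumed policy: pick REPLICATION consecutive nodes, round-robin.
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(REPLICATION)]
        plan.append({"block": i, "length": length, "replicas": replicas})
    return plan
```

For a 300 MB file this yields three blocks (128 MB, 128 MB, 44 MB), each stored on three distinct DataNodes; the mapping itself is the kind of metadata the NameNode keeps.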

Google File System (GFS) / Colossus

Google's distributed file system that inspired HDFS:

Single master (metadata) + chunk servers.
64 MB chunks with 3x replication.
Designed for large sequential reads/writes.
Colossus is the successor to GFS.
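
With fixed-size chunks, turning a byte offset into a chunk lookup is plain integer arithmetic. The sketch below shows that translation; `locate` and `chunks_for_range` are names made up for this example, and only the 64 MB chunk size comes from the text.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks


def locate(offset: int) -> tuple[int, int]:
    """Map a byte offset in a file to (chunk index, offset inside that chunk)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, CHUNK_SIZE)


def chunks_for_range(offset: int, length: int) -> range:
    """Return the chunk indexes touched by reading `length` bytes at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)
```

A client would then ask the master which chunk servers hold those chunk indexes and read the data from the chunk servers directly.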

Database Storage Engines

B-Tree (Traditional)

Used by most SQL databases (PostgreSQL, MySQL InnoDB).

                    [10 | 20 | 30]
                   /    |     |    \
           [1-9] [11-19] [21-29] [31-40]

Balanced tree structure.
O(log n) reads and writes.
Good for read-heavy workloads.
In-place updates (modify data on disk).
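
A lookup in the tree drawn above can be sketched as follows. This is a search-only toy under stated assumptions: `Node` and `search` are illustrative names, the leaves simply hold the key ranges from the diagram, and insertion, node splitting, and on-disk pages are left out.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list[int]
    children: list["Node"] = field(default_factory=list)  # empty for a leaf


def search(node: Node, key: int) -> bool:
    """Descend from the root, choosing one child per level."""
    while True:
        # Binary search inside the node for the first key >= target.
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False  # reached a leaf without finding the key
        # child[i] covers the keys between keys[i-1] and keys[i]
        node = node.children[i]


# The tree from the diagram: one root with four leaf children.
root = Node(
    keys=[10, 20, 30],
    children=[
        Node(list(range(1, 10))),
        Node(list(range(11, 20))),
        Node(list(range(21, 30))),
        Node(list(range(31, 41))),
    ],
)
```

Each level costs one binary search, and the number of levels grows logarithmically with the number of keys, which is where the logarithmic read cost comes from.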

LSM Tree (Log-Structured Merge Tree)

Used by many NoSQL databases (Cassandra, RocksDB, LevelDB, HBase).

Write → MemTable (in-memory sorted buffer)
          ↓ (when full)
       SSTable L0 (on disk, immutable)
          ↓ (compaction)
       SSTable L1
          ↓ (compaction)
       SSTable L2

Writes go to in-memory buffer first → much faster writes.
Immutable sorted files on disk (SSTables).
Background compaction merges and sorts files.
Writes are cheap (amortized) because they become sequential appends; reads may need to check the MemTable and multiple SSTable levels.
Good for write-heavy workloads.
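
The write path, read path, and compaction can be sketched in memory. `ToyLSM` is a hypothetical class written for this example, not the design of any listed database: it keeps a single list of SSTables instead of numbered levels, scans tables linearly, and omits the Bloom filters and indexes that real engines rely on.

```python
class ToyLSM:
    """Toy LSM tree: a dict as MemTable, sorted tuples as immutable SSTables."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: dict = {}
        self.sstables: list[tuple] = []  # newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value) -> None:
        # Writes only touch the in-memory buffer.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Freeze the MemTable into a sorted, immutable run.
        self.sstables.insert(0, tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        # Read path: MemTable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:
            for k, v in table:  # linear scan keeps the sketch short
                if k == key:
                    return v
        return None

    def compact(self) -> None:
        # Merge all runs into one; newer values overwrite older ones.
        merged: dict = {}
        for table in reversed(self.sstables):  # oldest first
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []
```

Before compaction a key may exist in several SSTables at once, which is the source of both the extra read work and the extra space usage.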

B-Tree vs LSM Tree

| Aspect | B-Tree | LSM Tree |
|--------|--------|----------|
| Read performance | Excellent | Good (may check multiple levels) |
| Write performance | Good | Excellent |
| Space amplification | Low | Higher (multiple copies during compaction) |
| Write amplification | Higher (in-place updates) | Lower (sequential writes) |
| Use case | Read-heavy OLTP | Write-heavy workloads |
| Examples | PostgreSQL, MySQL | Cassandra, RocksDB, LevelDB |

Write-Ahead Log (WAL)

A durability mechanism: every change is first written to an append-only log before being applied to the actual data.

1. Client sends WRITE(x=5)
2. Server writes to WAL: "SET x=5" → fsync to disk
3. Server ACKs client
4. Server applies change to data structures (eventually)

On crash:
- Replay WAL entries to recover changes that were acknowledged but not yet applied to the data files

Used by virtually all databases: PostgreSQL (WAL), MySQL (redo log), Cassandra (commit log).
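
The log-then-apply sequence and crash recovery can be sketched with a small key-value store. `ToyWAL` and `demo` are hypothetical names, the JSON-lines log format is an assumption chosen for readability, and the sketch applies each change to memory right after the fsync rather than lazily; real databases use binary records with checksums.

```python
import json
import os
import tempfile


class ToyWAL:
    """Key-value store that logs every change before applying it."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.data: dict = {}
        self._replay()  # recovery: rebuild state from the log
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self) -> None:
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final record from a crash mid-write
                self.data[entry["key"]] = entry["value"]

    def set(self, key: str, value) -> None:
        # 1. Append the change to the log and force it to disk.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        # 2. Only now is it safe to acknowledge and apply the change.
        self.data[key] = value

    def close(self) -> None:
        self._log.close()


def demo() -> int:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "wal.log")
        store = ToyWAL(path)
        store.set("x", 5)
        store.close()  # simulate a crash: the in-memory dict is discarded

        recovered = ToyWAL(path)  # a fresh process replays the log
        value = recovered.data["x"]
        recovered.close()
        return value
```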

Data Serialization Formats

| Format | Type | Human Readable | Size | Speed | Schema |
|--------|------|----------------|------|-------|--------|
| JSON | Text | Yes | Large | Slow | No |
| XML | Text | Yes | Largest | Slowest | Optional (XSD) |
| Protocol Buffers | Binary | No | Small | Very fast | Required (.proto) |
| Avro | Binary | No | Small | Fast | Required (JSON schema) |
| Thrift | Binary | No | Small | Fast | Required (.thrift) |
| MessagePack | Binary | No | Small | Fast | No |
| Parquet | Binary (columnar) | No | Smallest | Fast reads | Embedded |
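
To see why schema-based binary formats tend to be smaller than text, the sketch below encodes the same record as JSON and as a fixed binary layout. Python's standard `struct` module is used here as a stand-in for a binary format; it is not Protocol Buffers, Avro, or Thrift, and the record fields and the `SCHEMA` layout are assumptions made for the example.

```python
import json
import struct

record = {"user_id": 123, "score": 98.5, "active": True}

# Text encoding: field names travel inside every record.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding with an out-of-band schema. "<Id?" means little-endian,
# unsigned 32-bit int, 64-bit float, bool (4 + 8 + 1 = 13 bytes, no padding).
SCHEMA = "<Id?"
as_binary = struct.pack(
    SCHEMA, record["user_id"], record["score"], record["active"]
)


def decode_binary(payload: bytes) -> dict:
    """Decoding only works if the reader knows the same schema."""
    user_id, score, active = struct.unpack(SCHEMA, payload)
    return {"user_id": user_id, "score": score, "active": active}
```

The binary payload carries no field names, so it is smaller but unreadable without the schema, which is the trade-off the "Schema" column in the table captures.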

Summary

| Concept | Key Point |
|---------|-----------|
| Block storage | Raw blocks, highest performance; for databases and VMs |
| File storage | Hierarchical files, shareable; for shared content |
| Object storage | HTTP-accessible, unlimited scale; for static assets and backups |
| B-Tree | Balanced reads/writes; traditional SQL databases |
| LSM Tree | Optimized for writes; NoSQL and write-heavy workloads |
| WAL | Durability guarantee; log before apply |

Rule of thumb: Use block storage for databases. Use object storage for static assets and backups. Use file storage when applications need a shared file system.