11. Storage Systems

Understanding different storage types and their trade-offs is critical for system design. The choice of storage affects performance, durability, cost, and scalability.


Storage Hierarchy

                   Speed         Capacity   Cost/GB
  CPU Registers    ████████████  tiny       $$$$$$$
  L1 Cache         ███████████   KB         $$$$$$
  L2 Cache         ██████████    MB         $$$$$
  L3 Cache         █████████     MB         $$$$
  RAM              ███████       GB         $$$
  SSD              █████         TB         $$
  HDD              ███           TB         $
  Network Storage  █             PB         ¢
  Tape/Glacier     █             PB         fraction of ¢

Types of Storage

1. Block Storage

Data is stored in fixed-size blocks (typically 512 B to 4 KB). Each block has an address. No file system — the OS or application manages how blocks are organized.

┌────┐┌────┐┌────┐┌────┐┌────┐
│BLK1││BLK2││BLK3││BLK4││BLK5│  ← Raw blocks on disk
└────┘└────┘└────┘└────┘└────┘

Characteristics:

  • Lowest level of storage abstraction.
  • High performance, low latency.
  • Each block is independently addressed.
  • No metadata, no hierarchy — just raw blocks.
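
To make block addressing concrete, here is a minimal sketch of a toy block device backed by an ordinary file. The `BlockDevice` class, the 4 KB block size, and the `disk.img` file name are assumptions for illustration, not a real driver interface. The point is that every read and write is a whole block located by `index * BLOCK_SIZE`.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # bytes per block; 4 KB is an assumed, common choice


class BlockDevice:
    """Toy block device backed by a regular file (illustration only)."""

    def __init__(self, path: str, num_blocks: int) -> None:
        self.num_blocks = num_blocks
        # Pre-allocate the "disk" as zero-filled blocks.
        with open(path, "wb") as f:
            f.truncate(num_blocks * BLOCK_SIZE)
        self._f = open(path, "r+b")

    def write_block(self, index: int, data: bytes) -> None:
        if not 0 <= index < self.num_blocks:
            raise IndexError("block index out of range")
        if len(data) > BLOCK_SIZE:
            raise ValueError("data larger than one block")
        # Pad to a full block: the device only deals in whole blocks.
        self._f.seek(index * BLOCK_SIZE)
        self._f.write(data.ljust(BLOCK_SIZE, b"\x00"))
        self._f.flush()

    def read_block(self, index: int) -> bytes:
        if not 0 <= index < self.num_blocks:
            raise IndexError("block index out of range")
        self._f.seek(index * BLOCK_SIZE)
        return self._f.read(BLOCK_SIZE)

    def close(self) -> None:
        self._f.close()


tmp_dir = tempfile.mkdtemp()
dev = BlockDevice(os.path.join(tmp_dir, "disk.img"), num_blocks=8)
dev.write_block(3, b"hello")    # write to block address 3
block = dev.read_block(3)       # padded back out to a full block
untouched = dev.read_block(0)   # never written, still all zeros
dev.close()
```

Nothing in the device knows what "hello" means or which file it belongs to. That bookkeeping is the job of whatever sits on top, such as a file system or a database.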

Use cases:

  • Operating system boot volumes.
  • Databases (need direct disk access for performance).
  • Virtual machine disks.

Cloud examples:
| Service | Provider |
|---------|----------|
| EBS (Elastic Block Store) | AWS |
| Persistent Disk | Google Cloud |
| Azure Managed Disks | Azure |

Storage Types:
| Type | IOPS | Throughput | Use Case |
|------|------|-----------|----------|
| SSD (gp3) | 3,000-16,000 | 125-1,000 MB/s | General purpose |
| Provisioned IOPS SSD (io2) | Up to 256,000 | Up to 4,000 MB/s | High-performance databases |
| HDD (st1) | 500 | 500 MB/s | Big data, sequential reads |
| Cold HDD (sc1) | 250 | 250 MB/s | Infrequent access, archival |


2. File Storage

Data is stored as files organized in a hierarchical directory structure. Accessed via file paths.

/
├── home/
│   └── user/
│       ├── documents/
│       │   └── report.pdf
│       └── photos/
│           └── vacation.jpg
└── var/
    └── log/
        └── app.log

Characteristics:

  • Familiar hierarchy (directories and files).
  • Supports file-level operations (read, write, lock, permissions).
  • Can be shared across multiple servers (network file systems).
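
The file-level operations listed above can be sketched with Python's standard library. The directory layout mirrors the tree shown earlier, and the file name and contents are made up for the example. It runs against a local temporary directory; a network file system mounted at a path would expose the same interface.

```python
import os
import stat
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# Build a small hierarchy similar to the tree shown earlier.
report = root / "home" / "user" / "documents" / "report.txt"
report.parent.mkdir(parents=True)
report.write_text("quarterly numbers")

# File-level operations: read by path, inspect metadata, change permissions.
content = report.read_text()
size = report.stat().st_size
os.chmod(report, stat.S_IRUSR | stat.S_IWUSR)  # owner read/write only
mode = stat.S_IMODE(report.stat().st_mode)

# Walk the hierarchy and collect file paths relative to the root.
files = sorted(
    p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
)
```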

Protocols:
| Protocol | Description |
|----------|-------------|
| NFS (Network File System) | UNIX/Linux standard for network file sharing |
| SMB/CIFS | Windows standard for file sharing |
| HDFS | Hadoop Distributed File System for big data |

Cloud examples:
| Service | Provider |
|---------|----------|
| EFS (Elastic File System) | AWS |
| Filestore | Google Cloud |
| Azure Files | Azure |

Use cases:

  • Shared file systems across multiple servers.
  • Content management systems.
  • Machine learning training data.
  • Legacy applications needing a file system interface.

3. Object Storage

Data is stored as objects in a flat address space. Each object contains data, metadata, and a unique identifier.

┌──────────────────────────────────────┐
│ Bucket: "my-app-images"             │
│                                      │
│ ┌──────────────────────────────────┐│
│ │ Object: "users/123/avatar.jpg"  ││
│ │ Data: [binary image data]        ││
│ │ Metadata: {                      ││
│ │   content-type: "image/jpeg",    ││
│ │   size: 245760,                  ││
│ │   created: "2025-01-15",         ││
│ │   custom-tags: {"user": "123"}   ││
│ │ }                                ││
│ └──────────────────────────────────┘│
│                                      │
│ ┌──────────────────────────────────┐│
│ │ Object: "logs/2025/01/app.log"  ││
│ │ Data: [log text data]            ││
│ │ Metadata: { ... }               ││
│ └──────────────────────────────────┘│
└──────────────────────────────────────┘

Characteristics:

  • Flat namespace (no directories — "/" in key name is just convention).
  • Accessed via HTTP/REST API (PUT, GET, DELETE).
  • Rich metadata per object.
  • Virtually unlimited storage.
  • Consistency varies by service: object stores were historically eventually consistent, but S3 has provided strong read-after-write consistency since December 2020.
  • Not suitable for frequent small modifications (objects are immutable; must replace whole object).
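
The flat namespace and replace-only semantics can be illustrated with a toy in-memory bucket. The `Bucket` and `StoredObject` classes and their method names are hypothetical, chosen for this sketch; they are not any provider's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class Bucket:
    """Toy in-memory bucket (hypothetical; not any provider's API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: dict[str, StoredObject] = {}  # flat key -> object

    def put(self, key: str, data: bytes, **metadata: str) -> None:
        # A PUT always replaces the whole object; there is no partial update.
        self._objects[key] = StoredObject(data, {**metadata, "size": len(data)})

    def get(self, key: str) -> StoredObject:
        return self._objects[key]

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)

    def list_keys(self, prefix: str = "") -> list[str]:
        # "Directories" are just a prefix filter over flat keys.
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = Bucket("my-app-images")
bucket.put("users/123/avatar.jpg", b"\xff\xd8 fake jpeg bytes", content_type="image/jpeg")
bucket.put("logs/2025/01/app.log", b"started\n", content_type="text/plain")
# Appending a line means uploading the full new object under the same key.
bucket.put("logs/2025/01/app.log", b"started\nstopped\n", content_type="text/plain")

user_keys = bucket.list_keys(prefix="users/")
log_obj = bucket.get("logs/2025/01/app.log")
```

Listing by the prefix `users/` behaves like opening a directory, but the store only ever sees whole strings as keys.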

Cloud examples:
| Service | Provider |
|---------|----------|
| S3 (Simple Storage Service) | AWS |
| Cloud Storage | Google Cloud |
| Azure Blob Storage | Azure |

S3 Storage Classes:
| Class | Availability | Use Case | Cost |
|-------|-------------|----------|------|
| Standard | 99.99% | Frequently accessed | $$$ |
| Standard-IA | 99.9% | Infrequent access | $$ |
| Glacier Instant | 99.9% | Archive with millisecond retrieval | $ |
| Glacier Flexible | 99.99% | Archive (minutes to hours retrieval) | ¢ |
| Glacier Deep Archive | 99.99% | Long-term archive (12-48 hour retrieval) | fraction of ¢ |

Use cases:

  • Static assets (images, videos, CSS, JS).
  • Backups and archives.
  • Data lake storage for analytics.
  • Log storage.
  • Content distribution (origin for CDN).

Storage Comparison

| Aspect | Block | File | Object |
|--------|-------|------|--------|
| Access | Block address | File path | HTTP API (key) |
| Structure | Flat blocks | Hierarchical | Flat namespace |
| Metadata | None | Basic (permissions, dates) | Rich custom metadata |
| Performance | Highest | Moderate | Lower (HTTP overhead) |
| Scalability | Limited (single volume) | Moderate | Virtually unlimited |
| Modification | In-place | In-place | Replace entire object |
| Sharing | Single attachment | Multi-server (NFS) | Universal (HTTP) |
| Cost | $$$ | $$ | $ |
| Best for | Databases, OS | Shared files, CMS | Static assets, backups |

Distributed File Systems

HDFS (Hadoop Distributed File System)

Designed for storing very large files across clusters of commodity hardware.

Client → NameNode (metadata: which blocks, which DataNodes)
                ↓
         [DataNode 1] [DataNode 2] [DataNode 3]
         Block A1      Block A1     Block A2
         Block B1      Block A2     Block B1
                       Block B2     Block B2

Features:

  • Files split into large blocks (128 MB default).
  • Each block replicated (default 3x).
  • NameNode manages metadata; DataNodes store blocks.
  • Optimized for sequential reads of large files.
  • Write-once, read-many pattern.
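
As a rough worked example of the block and replication arithmetic, the sketch below splits a file into 128 MB blocks and assigns three replicas per block. The round-robin placement and the `plan_blocks` function are assumptions made for illustration; real HDFS uses a rack-aware placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size
REPLICATION = 3                 # default replication factor


def plan_blocks(file_size: int, datanodes: list[str]) -> list[list[str]]:
    """Return, for each block, the DataNodes that hold a replica.

    Placement is simple round-robin (an assumption for illustration),
    which still guarantees that a block's replicas land on distinct nodes.
    """
    if len(datanodes) < REPLICATION:
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    placement = []
    for b in range(num_blocks):
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
        placement.append(replicas)
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]
one_gib = 1024 * 1024 * 1024
layout = plan_blocks(one_gib, nodes)

# Upper bound on raw storage used; the last block of a file may be partial.
raw_bytes = len(layout) * REPLICATION * BLOCK_SIZE
```

A 1 GiB file becomes 8 blocks and 24 block replicas, so it consumes about 3 GiB of raw cluster capacity. The mapping returned by `plan_blocks` is the kind of metadata the NameNode keeps.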

Google File System (GFS) / Colossus

Google's distributed file system that inspired HDFS:

  • Single master (metadata) + chunk servers.
  • 64 MB chunks with 3x replication.
  • Designed for large sequential reads/writes.
  • Colossus is the successor to GFS.

Database Storage Engines

B-Tree (Traditional)

Used by most SQL databases (PostgreSQL, MySQL InnoDB).

                    [10 | 20 | 30]
                   /    |     |    \
           [1-9] [11-19] [21-29] [31-40]

  • Balanced tree structure.
  • O(log n) read and write.
  • Good for read-heavy workloads.
  • In-place updates (modify data on disk).
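
A lookup in the tree drawn above can be sketched as follows. The `Node` class and `search` function are illustrative names, and the sketch covers search only (no insertion, splitting, or on-disk pages). Each step picks exactly one child, so the work is proportional to the height of the tree.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list[int]
    children: list["Node"] = field(default_factory=list)  # empty for leaves


def search(root: Node, key: int) -> tuple[bool, int]:
    """Return (found, nodes_visited) for a lookup starting at the root."""
    node, visited = root, 1
    while True:
        # Binary search inside the node for the first key >= target.
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        if not node.children:
            return False, visited
        # children[i] holds every key between keys[i-1] and keys[i].
        node = node.children[i]
        visited += 1


# The two-level tree from the diagram: separator keys 10, 20, 30 in the root.
root = Node(
    keys=[10, 20, 30],
    children=[
        Node(list(range(1, 10))),
        Node(list(range(11, 20))),
        Node(list(range(21, 30))),
        Node(list(range(31, 41))),
    ],
)

hit_leaf = search(root, 15)   # found in the second leaf
hit_root = search(root, 10)   # found without leaving the root
miss = search(root, 50)       # descends to the last leaf, not present
```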

LSM Tree (Log-Structured Merge Tree)

Used by many NoSQL databases (Cassandra, RocksDB, LevelDB, HBase).

Write → MemTable (in-memory sorted buffer)
          ↓ (when full)
       SSTable L0 (on disk, immutable)
          ↓ (compaction)
       SSTable L1
          ↓ (compaction)
       SSTable L2

  • Writes go to in-memory buffer first → much faster writes.
  • Immutable sorted files on disk (SSTables).
  • Background compaction merges and sorts files.
  • Writes are cheap (amortized); reads may need to check multiple levels.
  • Good for write-heavy workloads.
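
The write path and read path described above can be sketched with a toy LSM tree held entirely in memory. The `LSMTree` class, its tiny `memtable_limit`, and the single-step `compact` are simplifications assumed for illustration; deletes (tombstones), levels, and real on-disk files are left out.

```python
from bisect import bisect_left


class LSMTree:
    """Toy LSM tree: a dict MemTable plus immutable sorted runs (SSTables)."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: dict[str, int] = {}
        self.memtable_limit = memtable_limit
        # Newest first; each SSTable is a sorted list of (key, value) pairs.
        self.sstables: list[list[tuple[str, int]]] = []

    def put(self, key: str, value: int) -> None:
        self.memtable[key] = value          # writes only touch memory
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Freeze the MemTable into a sorted, immutable run.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:         # newest to oldest
            i = bisect_left(table, (key,))  # (key,) sorts before (key, value)
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self) -> None:
        # Merge all runs into one; newer values overwrite older ones.
        merged: dict[str, int] = {}
        for table in reversed(self.sstables):
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []


db = LSMTree(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # MemTable full: flushed as the first SSTable
db.put("a", 3)
db.put("c", 4)   # flushed as a second, newer SSTable
db.put("d", 5)   # stays in the MemTable

tables_before = len(db.sstables)
value_a = db.get("a")    # the newer SSTable shadows the older value
db.compact()
tables_after = len(db.sstables)
```

Before compaction a read for `"b"` has to look in the MemTable and both SSTables, which is the read cost that compaction reduces.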

B-Tree vs LSM Tree

| Aspect | B-Tree | LSM Tree |
|--------|--------|----------|
| Read performance | Excellent | Good (may check multiple levels) |
| Write performance | Good | Excellent |
| Space amplification | Low | Higher (multiple copies during compaction) |
| Write amplification | Higher (in-place updates) | Lower (sequential writes) |
| Use case | Read-heavy OLTP | Write-heavy workloads |
| Examples | PostgreSQL, MySQL | Cassandra, RocksDB, LevelDB |

Write-Ahead Log (WAL)

A durability mechanism: every change is first written to an append-only log before being applied to the actual data.

1. Client sends WRITE(x=5)
2. Server writes to WAL: "SET x=5" → fsync to disk
3. Server ACKs client
4. Server applies change to data structures (eventually)

On crash:
- Replay WAL entries to restore acknowledged changes that had not yet been applied to the data

Used by virtually all databases: PostgreSQL (WAL), MySQL (redo log), Cassandra (commit log).
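
The log-then-apply sequence can be sketched with a small key-value store. The `KVStore` class, the JSON-lines log format, and the `store.wal` file name are assumptions for this example and do not reflect any particular database's log format. Checkpointing, log truncation, and handling of a half-written final entry are omitted.

```python
import json
import os
import tempfile


class KVStore:
    """Toy key-value store that logs every change before applying it."""

    def __init__(self, wal_path: str) -> None:
        self.wal_path = wal_path
        self.data: dict[str, int] = {}
        self._replay()                       # recover state from the log
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self) -> None:
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def set(self, key: str, value: int) -> None:
        # 1. Append the change to the log.
        self._wal.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2. Force it to stable storage before acknowledging.
        self._wal.flush()
        os.fsync(self._wal.fileno())
        # 3. Only now apply it to the in-memory data structure.
        self.data[key] = value

    def close(self) -> None:
        self._wal.close()


wal_file = os.path.join(tempfile.mkdtemp(), "store.wal")

store = KVStore(wal_file)
store.set("x", 5)
store.set("y", 7)
store.set("x", 6)
store.close()   # stand-in for a crash: the in-memory dict is gone

recovered = KVStore(wal_file)   # a fresh process rebuilds state from the log
recovered_data = dict(recovered.data)
recovered.close()
```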


Data Serialization Formats

| Format | Type | Human Readable | Size | Speed | Schema |
|--------|------|----------------|------|-------|--------|
| JSON | Text | Yes | Large | Slow | No |
| XML | Text | Yes | Largest | Slowest | Optional (XSD) |
| Protocol Buffers | Binary | No | Small | Very fast | Required (.proto) |
| Avro | Binary | No | Small | Fast | Required (JSON schema) |
| Thrift | Binary | No | Small | Fast | Required (.thrift) |
| MessagePack | Binary | No | Small | Fast | No |
| Parquet | Binary (columnar) | No | Smallest | Fast reads | Embedded |
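
To give a feel for the size difference between text and schema-based binary encodings, the sketch below encodes the same record with `json` and with a fixed `struct` layout. The record and its fields are invented for the example, and `struct` stands in for a schema-driven format here; it is not Protocol Buffers, Avro, or Thrift.

```python
import json
import struct

record = {"user_id": 123, "score": 98.5, "active": True}

# Text encoding: field names travel with every record.
json_bytes = json.dumps(record).encode("utf-8")

# Binary encoding with a fixed layout agreed on out of band (the "schema"):
# little-endian unsigned 32-bit int, 64-bit float, 1-byte bool.
LAYOUT = struct.Struct("<Id?")
binary_bytes = LAYOUT.pack(record["user_id"], record["score"], record["active"])

# Decoding needs the same layout, because the bytes carry no field names.
user_id, score, active = LAYOUT.unpack(binary_bytes)
decoded = {"user_id": user_id, "score": score, "active": active}
```

The binary form is 13 bytes because the field names and punctuation live in the shared layout rather than in every message. The price is that reader and writer must agree on that layout.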

Summary

| Concept | Key Point |
|---------|-----------|
| Block storage | Raw blocks, highest performance; for databases and VMs |
| File storage | Hierarchical files, shareable; for shared content |
| Object storage | HTTP-accessible, unlimited scale; for static assets and backups |
| B-Tree | Balanced reads/writes; traditional SQL databases |
| LSM Tree | Optimized for writes; NoSQL and write-heavy workloads |
| WAL | Durability guarantee; log before apply |

Rule of thumb: Use block storage for databases. Use object storage for static assets and backups. Use file storage when applications need a shared file system.