11. Storage Systems
Understanding different storage types and their trade-offs is critical for system design. The choice of storage affects performance, durability, cost, and scalability.
Storage Hierarchy
Speed Capacity Cost/GB
CPU Registers ████████████ tiny $$$$$$$
L1 Cache ███████████ KB $$$$$$
L2 Cache ██████████ MB $$$$$
L3 Cache █████████ MB $$$$
RAM ███████ GB $$$
SSD █████ TB $$
HDD ███ TB $
Network Storage █ PB ¢
Tape/Glacier █ PB fraction of ¢
Types of Storage
1. Block Storage
Data is stored in fixed-size blocks (typically 512 B to 4 KB), and each block has its own address. The storage layer itself provides no file system; the OS or application decides how blocks are organized.
┌────┐┌────┐┌────┐┌────┐┌────┐
│BLK1││BLK2││BLK3││BLK4││BLK5│ ← Raw blocks on disk
└────┘└────┘└────┘└────┘└────┘
Characteristics:
- Lowest level of storage abstraction.
- High performance, low latency.
- Each block is independently addressed.
- No metadata, no hierarchy — just raw blocks.
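To make block addressing concrete, here is a minimal sketch of a block device simulated with an in-memory buffer. The `BlockDevice` class, its method names, and the 4 KB block size are assumptions chosen for illustration, not a real driver API. The point it shows is that a block number maps directly to a byte offset, and that I/O always moves whole blocks.

```python
import io

BLOCK_SIZE = 4096  # bytes; assumed block size for this sketch


class BlockDevice:
    """Toy block device backed by an in-memory buffer (hypothetical helper)."""

    def __init__(self, num_blocks: int, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        self.num_blocks = num_blocks
        # A zero-filled buffer stands in for the raw disk.
        self._disk = io.BytesIO(bytes(num_blocks * block_size))

    def _offset(self, block_no: int) -> int:
        """Translate a block address into a byte offset."""
        if not 0 <= block_no < self.num_blocks:
            raise IndexError(f"block {block_no} out of range")
        return block_no * self.block_size

    def write_block(self, block_no: int, data: bytes) -> None:
        if len(data) > self.block_size:
            raise ValueError("data larger than one block")
        # Pad to a full block: the device only ever moves whole blocks.
        padded = data.ljust(self.block_size, b"\x00")
        self._disk.seek(self._offset(block_no))
        self._disk.write(padded)

    def read_block(self, block_no: int) -> bytes:
        self._disk.seek(self._offset(block_no))
        return self._disk.read(self.block_size)


dev = BlockDevice(num_blocks=5)
dev.write_block(3, b"hello")
block = dev.read_block(3)
```

Note that the device knows nothing about what the bytes mean. Anything resembling a file name or a directory has to be built on top by a file system or a database.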
Use cases:
- Operating system boot volumes.
- Databases (need direct disk access for performance).
- Virtual machine disks.
Cloud examples:
| Service | Provider |
|---------|----------|
| EBS (Elastic Block Store) | AWS |
| Persistent Disk | Google Cloud |
| Azure Managed Disks | Azure |
EBS volume types (AWS):
| Type | IOPS | Throughput | Use Case |
|------|------|-----------|----------|
| SSD (gp3) | 3,000-16,000 | 125-1,000 MB/s | General purpose |
| Provisioned IOPS SSD (io2) | Up to 256,000 | Up to 4,000 MB/s | High-performance databases |
| HDD (st1) | 500 | 500 MB/s | Big data, sequential reads |
| Cold HDD (sc1) | 250 | 250 MB/s | Infrequent access, archival |
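The IOPS and throughput columns interact: achievable throughput is bounded both by IOPS multiplied by the I/O size and by the volume's throughput limit. The helper below is a rough sketch of that arithmetic. The function name and the example I/O sizes are assumptions, and it treats MB and MiB loosely for simplicity.

```python
def effective_throughput_mb_s(iops: int, io_size_kb: int, cap_mb_s: float) -> float:
    """Throughput is limited by IOPS x I/O size and by the volume's throughput cap."""
    return min(iops * io_size_kb / 1024, cap_mb_s)


# Baseline general-purpose SSD figures from the table: 3,000 IOPS, 125 MB/s.
small_io = effective_throughput_mb_s(iops=3000, io_size_kb=16, cap_mb_s=125)
large_io = effective_throughput_mb_s(iops=3000, io_size_kb=256, cap_mb_s=125)
```

With 16 KB requests the volume is IOPS-bound (about 47 MB/s), while with 256 KB requests it hits the throughput cap first. This is why small random I/O and large sequential I/O call for different volume types.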
2. File Storage
Data is stored as files organized in a hierarchical directory structure. Accessed via file paths.
/
├── home/
│ └── user/
│ ├── documents/
│ │ └── report.pdf
│ └── photos/
│ └── vacation.jpg
└── var/
└── log/
└── app.log
Characteristics:
- Familiar hierarchy (directories and files).
- Supports file-level operations (read, write, lock, permissions).
- Can be shared across multiple servers (network file systems).
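As a small illustration of path-based access, the sketch below creates a nested directory tree in a temporary directory, writes a file, sets its permissions, and reads it back. The file names are made up for the example, and the permission bits behave as shown on POSIX systems only.

```python
import os
import stat
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # The file is addressed by its position in the hierarchy.
    report = Path(root) / "home" / "user" / "documents" / "report.txt"
    report.parent.mkdir(parents=True)

    report.write_text("quarterly numbers\n")
    os.chmod(report, 0o640)  # owner read/write, group read (POSIX semantics)

    content = report.read_text()
    mode = stat.S_IMODE(report.stat().st_mode)
    relative = report.relative_to(root).as_posix()
```

The same calls work unchanged when the directory is a network mount such as NFS, which is what makes file storage attractive for legacy applications.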
Protocols:
| Protocol | Description |
|----------|-------------|
| NFS (Network File System) | UNIX/Linux standard for network file sharing |
| SMB/CIFS | Windows standard for file sharing |
| HDFS | Hadoop Distributed File System for big data |
Cloud examples:
| Service | Provider |
|---------|----------|
| EFS (Elastic File System) | AWS |
| Filestore | Google Cloud |
| Azure Files | Azure |
Use cases:
- Shared file systems across multiple servers.
- Content management systems.
- Machine learning training data.
- Legacy applications needing a file system interface.
3. Object Storage
Data is stored as objects in a flat address space. Each object contains data, metadata, and a unique identifier.
┌──────────────────────────────────────┐
│ Bucket: "my-app-images" │
│ │
│ ┌──────────────────────────────────┐│
│ │ Object: "users/123/avatar.jpg" ││
│ │ Data: [binary image data] ││
│ │ Metadata: { ││
│ │ content-type: "image/jpeg", ││
│ │ size: 245760, ││
│ │ created: "2025-01-15", ││
│ │ custom-tags: {"user": "123"} ││
│ │ } ││
│ └──────────────────────────────────┘│
│ │
│ ┌──────────────────────────────────┐│
│ │ Object: "logs/2025/01/app.log" ││
│ │ Data: [log text data] ││
│ │ Metadata: { ... } ││
│ └──────────────────────────────────┘│
└──────────────────────────────────────┘
Characteristics:
- Flat namespace (no directories — "/" in key name is just convention).
- Accessed via HTTP/REST API (PUT, GET, DELETE).
- Rich metadata per object.
- Virtually unlimited storage.
- Consistency varies by service: historically eventual, but several services, including S3, now offer strong read-after-write consistency.
- Not suitable for frequent small modifications (objects are immutable; must replace whole object).
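The characteristics above can be sketched with a toy in-memory bucket. This is not any provider's SDK: the `Bucket` and `StoredObject` names and their methods are hypothetical, chosen to mirror the PUT/GET/DELETE model. It illustrates the flat namespace, per-object metadata, prefix listing as a stand-in for directories, and whole-object replacement.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class Bucket:
    """Toy in-memory bucket; keys are opaque strings in one flat dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: dict[str, StoredObject] = {}

    def put(self, key: str, data: bytes, **metadata: str) -> None:
        # A PUT replaces the whole object; there is no partial update.
        self._objects[key] = StoredObject(data, dict(metadata))

    def get(self, key: str) -> StoredObject:
        return self._objects[key]

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)

    def list_keys(self, prefix: str = "") -> list[str]:
        # "Directories" are emulated by filtering on a key prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = Bucket("my-app-images")
bucket.put("users/123/avatar.jpg", b"\xff\xd8", content_type="image/jpeg", user="123")
bucket.put("logs/2025/01/app.log", b"started\n")
# Appending a line means uploading the full new contents again.
bucket.put("logs/2025/01/app.log", b"started\nstopped\n")
```

The second `put` on the log key shows why frequent small modifications are a poor fit: every change rewrites the entire object.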
Cloud examples:
| Service | Provider |
|---------|----------|
| S3 (Simple Storage Service) | AWS |
| Cloud Storage | Google Cloud |
| Azure Blob Storage | Azure |
S3 Storage Classes:
| Class | Availability | Use Case | Cost |
|-------|-------------|----------|------|
| Standard | 99.99% | Frequently accessed | $$$ |
| Standard-IA | 99.9% | Infrequent access | $$ |
| Glacier Instant | 99.9% | Archive with millisecond retrieval | $ |
| Glacier Flexible | 99.99% | Archive (minutes to hours retrieval) | ¢ |
| Glacier Deep Archive | 99.99% | Long-term archive (12-48 hour retrieval) | fraction of ¢ |
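An availability percentage is easier to reason about as a downtime budget. The short calculation below converts the figures from the table into hours per year. The function name is an assumption, and the result is only the arithmetic implied by the percentage, not a statement about any provider's actual outages.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; ignores leap years


def max_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service may be unavailable at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


downtime_9999 = max_downtime_hours(99.99)  # roughly 0.88 hours, about 53 minutes
downtime_999 = max_downtime_hours(99.9)    # roughly 8.76 hours
```

Dropping from 99.99% to 99.9% therefore allows ten times as much downtime, which is part of what the cheaper classes trade away.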
Use cases:
- Static assets (images, videos, CSS, JS).
- Backups and archives.
- Data lake storage for analytics.
- Log storage.
- Content distribution (origin for CDN).
Storage Comparison
| Aspect | Block | File | Object |
|---|---|---|---|
| Access | Block address | File path | HTTP API (key) |
| Structure | Flat blocks | Hierarchical | Flat namespace |
| Metadata | None | Basic (permissions, dates) | Rich custom metadata |
| Performance | Highest | Moderate | Lower (HTTP overhead) |
| Scalability | Limited (single volume) | Moderate | Virtually unlimited |
| Modification | In-place | In-place | Replace entire object |
| Sharing | Typically one server per volume | Multi-server (NFS) | Universal (HTTP) |
| Cost | $$$ | $$ | $ |
| Best for | Databases, OS | Shared files, CMS | Static assets, backups |
Distributed File Systems
HDFS (Hadoop Distributed File System)
Designed for storing very large files across clusters of commodity hardware.
Client → NameNode (metadata: which blocks, which DataNodes)
↓
[DataNode 1] [DataNode 2] [DataNode 3]
Block A1 Block A1 Block A2
Block B1 Block A2 Block B1
Block B2 Block B2
Features:
- Files split into large blocks (128 MB default).
- Each block replicated (default 3x).
- NameNode manages metadata; DataNodes store blocks.
- Optimized for sequential reads of large files.
- Write-once, read-many pattern.
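The splitting and replication described above can be sketched as a placement plan. The function and node names below are hypothetical, and the round-robin placement is a simplifying assumption: real HDFS chooses replica locations with a rack-aware policy.

```python
import math

HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # default block size from the list above
REPLICATION = 3                      # default replication factor


def plan_blocks(file_size: int, datanodes: list[str]) -> list[list[str]]:
    """Return, for each block of the file, the DataNodes holding a replica.

    Placement is plain round-robin, an assumption made for illustration.
    """
    if len(datanodes) < REPLICATION:
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size / HDFS_BLOCK_SIZE)
    plan = []
    for block_idx in range(num_blocks):
        replicas = [
            datanodes[(block_idx + r) % len(datanodes)]
            for r in range(REPLICATION)
        ]
        plan.append(replicas)
    return plan


nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = plan_blocks(300 * 1024 * 1024, nodes)  # a 300 MB file needs 3 blocks
```

A mapping like `plan` is the kind of metadata the NameNode keeps; the DataNodes hold only the block contents.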
Google File System (GFS) / Colossus
Google's distributed file system that inspired HDFS:
- Single master (metadata) + chunk servers.
- 64 MB chunks with 3x replication.
- Designed for large sequential reads/writes.
- Colossus is the successor to GFS.
Database Storage Engines
B-Tree (Traditional)
Used by most SQL databases (PostgreSQL, MySQL InnoDB).
[10 | 20 | 30]
/ | | \
[1-9] [11-19] [21-29] [31-40]
- Balanced tree structure.
- O(log n) reads and writes.
- Good for read-heavy workloads.
- In-place updates (modify data on disk).
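A lookup in the tree drawn above can be sketched as follows. The `Node` class and `search` function are illustrative names, and the sketch covers search only, with no insertion or node splitting. At each node a binary search either finds the key or selects the child whose range contains it, so the number of nodes visited grows with the height of the tree.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list[int]
    children: list["Node"] = field(default_factory=list)  # empty for leaves


def search(node: Node, key: int) -> tuple[bool, int]:
    """Return (found, nodes_visited) for a lookup of key."""
    visited = 0
    while True:
        visited += 1
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        if not node.children:
            return False, visited
        # Child i holds the keys between keys[i-1] and keys[i].
        node = node.children[i]


# The tree from the diagram: one root with three keys and four leaves.
root = Node(
    keys=[10, 20, 30],
    children=[
        Node(list(range(1, 10))),
        Node(list(range(11, 20))),
        Node(list(range(21, 30))),
        Node(list(range(31, 41))),
    ],
)

hit_leaf = search(root, 25)
hit_root = search(root, 20)
miss = search(root, 41)
```

In a database each node is typically one disk page, so the visit count approximates the number of page reads per lookup.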
LSM Tree (Log-Structured Merge Tree)
Used by many NoSQL databases (Cassandra, RocksDB, LevelDB, HBase).
Write → MemTable (in-memory sorted buffer)
↓ (when full)
SSTable L0 (on disk, immutable)
↓ (compaction)
SSTable L1
↓ (compaction)
SSTable L2
- Writes go to in-memory buffer first → much faster writes.
- Immutable sorted files on disk (SSTables).
- Background compaction merges and sorts files.
- Writes are cheap when amortized; reads may need to check multiple levels.
- Good for write-heavy workloads.
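The write and read paths can be sketched with a toy LSM tree. The `ToyLSM` class is hypothetical and heavily simplified: it keeps everything in memory, uses a single list of sorted runs instead of real levels, and has no deletes or tombstones. It does show the three moving parts: a memtable, immutable sorted runs, and compaction.

```python
from bisect import bisect_left


class ToyLSM:
    """Toy LSM tree: one memtable plus a list of immutable sorted runs."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable_limit = memtable_limit
        self.memtable: dict[str, str] = {}
        self.sstables: list[list[tuple[str, str]]] = []  # oldest first

    def put(self, key: str, value: str) -> None:
        # Writes only touch the in-memory buffer.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Freeze the memtable into a sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        # Check runs from newest to oldest; the newest value wins.
        for table in reversed(self.sstables):
            i = bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self) -> None:
        # Merge all runs into one, letting newer values overwrite older ones.
        merged: dict[str, str] = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []


db = ToyLSM(memtable_limit=2)
db.put("a", "1")
db.put("b", "2")   # memtable full: flushed as the first run
db.put("a", "3")
db.put("c", "4")   # flushed as the second run; "a" now exists in both
db.put("d", "5")   # stays in the memtable
runs_before = len(db.sstables)
value_before = db.get("a")
db.compact()
runs_after = len(db.sstables)
```

Before compaction a read for `"a"` may have to look through several runs; afterwards there is a single run and the stale value is gone.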
B-Tree vs LSM Tree
| Aspect | B-Tree | LSM Tree |
|---|---|---|
| Read performance | Excellent | Good (may check multiple levels) |
| Write performance | Good | Excellent |
| Space amplification | Low | Higher (multiple copies during compaction) |
| Write amplification | Higher for small random updates (whole pages rewritten in place) | Often lower (sequential writes), though compaction rewrites data repeatedly |
| Use case | Read-heavy OLTP | Write-heavy workloads |
| Examples | PostgreSQL, MySQL | Cassandra, RocksDB, LevelDB |
Write-Ahead Log (WAL)
A durability mechanism: every change is first written to an append-only log before being applied to the actual data.
1. Client sends WRITE(x=5)
2. Server writes to WAL: "SET x=5" → fsync to disk
3. Server ACKs client
4. Server applies change to data structures (eventually)
On crash:
- Replay WAL entries to recover changes that were logged and acknowledged but not yet applied to the data files
Used by virtually all databases: PostgreSQL (WAL), MySQL (redo log), Cassandra (commit log).
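The log-then-apply sequence and crash recovery can be sketched with a tiny key-value store. The `KVStore` class, the JSON-lines log format, and the file name are assumptions for illustration. Unlike a real database, this sketch applies the change immediately after the fsync and never truncates the log.

```python
import json
import os
import tempfile


class KVStore:
    """Toy key-value store that logs every write before applying it."""

    def __init__(self, wal_path: str) -> None:
        self.wal_path = wal_path
        self.data: dict = {}
        self._replay()  # rebuild state from the log on startup
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self) -> None:
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def set(self, key: str, value) -> None:
        # 1. Append the change to the log and force it to disk.
        self._wal.write(json.dumps({"key": key, "value": value}) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        # 2. Only now is it safe to acknowledge and apply the change.
        self.data[key] = value

    def close(self) -> None:
        self._wal.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.wal")
    store = KVStore(path)
    store.set("x", 5)
    store.set("y", 7)
    store.close()  # stands in for a crash: the in-memory dict is discarded

    recovered = KVStore(path)  # replays the log
    recovered_data = dict(recovered.data)
    recovered.close()
```

The `os.fsync` call is the important line: without it the write may sit in an OS buffer and be lost on power failure even though the client was told it succeeded.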
Data Serialization Formats
| Format | Type | Human Readable | Size | Speed | Schema |
|---|---|---|---|---|---|
| JSON | Text | Yes | Large | Slow | No |
| XML | Text | Yes | Largest | Slowest | Optional (XSD) |
| Protocol Buffers | Binary | No | Small | Very fast | Required (.proto) |
| Avro | Binary | No | Small | Fast | Required (schema written in JSON) |
| Thrift | Binary | No | Small | Fast | Required (.thrift) |
| MessagePack | Binary | No | Small | Fast | No |
| Parquet | Binary (columnar) | No | Smallest | Fast reads | Embedded |
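The size gap between text and binary formats is easy to see with the standard library. The sketch below encodes the same record as JSON and as a fixed binary layout using Python's `struct` module. This stands in for schema-based binary formats in general and is not Protocol Buffers, Avro, or Thrift; the record and its field layout are made up for the example.

```python
import json
import struct

record = {"user_id": 123, "score": 98.5, "active": True}

# Text encoding: field names travel with every record.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: the layout is agreed on separately (the "schema"):
# little-endian uint32 user_id, float64 score, bool active.
LAYOUT = struct.Struct("<Id?")
as_binary = LAYOUT.pack(record["user_id"], record["score"], record["active"])

# Decoding requires the same layout; the bytes are not self-describing.
user_id, score, active = LAYOUT.unpack(as_binary)
```

The binary form is 13 bytes because it carries no field names or punctuation. The cost is that a reader without the schema cannot interpret it, which is the trade-off the "Human Readable" and "Schema" columns capture.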
Summary
| Concept | Key Point |
|---|---|
| Block storage | Raw blocks, highest performance — for databases and VMs |
| File storage | Hierarchical files, shareable — for shared content |
| Object storage | HTTP-accessible, unlimited scale — for static assets and backups |
| B-Tree | Balanced reads/writes — traditional SQL databases |
| LSM Tree | Optimized for writes — NoSQL and write-heavy workloads |
| WAL | Durability guarantee — log before apply |
Rule of thumb: Use block storage for databases. Use object storage for static assets and backups. Use file storage when applications need a shared file system.