1. Scalability

Scalability is the ability of a system to handle a growing amount of work by adding resources. A scalable system can maintain or improve its level of performance as demand increases.


Why Scalability Matters

  • User growth: Applications must handle increasing user traffic without degradation.
  • Data growth: As data accumulates, read/write performance must remain acceptable.
  • Business continuity: Revenue-generating systems cannot afford downtime under load.
  • Cost efficiency: Resources should scale proportionally — not wastefully — with demand.

Types of Scalability

1. Vertical Scaling (Scale Up)

Adding more power (CPU, RAM, disk, network) to an existing machine.

Before:  [Server: 4 CPU, 16 GB RAM]
After:   [Server: 32 CPU, 256 GB RAM]

Advantages:

  • Simple to implement — no code changes required.
  • No distributed systems complexity.
  • Easier data consistency (single machine).

Disadvantages:

  • Hardware ceiling: There is a physical limit to how much you can scale a single machine.
  • Single point of failure: If that machine goes down, everything goes down.
  • Expensive: High-end hardware costs grow non-linearly.
  • Downtime: Often requires stopping the machine to upgrade.

When to use:

  • Early-stage applications with predictable, moderate load.
  • Databases that are hard to distribute (e.g., traditional RDBMS).
  • Quick fixes when time is limited.

2. Horizontal Scaling (Scale Out)

Adding more machines to the pool of resources.

Before:  [Server A]

After:          [      Load Balancer      ]
                /       |        |        \
        [Server A] [Server B] [Server C] [Server D]

Advantages:

  • Near-infinite scalability: Add more commodity machines.
  • Fault tolerance: If one server fails, others take over.
  • Cost effective: Use cheaper commodity hardware.
  • No downtime: Add or remove nodes without stopping the system.

Disadvantages:

  • Increased system complexity (load balancing, data partitioning, etc.).
  • Data consistency challenges in distributed environments.
  • Network latency between nodes.
  • More complex deployment and monitoring.

When to use:

  • Large-scale web applications (Google, Netflix, Amazon).
  • Stateless services (e.g., REST APIs).
  • Systems with unpredictable or bursty load patterns.

Scalability Dimensions

Dimension                    Description                                        Example
---------                    -----------                                        -------
Load Scalability             Handle increasing request volume                   1K → 1M requests/sec
Geographic Scalability       Serve users across regions                         US, EU, Asia data centers
Administrative Scalability   Multiple organizations can manage the system       Multi-tenant SaaS
Functional Scalability       Add new features without impacting existing ones   Microservices architecture

Key Metrics for Scalability

Throughput

The number of requests a system can handle per unit time.

  • Measured in requests per second (RPS) or transactions per second (TPS).

Latency

The time it takes for a single request to be processed.

  • p50 (median): 50% of requests are faster than this.
  • p95: 95% of requests are faster (tail latency).
  • p99: 99% of requests are faster.
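As a rough illustration (not tied to any particular monitoring tool), these percentiles can be computed from a list of latency samples using the nearest-rank method:

```python
import math

def percentile(samples, p):
    """Return the value below which p percent of samples fall
    (nearest-rank method on a sorted copy)."""
    ordered = sorted(samples)
    k = max(0, math.ceil(len(ordered) * p / 100) - 1)
    return ordered[k]

# Hypothetical latency samples in milliseconds.
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 12, 900, 14]
p50 = percentile(latencies_ms, 50)  # typical request
p95 = percentile(latencies_ms, 95)  # tail request
p99 = percentile(latencies_ms, 99)  # extreme tail
```

Note how the tail percentiles surface the slow outliers (250 ms, 900 ms) that the median completely hides, which is why p95/p99 are the usual targets for latency SLOs.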

Availability

The percentage of time the system is operational.

  • 99.9% (three 9s) = ~8.76 hours downtime/year.
  • 99.99% (four 9s) = ~52.6 minutes downtime/year.
  • 99.999% (five 9s) = ~5.26 minutes downtime/year.
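The downtime figures above follow directly from the availability fraction; a quick back-of-the-envelope check:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def downtime_per_year_hours(availability):
    """Hours of permitted downtime per year at a given availability level."""
    return (1 - availability) * HOURS_PER_YEAR

three_nines_hours = downtime_per_year_hours(0.999)          # ~8.76 hours
four_nines_minutes = downtime_per_year_hours(0.9999) * 60   # ~52.6 minutes
five_nines_minutes = downtime_per_year_hours(0.99999) * 60  # ~5.26 minutes
```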

Capacity

Maximum load a system can handle before performance degrades.


Scaling Strategies

Stateless Architecture

  • Store no session data on application servers.
  • Use external stores (Redis, Memcached, database) for state.
  • Any server can handle any request → easy horizontal scaling.

Client → Load Balancer → Any App Server → Shared State Store (Redis)

Database Scaling

Strategy             Description
--------             -----------
Read Replicas        Multiple copies of the database handle read queries
Sharding             Split data across multiple databases by a key
Denormalization      Reduce joins by storing redundant data
Connection Pooling   Reuse DB connections to reduce overhead
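A minimal sketch of hash-based sharding, assuming a fixed shard count (the names here are illustrative, not from any specific library):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Map a key to a shard deterministically via a stable hash.
    md5 is used here for cross-process stability, not for security."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard.
assert shard_for("user:42") == shard_for("user:42")

# With many keys, traffic spreads across all shards.
shards_used = {shard_for(f"user:{i}") for i in range(1000)}
```

The trade-off with simple modulo sharding is that changing NUM_SHARDS remaps almost every key; consistent hashing is the usual mitigation when shards are added and removed frequently.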

Asynchronous Processing

  • Use message queues (Kafka, RabbitMQ, SQS) to decouple work.
  • Long-running tasks are processed in the background.
  • The user gets an immediate response; work completes later.

User Request → API Server → Message Queue → Worker → Database
                   ↓
             Immediate ACK
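The flow above can be sketched in-process with a queue and a background worker, standing in for a real broker like Kafka, RabbitMQ, or SQS:

```python
import queue
import threading

tasks = queue.Queue()
results = []

def worker():
    # Drain the queue; a None sentinel signals shutdown.
    while True:
        job = tasks.get()
        if job is None:
            break
        results.append(f"processed:{job}")  # long-running work would go here
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()

# The "API server" enqueues and returns immediately (the ACK);
# the worker completes the jobs in the background.
for job_id in ("a", "b", "c"):
    tasks.put(job_id)

tasks.put(None)  # shutdown sentinel
t.join()
```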

Caching

  • Store frequently accessed data in memory (Redis, Memcached).
  • Dramatically reduces database load and latency.
  • Can be applied at multiple layers: application, database, CDN.
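A minimal read-through cache with a TTL, sketched with a plain dict (in production, Redis or Memcached plays this role):

```python
import time

_cache = {}  # key -> (value, expiry timestamp)
TTL_SECONDS = 60.0
db_reads = 0  # counts how often we fall through to the "database"

def slow_db_lookup(key):
    """Stand-in for an expensive database query."""
    global db_reads
    db_reads += 1
    return f"value-for-{key}"

def get(key):
    """Read-through: serve from cache if fresh, else load and cache."""
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and hit[1] > now:
        return hit[0]
    value = slow_db_lookup(key)
    _cache[key] = (value, now + TTL_SECONDS)
    return value

get("user:1")  # miss → hits the database
get("user:1")  # hit → served from memory, no DB read
```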

Scaling Patterns

1. Load Balancer Pattern

Distribute requests across multiple servers.
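Round-robin, the simplest distribution policy, can be sketched as:

```python
import itertools

class RoundRobinBalancer:
    """Cycle through a fixed pool of backends, one per request."""
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
picks = [lb.pick() for _ in range(6)]
# cycles: a, b, c, a, b, c
```

Production load balancers layer health checks, weighting, and policies such as least-connections on top of this basic rotation.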

2. Database Replication Pattern

Primary handles writes; replicas handle reads.

3. Sharding Pattern

Partition data across multiple database nodes.

4. CQRS (Command Query Responsibility Segregation)

Separate read and write models for independent scaling.

5. Event-Driven Architecture

Components communicate through events, enabling loose coupling and independent scaling.


Amdahl's Law

Defines the theoretical speedup when scaling a system:

    S(N) = 1 / ((1 - P) + P / N)

Where:

  • S(N) = speedup with N processors
  • P = fraction of the program that is parallelizable
  • N = number of processors

Key insight: If only 50% of the work is parallelizable, adding infinite processors only gives a 2x speedup. The serial portion becomes the bottleneck.
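Plugging numbers into the formula makes the ceiling concrete:

```python
def amdahl_speedup(p, n):
    """Speedup with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# 50% parallelizable: each added processor buys less and less.
speedups = [round(amdahl_speedup(0.5, n), 2) for n in (1, 2, 8, 1000)]
# approaches the 1 / (1 - p) = 2x ceiling
```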


Universal Scalability Law (USL)

Extends Amdahl's Law by adding coherence overhead, the cost of keeping distributed nodes consistent:

    C(N) = N / (1 + α(N - 1) + βN(N - 1))

Where:

  • C(N) = relative capacity (throughput) with N nodes
  • α = contention (serialization)
  • β = coherence (crosstalk between nodes)

Key insight: Beyond a certain point, adding more nodes decreases throughput due to coordination overhead.
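The retrograde region is easy to see numerically; with illustrative coefficients α = 0.05 and β = 0.001:

```python
def usl_capacity(n, alpha=0.05, beta=0.001):
    """Relative throughput of n nodes under the Universal Scalability Law."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

throughputs = [usl_capacity(n) for n in (1, 10, 32, 100, 300)]
# throughput rises, peaks, then falls as coordination costs dominate
```

With these (made-up) coefficients the peak sits near N = sqrt((1 - α) / β) ≈ 31 nodes; beyond it, adding hardware reduces total throughput.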


Real-World Scaling Examples

Company     Strategy
-------     --------
Netflix     Microservices + auto-scaling on AWS
Twitter     Fan-out on write for timelines; Redis for caching
Uber        Geospatial sharding; separate services per domain
Google      Custom infrastructure (Bigtable, Spanner, MapReduce)
Instagram   Cassandra for feeds; PostgreSQL for relational data

Common Bottlenecks

  1. Single database: All reads/writes go to one DB.
  2. CPU-bound services: Heavy computation on a single node.
  3. Memory limits: Data exceeds available RAM.
  4. Disk I/O: Slow storage access under heavy write load.
  5. Network bandwidth: Saturated network links between services.
  6. Lock contention: Threads waiting on shared resources.
  7. DNS resolution: Single DNS endpoint with no failover.

Summary

Concept           Vertical Scaling    Horizontal Scaling
-------           ----------------    ------------------
Approach          Bigger machine      More machines
Complexity        Low                 High
Cost curve        Exponential         Linear
Fault tolerance   Low                 High
Ceiling           Hardware limits     Practically unlimited
Downtime risk     High                Low

Rule of thumb: Start with vertical scaling for simplicity. Move to horizontal scaling when you approach the limits of a single machine or need fault tolerance.