🎉 All 30 days are live — the full DSA-30 course, from Big-O to System Design. See the roadmap →
Day 28 - System Design 101Scaling & Trade-offs

Scaling & Trade-offs

A design that works for 1,000 users and a design that works for 100 million are different machines. This page is about the moves you make to cross that gap — and, more importantly, the trade-offs each move forces on you. Every scaling technique buys you something and charges you something. Naming the charge is what scores points.

Vertical vs horizontal scaling

There are exactly two ways to handle more load:

Scale up (a bigger box) vs scale out (more boxes)
buy biggerServer4 CPU / 16 GBSame Server64 CPU / 512 GBLoad BalancerServerServerServer
Left: vertical scaling — replace the box with a beefier one. Right: horizontal scaling — add identical boxes behind a load balancer. Vertical is simpler but hits a hard ceiling; horizontal is the only path to true scale.
Vertical (scale up)Horizontal (scale out)
Movebigger CPU / more RAM on one machinemore machines behind a load balancer
Prosdead simple, no code changes, no distributed-systems problems(almost) unlimited ceiling, fault-tolerant (one dies, others serve)
Conshard ceiling (biggest box money can buy), single point of failure, cost grows super-linearlyrequires statelessness, a load balancer, and you inherit distributed-systems pain
Verdictthe right first movethe only path past one machine’s limits

The golden rule that makes horizontal scaling possible: keep app servers stateless. If a server stores session data in its own memory, requests must return to that exact server (sticky sessions) — and when it dies, that state is gone. Push all state out of the app server into a shared store (Redis, the database). Then every server is interchangeable, the load balancer can route anywhere, and adding capacity is just adding boxes.

Replication — copies for safety and read scale

Replication keeps multiple copies of your data on different machines. It buys two things: durability (one disk dies, the data survives) and read throughput (reads can hit any copy).

The common pattern is primary-replica (a.k.a. leader-follower): all writes go to the primary, which streams its changes to read-only replicas. Reads fan out across the replicas.

Primary-replica: writes to one, reads from many
writesreadsreplicateApp ServersPrimary DBall writesReplicareadsReplicareadsReplicareads
Perfect for read-heavy workloads (the common case). The catch: replication takes time, so a read hitting a replica a moment after a write may see stale data — that's replication lag, a form of eventual consistency.

The trade-off: replication lag. Copying the write to replicas isn’t instant, so a read right after a write might land on a not-yet-updated replica and return stale data. Fixes include reading-your-own-writes from the primary, or accepting the staleness where it’s harmless (a like count can lag; your bank balance can’t).

Sharding — splitting data when it won’t fit

Replication copies the whole dataset to each machine — great until the dataset is bigger than one machine can hold (our URL shortener’s 90 TB). Sharding (partitioning) splits the data into pieces and puts each piece on a different machine. Each shard holds a subset of the data.

Sharding by key range — each shard owns a slice of the keyspace
Router / Apphash(key) → shardShard Ausers a–hShard Busers i–pShard Cusers q–z
The shard key decides which machine a row lives on. Pick it well and load spreads evenly; pick it badly and one shard gets all the traffic (a 'hot shard').

The whole game is the shard key:

  • Hash-basedshard = hash(key) % N. Spreads load evenly, but range queries become impossible (adjacent keys scatter across shards), and changing N reshuffles almost everything (see consistent hashing below).
  • Range-basedshard = which range key falls in (a–h, i–p, …). Range queries stay efficient, but you risk hot shards if data isn’t uniform (everyone named with ‘S’, or all of today’s timestamps landing on one shard).
⚠️

Sharding’s hidden costs. Once data is split: (1) cross-shard queries and joins become slow or impossible — a query touching many shards must scatter-gather and merge. (2) Transactions across shards are hard (no single DB to give you ACID). (3) Rebalancing when you add a shard is painful. (4) A bad shard key creates a hot shard that bottlenecks the whole system. Shard only when you must — it’s a one-way door that complicates everything downstream.

Consistent hashing — sharding that survives change

Naive hash(key) % N has a fatal flaw: change N (add or remove a server) and almost every key remaps to a different server — catastrophic for a cache (every entry suddenly misses) or a sharded store (you move nearly all the data).

Consistent hashing fixes this. Map both servers and keys onto a circular ring (0 to 2³²). A key belongs to the first server clockwise from it. Now when a server joins or leaves, only the keys between it and its neighbor move — about 1/N of the keys, not all of them.

The hash ring — a key belongs to the next server clockwise
→ A→ B→ BServer Akey1Server Bkey2Server Ckey3
Each key hashes to a point on the ring and is owned by the next server clockwise. Remove Server B and only key1 + key2 move (to the next server clockwise) — every other key stays put. Real systems add 'virtual nodes' (many ring points per server) for even distribution.

This is the answer when an interviewer asks “how do you add a cache node without invalidating the whole cache?” or “how does your sharded store rebalance?” Consistent hashing is the backbone of Cassandra, DynamoDB, and most distributed caches.

The CAP theorem

The most famous trade-off in distributed systems, and the one interviewers probe to see if you understand it or just memorized the letters.

The claim: when a network partition (P) happens — some nodes can’t talk to others — a distributed system must choose between consistency (C: every read sees the latest write) and availability (A: every request gets a non-error response). You cannot have both during the partition.

CAP theorem — under a network partition you may keep only two
Tap a pair. Since real distributed systems must tolerate partitions (P), the real choice is CP vs AP.
CConsistencyAAvailabilityPPartition tolerance
CP — Consistent + Partition-tolerant
When the network splits, the system refuses to answer rather than return stale or conflicting data. You sacrifice availability: some requests error or block until the partition heals. Pick this when correctness beats uptime — money, inventory, locks.
Examples: HBase, MongoDB (default), ZooKeeper, etcd, Spanner.

The crucial nuance most candidates miss: partitions are not optional. In any real distributed system the network will drop messages, so P is a given. That means the real choice is never “pick 2 of 3” — it’s the binary CP vs AP:

  • CP — on a partition, refuse to answer rather than risk stale/conflicting data. Choose when correctness beats uptime: payments, inventory, locks, leader election.
  • AP — on a partition, keep answering from local state and reconcile later. Choose when uptime beats perfect freshness: feeds, shopping carts, DNS, metrics, likes.

CAP is about behavior during a partition, not all the time. When the network is healthy, a well-built system delivers both consistency and availability. CAP only forces the choice in the (rare but inevitable) moment nodes can’t communicate. The follow-up framework, PACELC, adds: else (E), when there’s no partition, you still trade Latency vs Consistency. Strong consistency costs round trips; that’s why even healthy systems often relax it for speed.

Consistency models — a spectrum, not a switch

“Consistency” isn’t binary. It’s a dial from expensive-and-correct to cheap-and-fast:

ModelGuaranteeCostUsed for
Strongevery read sees the latest write, alwaysslow — needs coordination/round tripsbank balances, locks
Read-your-writesyou see your own writes immediately (others may lag)moderate”I posted it, I should see it”
Eventualreplicas converge eventually if writes stopcheap, fast, highly availablelikes, view counts, DNS, feeds

The art is matching the model to the data. A like count can be eventually consistent — nobody’s hurt if it reads 41 instead of 42 for a second. A bank balance cannot. Picking the weakest model that’s still correct for the use case is a senior move, because stronger consistency always costs latency and availability.

The scaling decision tree

Start with one box (vertical)

Simplest possible thing. Scale up until it’s not enough or too expensive.

Make app servers stateless, add a load balancer (horizontal)

Push state to a shared store. Now add app servers freely.

Add a cache + read replicas

Reads almost always dominate. A cache and read replicas absorb the read load and protect the database.

Shard the database (only when forced)

When the data or write volume exceeds one machine, split it. Choose a shard key carefully; use consistent hashing to rebalance gracefully.

Go async with queues; go global with CDNs/multi-region

Offload slow work to queues. Put data near users with CDNs and regional replicas, accepting the eventual-consistency that distance forces.

Quick check

You're sharding a cache with hash(key) % N. You add one server, going from 4 to 5. What happens, and what's the fix?
A distributed shopping-cart service hits a network partition. The team decides the cart should still accept 'add to cart' even if replicas can't sync, reconciling later. On the CAP spectrum, what did they choose and why is it reasonable?
Why is keeping app servers STATELESS the precondition for horizontal scaling?

Next: Case Study — URL Shortener — the full framework applied end-to-end, where every concept from these pages shows up in one design.