Scaling & Trade-offs
A design that works for 1,000 users and a design that works for 100 million are different machines. This page is about the moves you make to cross that gap — and, more importantly, the trade-offs each move forces on you. Every scaling technique buys you something and charges you something. Naming the charge is what scores points.
Vertical vs horizontal scaling
There are exactly two ways to handle more load:
| Vertical (scale up) | Horizontal (scale out) | |
|---|---|---|
| Move | bigger CPU / more RAM on one machine | more machines behind a load balancer |
| Pros | dead simple, no code changes, no distributed-systems problems | (almost) unlimited ceiling, fault-tolerant (one dies, others serve) |
| Cons | hard ceiling (biggest box money can buy), single point of failure, cost grows super-linearly | requires statelessness, a load balancer, and you inherit distributed-systems pain |
| Verdict | the right first move | the only path past one machine’s limits |
The golden rule that makes horizontal scaling possible: keep app servers stateless. If a server stores session data in its own memory, requests must return to that exact server (sticky sessions) — and when it dies, that state is gone. Push all state out of the app server into a shared store (Redis, the database). Then every server is interchangeable, the load balancer can route anywhere, and adding capacity is just adding boxes.
Replication — copies for safety and read scale
Replication keeps multiple copies of your data on different machines. It buys two things: durability (one disk dies, the data survives) and read throughput (reads can hit any copy).
The common pattern is primary-replica (a.k.a. leader-follower): all writes go to the primary, which streams its changes to read-only replicas. Reads fan out across the replicas.
The trade-off: replication lag. Copying the write to replicas isn’t instant, so a read right after a write might land on a not-yet-updated replica and return stale data. Fixes include reading-your-own-writes from the primary, or accepting the staleness where it’s harmless (a like count can lag; your bank balance can’t).
Sharding — splitting data when it won’t fit
Replication copies the whole dataset to each machine — great until the dataset is bigger than one machine can hold (our URL shortener’s 90 TB). Sharding (partitioning) splits the data into pieces and puts each piece on a different machine. Each shard holds a subset of the data.
The whole game is the shard key:
- Hash-based —
shard = hash(key) % N. Spreads load evenly, but range queries become impossible (adjacent keys scatter across shards), and changingNreshuffles almost everything (see consistent hashing below). - Range-based —
shard = which range key falls in(a–h, i–p, …). Range queries stay efficient, but you risk hot shards if data isn’t uniform (everyone named with ‘S’, or all of today’s timestamps landing on one shard).
Sharding’s hidden costs. Once data is split: (1) cross-shard queries and joins become slow or impossible — a query touching many shards must scatter-gather and merge. (2) Transactions across shards are hard (no single DB to give you ACID). (3) Rebalancing when you add a shard is painful. (4) A bad shard key creates a hot shard that bottlenecks the whole system. Shard only when you must — it’s a one-way door that complicates everything downstream.
Consistent hashing — sharding that survives change
Naive hash(key) % N has a fatal flaw: change N (add or remove a server) and almost every key remaps to a different server — catastrophic for a cache (every entry suddenly misses) or a sharded store (you move nearly all the data).
Consistent hashing fixes this. Map both servers and keys onto a circular ring (0 to 2³²). A key belongs to the first server clockwise from it. Now when a server joins or leaves, only the keys between it and its neighbor move — about 1/N of the keys, not all of them.
This is the answer when an interviewer asks “how do you add a cache node without invalidating the whole cache?” or “how does your sharded store rebalance?” Consistent hashing is the backbone of Cassandra, DynamoDB, and most distributed caches.
The CAP theorem
The most famous trade-off in distributed systems, and the one interviewers probe to see if you understand it or just memorized the letters.
The claim: when a network partition (P) happens — some nodes can’t talk to others — a distributed system must choose between consistency (C: every read sees the latest write) and availability (A: every request gets a non-error response). You cannot have both during the partition.
The crucial nuance most candidates miss: partitions are not optional. In any real distributed system the network will drop messages, so P is a given. That means the real choice is never “pick 2 of 3” — it’s the binary CP vs AP:
- CP — on a partition, refuse to answer rather than risk stale/conflicting data. Choose when correctness beats uptime: payments, inventory, locks, leader election.
- AP — on a partition, keep answering from local state and reconcile later. Choose when uptime beats perfect freshness: feeds, shopping carts, DNS, metrics, likes.
CAP is about behavior during a partition, not all the time. When the network is healthy, a well-built system delivers both consistency and availability. CAP only forces the choice in the (rare but inevitable) moment nodes can’t communicate. The follow-up framework, PACELC, adds: else (E), when there’s no partition, you still trade Latency vs Consistency. Strong consistency costs round trips; that’s why even healthy systems often relax it for speed.
Consistency models — a spectrum, not a switch
“Consistency” isn’t binary. It’s a dial from expensive-and-correct to cheap-and-fast:
| Model | Guarantee | Cost | Used for |
|---|---|---|---|
| Strong | every read sees the latest write, always | slow — needs coordination/round trips | bank balances, locks |
| Read-your-writes | you see your own writes immediately (others may lag) | moderate | ”I posted it, I should see it” |
| Eventual | replicas converge eventually if writes stop | cheap, fast, highly available | likes, view counts, DNS, feeds |
The art is matching the model to the data. A like count can be eventually consistent — nobody’s hurt if it reads 41 instead of 42 for a second. A bank balance cannot. Picking the weakest model that’s still correct for the use case is a senior move, because stronger consistency always costs latency and availability.
The scaling decision tree
Start with one box (vertical)
Simplest possible thing. Scale up until it’s not enough or too expensive.
Make app servers stateless, add a load balancer (horizontal)
Push state to a shared store. Now add app servers freely.
Add a cache + read replicas
Reads almost always dominate. A cache and read replicas absorb the read load and protect the database.
Shard the database (only when forced)
When the data or write volume exceeds one machine, split it. Choose a shard key carefully; use consistent hashing to rebalance gracefully.
Go async with queues; go global with CDNs/multi-region
Offload slow work to queues. Put data near users with CDNs and regional replicas, accepting the eventual-consistency that distance forces.
Quick check
Next: Case Study — URL Shortener — the full framework applied end-to-end, where every concept from these pages shows up in one design.