🎉 All 30 days are live — the full DSA-30 course, from Big-O to System Design. See the roadmap →

Design a Distributed Key-Value Store hard

The prompt

Build a get(key) / put(key, value) store that scales to far more data than one machine holds, stays available when machines fail, and survives the network splitting. This is the “design Dynamo/Cassandra” question — it forces every distributed-systems concept from the scaling page into one design.

Requirements

  • Functional: get(key), put(key, value). That’s the whole API — simplicity is the point.
  • Non-functional: horizontally scalable (add nodes for more capacity), highly available (a node failure isn’t an outage), fault-tolerant (survives partitions), tunable consistency (let the caller choose strong-ish vs fast).

Estimation

Designed for the “doesn’t fit / outgrows one box” regime: terabytes-to-petabytes of data, hundreds of thousands of ops/sec, spread across a large cluster. The whole design assumes no single machine is special — there’s no primary.

Core idea 1 — partition with consistent hashing

To spread keys across nodes and rebalance gracefully when nodes join/leave, place nodes and keys on a hash ring. A key is owned by the first node clockwise. Adding or removing a node only moves the keys in that one arc (~1/N), not the whole dataset.

Consistent hashing ring — keys owned by the next node clockwise, replicated to the following N-1
ownerreplicaNode 1key 'x'Node 2Node 3Node 4
key 'x' hashes to a ring position; its owner is the next node clockwise (Node 2), and copies live on the following N-1 nodes (Node 3, …) for replication. Add a node and only the keys in its arc move — the rest of the cluster is undisturbed.

Core idea 2 — replication

Each key is stored on N nodes (e.g. N=3): its owner plus the next N-1 nodes clockwise. This gives durability (lose a node, the data lives on two others) and availability (reads/writes can go to any replica).

Core idea 3 — quorum for tunable consistency

With N replicas, let the caller tune consistency via two knobs:

  • W = how many replicas must acknowledge a write before it’s “successful.”
  • R = how many replicas must respond to a read.

The quorum rule: if W + R > N, reads and writes overlap on at least one node, so a read is guaranteed to see the latest write — strong consistency. Tune the knobs to the workload:

  • W=N, R=1 → fast reads, slow/strict writes (read-heavy, must-be-fresh reads).
  • W=1, R=N → fast writes, slow reads.
  • W + R ≤ N → no overlap guarantee → faster, but eventually consistent (you might read stale data).

This single inequality is the thing to know for this question — it turns “consistency” from a yes/no into a dial the caller controls per operation.

High-level design

Coordinator routes to N replicas on the ring; quorum decides success
put(k,v)ClientCoordinatorany nodeReplica 1Replica 2Replica 3
Any node can act as coordinator: it hashes the key, finds the N replicas on the ring, and writes to all of them — returning success once W acknowledge. Reads gather from R replicas and return the newest version.

Deep dives

  • CAP choice: a Dynamo-style store is typically AP — it stays available during a partition and resolves conflicts later (eventual consistency). With high W+R you can lean toward CP. This is the CAP theorem made concrete: the W/R knobs are the C-vs-A dial.
  • Conflict resolution: when replicas disagree (concurrent writes during a partition), use vector clocks to detect causality, and resolve by “last-write-wins” or by handing both versions to the application to merge.
  • Failure detection & healing: nodes gossip health to each other; a temporarily-down replica’s writes are held by a healthy node (hinted handoff) and replayed when it returns; background anti-entropy (Merkle-tree comparison) repairs drifted replicas.
  • Virtual nodes: give each physical node many ring positions so load and the rebalancing burden spread evenly when nodes change.

Analysis

  • Scalability: linear — add nodes, the ring absorbs them, capacity and throughput grow.
  • Availability: survives node and (with AP tuning) partition failures.
  • Consistency: tunable per-operation via W and R.

Same skin

  • Cassandra, DynamoDB, Riak, Voldemort — all real instances of this exact design.
  • Distributed caches (a cache is a KV store with eviction) use the same consistent-hashing ring.
  • Consistent hashing and the CAP theorem from the scaling page are the load-bearing ideas here — this problem is where they come together.