Design a Chat App (WhatsApp / Messenger) [medium]

The prompt

Two users exchange messages in real time: 1:1 chat, delivery within a second, message history preserved, and the sender sees sent/delivered/read receipts. The defining challenge is server-initiated, real-time delivery — the server must push to the recipient, not wait to be polled.

Requirements

  • Functional: send/receive 1:1 messages in real time; persist history; show online presence and delivery/read receipts.
  • Non-functional: low latency (sub-second delivery), reliable (no lost messages, even if the recipient is offline), ordered (messages appear in send order within a conversation).

Estimation

500 M DAU, ~40 messages each/day → 20 B messages/day ≈ 230k/s (peak higher). Each message is small (~100s of bytes). The harder number is concurrent connections: hundreds of millions of users hold a persistent connection — that’s the real scaling pressure, not raw QPS.
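
The arithmetic behind this estimate can be checked in a few lines. This is a rough sketch: the 200-byte message size is an assumed stand-in for "100s of bytes", not a figure from the estimate itself.

```python
# Back-of-envelope numbers for the estimate above.
DAU = 500_000_000             # daily active users
MSGS_PER_USER_PER_DAY = 40    # average messages sent per user
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: ~200 bytes stands in for "100s of bytes" per message.
ASSUMED_BYTES_PER_MSG = 200

messages_per_day = DAU * MSGS_PER_USER_PER_DAY
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY
raw_bytes_per_day = messages_per_day * ASSUMED_BYTES_PER_MSG

print(f"{messages_per_day:,} messages/day")                  # 20,000,000,000 messages/day
print(f"{avg_messages_per_sec:,.0f} messages/s on average")  # 231,481 messages/s on average
print(f"{raw_bytes_per_day / 1e12:.0f} TB/day of raw payload")
```

Under these assumptions the payload volume is a few terabytes per day, which is why the connection count, not the byte count, is treated as the harder problem.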

The core decision: how does the server push to the client?

| Approach | How | Verdict |
| --- | --- | --- |
| Short polling | client asks “anything new?” every few seconds | wasteful and laggy; most polls return nothing |
| Long polling | request hangs open until there’s data or timeout | better, but reconnect churn |
| WebSocket | one persistent, bidirectional TCP connection | the answer: the server pushes instantly over an open socket |
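
To put a number on "most polls return nothing", here is a small worked calculation. The 5-second interval and the figure of 40 received messages per day are assumptions chosen for illustration, and the client is assumed to poll all day, so treat the result as an upper bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60
POLL_INTERVAL_S = 5           # assumption: one reading of "every few seconds"
MSGS_RECEIVED_PER_DAY = 40    # assumption: mirrors the ~40 messages sent per user

polls_per_day = SECONDS_PER_DAY // POLL_INTERVAL_S

# Best case for polling: every received message arrives in its own poll.
useful_polls = min(MSGS_RECEIVED_PER_DAY, polls_per_day)
empty_fraction = 1 - useful_polls / polls_per_day

# A message arriving at a random moment waits half an interval on average.
avg_delay_s = POLL_INTERVAL_S / 2

print(f"{polls_per_day:,} polls per user per day")
print(f"{empty_fraction:.1%} of polls return nothing")
print(f"{avg_delay_s} s average added delay")
```

Even in the best case, more than 99% of requests carry no data, and the average delay already exceeds the sub-second delivery target.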

WebSockets are the heart of any chat design. Unlike HTTP request/response, a WebSocket is a long-lived, two-way pipe: once open, the server can push a message to the client the instant it arrives, with no polling. The cost is stateful connections — each gateway server holds many open sockets and must remember which user is connected to which server, which is the central scaling problem below.
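
The push model can be sketched without a real network. In this minimal illustration an `asyncio.Queue` stands in for each open WebSocket, and the `Gateway` class and its method names are hypothetical; a real gateway would use a WebSocket library instead.

```python
import asyncio


class Gateway:
    """Hypothetical gateway: holds one open 'socket' per connected user."""

    def __init__(self, name):
        self.name = name
        self.sockets = {}  # user_id -> asyncio.Queue (stand-in for a WebSocket)

    def connect(self, user_id):
        sock = asyncio.Queue()
        self.sockets[user_id] = sock
        return sock

    def disconnect(self, user_id):
        self.sockets.pop(user_id, None)

    def push(self, user_id, message):
        """Server-initiated send. Returns False if the user has no socket here."""
        sock = self.sockets.get(user_id)
        if sock is None:
            return False
        sock.put_nowait(message)
        return True


async def client(sock, received):
    # The client only waits on the open connection; it never asks "anything new?".
    received.append(await sock.get())


async def demo():
    gw = Gateway("gw-1")
    received = []
    sock = gw.connect("bob")
    waiter = asyncio.create_task(client(sock, received))
    await asyncio.sleep(0)                       # let the client start waiting
    pushed = gw.push("bob", "hi from alice")     # server pushes the moment data exists
    await waiter
    not_here = gw.push("carol", "hello?")        # carol has no socket on this gateway
    return received, pushed, not_here


received, pushed, not_here = asyncio.run(demo())
print(received, pushed, not_here)  # ['hi from alice'] True False
```

The `sockets` dictionary is the state that makes gateways stateful: a push for `carol` fails here because her connection, if any, lives on some other server.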

High-level design

Persistent WebSocket connections, a session registry, and durable storage
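[Diagram: User A and User B each connect over a WebSocket to a gateway (WS Gateway 1 and WS Gateway 2, which hold the sockets). Other components: Message Svc, Session Registry (user → gateway), Messages DB (history), Offline Queue. Arrow labels: send, where is B?, persist, push to B, if offline.]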
Each user holds a WebSocket to some gateway. A session registry (Redis) maps user → gateway so the message service knows where to route. Every message is persisted to the DB for history; if the recipient is offline, it waits in a queue and a push notification is sent.

Sending a message:

  1. User A sends over their WebSocket to Gateway 1 → Message Service.
  2. Message Service persists it (history + the “delivered later” guarantee).
  3. It looks up B in the session registry: which gateway holds B’s socket?
  4. If B is online → route to B’s gateway → push over B’s socket. If B is offline → store as undelivered and fire a push notification; deliver when B reconnects.
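
The four steps above can be sketched with in-memory stand-ins for each component. Every class, field, and method name here is hypothetical; plain dictionaries and lists replace Redis, the database, and the queue.

```python
class ChatBackend:
    """In-memory stand-ins for the boxes in the design (illustrative only)."""

    def __init__(self):
        self.messages_db = []         # durable history (a list here)
        self.session_registry = {}    # user_id -> gateway name
        self.gateway_outbox = {}      # gateway name -> [(user_id, message), ...]
        self.offline_queue = {}       # user_id -> undelivered messages
        self.push_notifications = []  # user_ids notified while offline

    def connect(self, user_id, gateway):
        self.session_registry[user_id] = gateway
        # Deliver anything that queued up while the user was away.
        for msg in self.offline_queue.pop(user_id, []):
            self._push(gateway, user_id, msg)

    def disconnect(self, user_id):
        self.session_registry.pop(user_id, None)

    def _push(self, gateway, user_id, msg):
        self.gateway_outbox.setdefault(gateway, []).append((user_id, msg))

    def send(self, msg_id, sender, recipient, text):
        msg = {"id": msg_id, "from": sender, "to": recipient, "text": text}
        self.messages_db.append(msg)                    # step 2: persist first
        gateway = self.session_registry.get(recipient)  # step 3: where is B?
        if gateway is not None:                         # step 4, online branch
            self._push(gateway, recipient, msg)
            return "pushed"
        # Step 4, offline branch: hold the message and notify.
        self.offline_queue.setdefault(recipient, []).append(msg)
        self.push_notifications.append(recipient)
        return "queued"


backend = ChatBackend()
backend.connect("alice", "gw-1")
first = backend.send("m1", "alice", "bob", "you there?")     # bob is offline
backend.connect("bob", "gw-2")                               # queued message is delivered
second = backend.send("m2", "alice", "bob", "welcome back")  # bob is online now
delivered_ids = [m["id"] for _, m in backend.gateway_outbox["gw-2"]]
print(first, second, delivered_ids)  # queued pushed ['m1', 'm2']
```

Note that `send` writes to `messages_db` before it touches the registry, so both branches of step 4 start from a message that is already stored.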

Deep dives

  • The session registry is the key scaling piece: with sockets spread across thousands of gateways, you need a fast user_id → gateway map (Redis) so a message can find its recipient’s connection. This decouples “who’s connected where” from the message logic.
  • Ordering: attach a per-conversation sequence number so the client can sort messages and spot gaps; a timestamp alone is weaker because clocks on different devices drift. Route each conversation through a consistent partition so its messages are sequenced in one place.
  • Delivery guarantees & receipts: the persist-then-deliver flow gives at-least-once delivery; the client ACKs receipt (→ “delivered”), and a read event flows back as another small message (→ “read”). Messages must be idempotent (dedupe by message ID) since a client may retry.
  • Group chat is the natural extension: fan the message out to every member’s gateway (small groups → push to each; this is the news-feed fan-out pattern again).
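
The ordering and dedupe points can be illustrated on the client side. This is a minimal sketch that assumes the server numbers messages 1, 2, 3, ... within a conversation; `ConversationInbox` and its fields are hypothetical names.

```python
class ConversationInbox:
    """Client-side buffer: drops duplicates by message ID, releases in sequence order."""

    def __init__(self):
        self.next_seq = 1      # assumption: per-conversation numbering starts at 1
        self.seen_ids = set()  # message IDs already accepted
        self.pending = {}      # seq -> text, held until the gap before it is filled
        self.displayed = []    # what the user sees, in order

    def receive(self, message_id, seq, text):
        """Return True if the message was new, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False  # at-least-once delivery means retries can repeat a message
        self.seen_ids.add(message_id)
        self.pending[seq] = text
        # Release every message that is now contiguous with what was shown.
        while self.next_seq in self.pending:
            self.displayed.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return True


inbox = ConversationInbox()
r1 = inbox.receive("m2", 2, "second")  # arrives early, held back
r2 = inbox.receive("m1", 1, "first")   # fills the gap, both are released
r3 = inbox.receive("m2", 2, "second")  # retry of m2, ignored
r4 = inbox.receive("m3", 3, "third")
print(inbox.displayed)  # ['first', 'second', 'third']
print(r1, r2, r3, r4)   # True True False True
```

In a fuller client, a `True` result would be a natural point to send the "delivered" ACK back, while a duplicate could simply be re-ACKed without being shown again.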

⚠️ Persist before you deliver. If you push to the recipient but crash before saving, an offline or reconnecting user loses the message forever. Writing to durable storage first (then delivering) is what lets you guarantee “no message is ever lost,” which is non-negotiable for a messaging product.

Analysis

  • Delivery latency: sub-second when both users are online (one persist, one registry lookup, one socket push).
  • Connection load: the dominant cost. Hundreds of millions of concurrent WebSockets are spread across a gateway fleet and tracked in the session registry.
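
A rough sizing sketch shows how connection load turns into fleet size. All three inputs are assumptions picked for illustration: the concurrent-connection count, the sockets each gateway can hold, and the bytes per registry entry.

```python
import math

# Assumptions for illustration only; none of these figures come from a measurement.
ASSUMED_CONCURRENT_CONNECTIONS = 200_000_000  # one reading of "hundreds of millions"
ASSUMED_SOCKETS_PER_GATEWAY = 100_000         # depends heavily on hardware and tuning
ASSUMED_REGISTRY_BYTES_PER_ENTRY = 100        # user_id, gateway id, and overhead

gateways_needed = math.ceil(
    ASSUMED_CONCURRENT_CONNECTIONS / ASSUMED_SOCKETS_PER_GATEWAY
)
registry_bytes = ASSUMED_CONCURRENT_CONNECTIONS * ASSUMED_REGISTRY_BYTES_PER_ENTRY

print(f"{gateways_needed:,} gateways")              # 2,000 gateways
print(f"{registry_bytes / 1e9:.0f} GB of registry")  # 20 GB of registry
```

With these inputs the fleet lands in the thousands of gateways, while the registry itself stays small enough to keep in memory across a modest cluster.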

Same skin

  • Live notifications, multiplayer game state, collaborative editing, stock tickers — all are “server pushes to many connected clients in real time” → WebSockets + a session registry.
  • The offline-queue + push-notification path is exactly the notification service.
  • Group fan-out reuses the news feed push model.