Design a Notification Service (difficulty: hard)

The prompt

A central service other systems call to notify users across multiple channels — mobile push, SMS, email, in-app — reliably and at scale. “Your order shipped,” “new login detected,” “someone liked your post.” The challenge isn’t any single send; it’s doing millions of them reliably through flaky third-party providers.

Requirements

  • Functional: accept a “notify user U with content C via channels X” request; deliver across push (APNs/FCM), SMS, email; respect user preferences and opt-outs.
  • Non-functional: reliable (don’t drop notifications), scalable (millions/day, bursty), no duplicates (don’t send the same alert twice), decoupled (a slow email provider shouldn’t block push).

Estimation

10 M notifications/day ÷ 86,400 s/day ≈ 116/s on average, but traffic is very bursty (a marketing blast or an incident can spike to tens of thousands/s). Burstiness plus unreliable external providers makes this a textbook queue + worker design.
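The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch: the 20,000/s burst figure is an assumed value chosen for illustration (the text only says "tens of thousands/s").

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_notifications = 10_000_000
average_rate = daily_notifications / SECONDS_PER_DAY

# Assumption for illustration only: a burst of 20,000 notifications/s.
assumed_peak_rate = 20_000
peak_to_average = assumed_peak_rate / average_rate

print(f"average: {average_rate:.1f}/s")
print(f"assumed peak is about {peak_to_average:.0f}x the average")
```

The point of the ratio is that provisioning for the average would be swamped by a burst two orders of magnitude larger, which is why a buffer between request and delivery is attractive.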

The core pattern: queue-per-channel with workers

Decouple the request from the delivery. The API validates and enqueues instantly; per-channel worker pools drain the queues and call the external providers at their own pace. A spike just makes the queues longer — nothing falls over.

Figure: one entry point, fan-out to per-channel queues, workers call providers.
Caller Services → notify(U, C) → Notification API (validate + prefs) → Push Queue / SMS Queue / Email Queue → Push Worker / SMS Worker / Email Worker → APNs / Twilio / SES
The API checks user preferences, then fans the request out to a queue per channel. Each channel has its own worker pool and scales independently — a backed-up email provider never slows push delivery. Workers handle the messy provider integrations.

The flow:

  1. A service calls POST /notify with user, content, and channels.
  2. The API validates, checks user preferences/opt-outs, looks up device tokens / phone / email, and enqueues a job per channel.
  3. Per-channel workers dequeue and call the external provider (APNs, FCM, Twilio, SES), handling provider-specific quirks.
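Steps 1 and 2 of the flow could look roughly like the sketch below. Everything here is hypothetical: the `notify` function, the `Job` shape, and the in-memory `PREFERENCES`, `CONTACTS`, and `QUEUES` dictionaries stand in for a real preference service, contact lookup, and message broker.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical in-memory stand-ins; a real service would use a database
# for preferences/contacts and a persistent broker for the queues.
PREFERENCES = {"u1": {"push": True, "sms": False, "email": True}}
CONTACTS = {"u1": {"push": "device-token-abc", "sms": "+15550100", "email": "u1@example.com"}}
QUEUES = {"push": deque(), "sms": deque(), "email": deque()}


@dataclass(frozen=True)
class Job:
    notification_id: str
    channel: str
    address: str
    content: str


def notify(notification_id, user_id, content, channels):
    """Validate the request, apply opt-outs, and enqueue one job per channel.

    Returns the list of channels that were actually enqueued.
    """
    if user_id not in CONTACTS:
        raise ValueError(f"unknown user: {user_id}")
    unsupported = [c for c in channels if c not in QUEUES]
    if unsupported:
        raise ValueError(f"unsupported channels: {unsupported}")

    prefs = PREFERENCES.get(user_id, {})
    enqueued = []
    for channel in channels:
        if not prefs.get(channel, False):
            continue  # user opted out of this channel
        address = CONTACTS[user_id].get(channel)
        if address is None:
            continue  # no device token / phone / email on file
        QUEUES[channel].append(Job(notification_id, channel, address, content))
        enqueued.append(channel)
    return enqueued
```

With the sample data above, calling `notify("n1", "u1", "Your order shipped", ["push", "sms", "email"])` enqueues push and email jobs and skips SMS because the user opted out. The API returns as soon as the jobs are queued; no provider is called on the request path.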

Deep dives

  • Reliability & retries: providers fail transiently. Workers retry with exponential backoff; after N failures, route to a dead-letter queue for inspection rather than silently dropping. Because the queue is persistent and a job is acknowledged only after a successful send, a notification isn’t lost if a worker crashes mid-send; the job is simply redelivered.
  • Idempotency (no duplicates): queues give at-least-once delivery, so a job can be processed twice. Attach a dedupe key (notification ID) and track sent IDs (in Redis) so a retry of an already-sent notification is a no-op. This is the most important correctness detail — nobody wants the same SMS twice.
  • Rate limiting: both to respect provider quotas (Twilio caps your send rate) and to avoid spamming a user — cap notifications per user per time window. Reuse the rate limiter.
  • Preferences & templating: a preference service (opt-outs, quiet hours, per-channel settings) gates sends; a template service renders content per channel/locale.
  • Priority: a 2FA code must beat a marketing blast — separate high/low priority queues so transactional notifications jump the line.
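The retry bullet above can be sketched as a small worker helper. This is a minimal illustration under stated assumptions, not a production implementation: `TransientProviderError`, `backoff_delays`, and `deliver_with_retries` are hypothetical names, the dead-letter queue is a plain list, and the delay values (0.5 s base, 30 s cap) are arbitrary.

```python
import time


class TransientProviderError(Exception):
    """Hypothetical error type for a retryable provider failure."""


def backoff_delays(max_attempts, base=0.5, cap=30.0):
    """Delays between attempts: base * 2^n seconds, capped at `cap`."""
    return [min(cap, base * 2 ** n) for n in range(max_attempts - 1)]


def deliver_with_retries(job, send, dead_letter_queue, max_attempts=4, sleep=time.sleep):
    """Try `send(job)` up to `max_attempts` times with exponential backoff.

    Returns True on success. After the final failure the job is appended
    to `dead_letter_queue` instead of being dropped, and False is returned.
    """
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            send(job)
            return True
        except TransientProviderError:
            if attempt < len(delays):
                sleep(delays[attempt])  # wait before the next attempt
    dead_letter_queue.append(job)
    return False
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Real systems commonly add random jitter to each delay so that many workers do not retry in lockstep; that is left out here to keep the sketch deterministic.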
⚠️ At-least-once + idempotency is the combo to say out loud. You want at-least-once delivery (retries) so nothing is lost, but that means duplicates are possible, so consumers must dedupe by an idempotency key. “I’d make the workers idempotent with a dedupe key so retries are safe” is the single sentence that signals you’ve built a real queue-based system, not just drawn one.
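A minimal sketch of an idempotent worker, assuming the dedupe key is the pair (notification ID, channel), since the fan-out produces one job per channel. The `IdempotentWorker` class is hypothetical, and the in-memory set stands in for a shared store such as Redis.

```python
class IdempotentWorker:
    """Skips jobs whose dedupe key has already been sent."""

    def __init__(self, send):
        self._send = send
        # Stand-in for a shared store (e.g. Redis); a real deployment needs
        # a store visible to every worker, not per-process memory.
        self._sent_keys = set()

    def handle(self, notification_id, channel, payload):
        """Send once per (notification_id, channel); return True if sent."""
        key = (notification_id, channel)
        if key in self._sent_keys:
            return False  # redelivered job: already sent, so do nothing
        self._send(channel, payload)
        self._sent_keys.add(key)  # record only after a successful send
        return True
```

Recording the key after the send means a crash between the two steps can still produce one duplicate on redelivery; recording it before the send trades that for a possible missed notification. Which side to err on is a design choice, and with multiple workers the check-and-record step would also need to be atomic in the shared store.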

Analysis

  • Throughput: scales per channel by adding workers; queues absorb bursts.
  • Reliability: persistent queues + retries + dead-letter queue → no silent drops.
  • Correctness: idempotency keys prevent duplicate sends under at-least-once delivery.

Same skin

  • The offline-message path of the chat app is a notification service in miniature.
  • Email/SMS marketing platforms, alerting/on-call systems (PagerDuty), order-status pipelines — same queue + worker + retry + idempotency skeleton.
  • Message queues and the fan-out idea from the news feed are the load-bearing patterns.