Design a notification system

A notification system delivers timely messages to users across channels—push, email, SMS, and in-app—when something happens in your product: order shipped, password reset, friend mentioned you, or marketing campaign. Unlike a simple “send email” script, production systems must handle millions of recipients, respect user preferences, survive provider outages, and avoid duplicate or lost messages when everything retries at once.

This guide walks the full interview arc (requirements, capacity, architecture, storage, APIs, channel adapters) and dedicates sections to failure points and failure modes, at the same depth as the URL shortener and payment system guides on this site.

Design prompt

Design a notification platform that accepts events from product services and delivers messages to users on their preferred channels.

Support high throughput, retries, scheduling, and at-least-once ingestion without duplicate user-visible spam.

What you should be able to do after reading:

Separate event ingestion, fan-out, channel routing, and delivery workers.
Size queue throughput for marketing bursts vs steady transactional traffic.
Model templates, user preferences, device tokens, and delivery status.
Explain push (FCM/APNs), email, and SMS provider constraints.
Map failure points and failure modes with idempotency and DLQ recovery.

1. Requirements gathering

1.1 Functional requirements

Send notification — product teams publish an event (e.g. order.shipped) with payload and target user(s).
Multi-channel delivery — push, email, SMS, in-app inbox (and optionally webhook to merchant).
User preferences — per-channel opt-in/out, quiet hours, locale, category toggles (marketing vs transactional).
Templates — parameterized subject/body per channel; i18n variants.
Scheduling — send now or at a future time (reminders, digest at 9am).
Priority — transactional (OTP, security) bypasses marketing caps; rate limits differ.
Delivery tracking — queued, sent, delivered, failed, opened (where provider supports).
Bulk / broadcast (optional) — send to segment or all users (marketing blast).

Usually out of scope unless asked: building your own SMTP server, in-app chat, full campaign A/B analytics platform, WhatsApp Business API nuances.

1.2 Non-functional requirements

Scalability — handle spikes (Black Friday, viral post) without dropping transactional messages.
Availability — ingestion API highly available; delivery can lag briefly but must complete.
Latency — transactional push/email delivered within 30–60 s at p95; marketing can take minutes.
Durability — accepted events are not lost before persistence.
Idempotency / deduplication — the same logical event must not notify the user five times.
Delivery semantics — at-least-once delivery to providers, with deduplication on the consumer side.
Compliance — CAN-SPAM, GDPR consent, SMS opt-in, unsubscribe links for email.
Observability — per-channel success rate, queue lag, provider error codes.

Assumptions for capacity math: 10M daily active users (DAU); 20 notifications per user per day average; 30% push, 50% email, 15% in-app, 5% SMS; peak burst 5× average during campaigns; 3 channels max per logical event after routing.

2. Capacity estimation

2.1 Notification volume

DAU = 10,000,000
Notifications per user per day = 20
Total notifications per day = 10M × 20 = 200,000,000

Average per second = 200M / 86,400 ≈ 2,300/sec
Peak (5×) ≈ 11,500/sec

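Each logical event may fan out to multiple channel jobs, so plan queue consumers for peak channel messages, not just ingest events. With at most 3 channels per event, the worst case is about 11,500 × 3 ≈ 34,500 channel jobs/sec.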
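
The volume arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch using only the assumptions stated in section 1.2; the function and key names are illustrative, and the exact results (about 2,315/sec average and 11,574/sec peak) are what the text rounds to 2,300 and 11,500.

```python
SECONDS_PER_DAY = 86_400


def volume_estimate(dau: int, per_user_per_day: int,
                    peak_factor: float, max_channels: int) -> dict:
    """Return daily and per-second notification volumes from the stated assumptions."""
    per_day = dau * per_user_per_day
    avg_per_sec = per_day / SECONDS_PER_DAY
    peak_per_sec = avg_per_sec * peak_factor
    return {
        "per_day": per_day,
        "avg_per_sec": avg_per_sec,
        "peak_per_sec": peak_per_sec,
        # Worst case: every logical event fans out to the maximum number of channels.
        "peak_channel_jobs_per_sec": peak_per_sec * max_channels,
    }


est = volume_estimate(dau=10_000_000, per_user_per_day=20,
                      peak_factor=5, max_channels=3)
print(f"{est['per_day']:,} notifications per day")
print(f"{est['avg_per_sec']:,.0f}/sec average, {est['peak_per_sec']:,.0f}/sec peak")
print(f"{est['peak_channel_jobs_per_sec']:,.0f} channel jobs/sec worst case")
```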

2.2 Channel breakdown (daily)

Push:   200M × 0.30 = 60M/day  ≈ 694/sec avg
Email:  200M × 0.50 = 100M/day ≈ 1,157/sec avg
In-app: 200M × 0.15 = 30M/day  ≈ 347/sec avg
SMS:    200M × 0.05 = 10M/day  ≈ 116/sec avg

Email and push workers scale independently. SMS is low volume but high cost—strict rate limits.

2.3 Storage

Per notification record (metadata + status history):

Field group	Size (approx.)
IDs, user, template, channel, status	~200 bytes
Payload / rendered body (truncated in DB)	~500 bytes
Timestamps, provider message id	~100 bytes

Per notification ≈ 800 bytes
200M/day × 365 × 800 B ≈ 58 TB/year raw (order of magnitude)

Retention policy: keep 90 days hot in OLTP (200M/day × 90 × 800 B ≈ 14 TB); archive older records to object storage.
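
As a quick check of the storage figures, the sketch below multiplies out the per-record size and daily volume. It assumes decimal terabytes (10^12 bytes) and ignores indexes and replication, which a real sizing exercise would add on top.

```python
TB = 10**12  # decimal terabyte, an assumption for this estimate


def storage_bytes(notifications_per_day: int, bytes_per_record: int, days: int) -> int:
    """Raw storage for a retention window, ignoring indexes and replication."""
    return notifications_per_day * bytes_per_record * days


per_year = storage_bytes(200_000_000, 800, 365)
hot_90d = storage_bytes(200_000_000, 800, 90)
print(f"raw per year: {per_year / TB:.1f} TB")
print(f"90-day hot set: {hot_90d / TB:.1f} TB")
```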

2.4 Queue and bandwidth

Kafka/SQS messages of ~1–2 KB each → ~12–23 MB/s ingest at peak (11,500/sec) if uncompressed JSON; up to ~35–70 MB/s once each event fans out to 3 channel jobs.
Compress payloads; partition by user_id hash for ordering per user.
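
Partitioning by a hash of user_id keeps all messages for one user on one partition, which is what preserves per-user ordering. Kafka clients already hash the message key to choose a partition, so setting the key to user_id is usually enough in practice; the hypothetical helper below only illustrates the idea.

```python
import hashlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user_id to a stable partition index in [0, num_partitions)."""
    # Python's built-in hash() is salted per process, so use a stable digest instead.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# The same user always lands on the same partition.
print(partition_for("usr_abc", 50) == partition_for("usr_abc", 50))
```

Note that changing the partition count remaps users, so ordering guarantees only hold while the count is stable.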

2.5 Infrastructure sizing (starting point)

Component	Initial sizing
Ingest API	6–10 instances; stateless
Kafka cluster	Topic per priority, keyed by user_id; 50+ partitions for parallelism
Push workers	Pool sized to provider limits (FCM multicast up to 500 tokens per call; APNs one notification per HTTP/2 request)
Email workers	Pool sized to SES/SendGrid TPS quota
Scheduler	Cron + delayed queue (Redis ZSET, or delay topics on Kafka, which has no native delayed delivery)
Metadata DB	PostgreSQL sharded by user_id or time partition

3. High-level design

Notification API — accepts send requests from internal services (authenticated).
Event bus — durable queue (Kafka) for decoupling producers from delivery.
Router / orchestrator — resolves template, preferences, locale; emits per-channel jobs.
Channel workers — push, email, SMS, in-app writers; each talks to external providers.
Template service — stores versions; renders with payload.
Preference store — user channel settings, quiet hours, opt-outs.
Device registry — FCM/APNs tokens per device; invalidate a token when the provider reports it as unregistered or invalid.
Scheduler — delayed and recurring notifications.
Status & analytics — delivery callbacks; dashboards.

flowchart TB
  subgraph producers [Product services]
    O[Order service]
    A[Auth service]
    M[Marketing]
  end
  subgraph platform [Notification platform]
    API[Notification API]
    BUS[(Event bus)]
    ORCH[Router / orchestrator]
    TPL[Templates]
    PREF[Preferences]
    DEV[Device registry]
    WP[Push worker]
    WE[Email worker]
    WS[SMS worker]
    WI[In-app worker]
    SCH[Scheduler]
    DB[(Notification store)]
  end
  subgraph providers [External providers]
    FCM[FCM / APNs]
    SES[Email provider]
    TW[SMS provider]
  end
  O --> API
  A --> API
  M --> API
  API --> BUS
  BUS --> ORCH
  ORCH --> TPL
  ORCH --> PREF
  ORCH --> DEV
  ORCH --> WP
  ORCH --> WE
  ORCH --> WS
  ORCH --> WI
  SCH --> BUS
  WP --> FCM
  WE --> SES
  WS --> TW
  API --> DB
  WP --> DB
  WE --> DB
  WS --> DB
  WI --> DB

End-to-end flow

Order service calls POST /v1/notifications with user_id, template_key, and payload, plus an Idempotency-Key header.
API validates, persists notification_request, publishes to Kafka topic by priority.
Router consumes: load preferences → skip opted-out channels → render templates → enqueue channel-specific messages.
Push worker batches device tokens → FCM/APNs → update status sent / failed.
In-app worker writes to user’s notification feed (DB or Cassandra).
Provider feedback (email bounce and complaint webhooks, invalid-token errors in push responses) updates the device registry, suppression list, and delivery status.
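
The router step in this flow (load preferences, skip opted-out channels, emit per-channel jobs) might look roughly like the sketch below. The Prefs class mirrors a subset of the user_preferences columns; the route function, its parameters, and the rule that marketing requires marketing_enabled on every channel are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Prefs:
    push_enabled: bool = True
    email_enabled: bool = True
    sms_enabled: bool = False
    marketing_enabled: bool = False


# in_app has no opt-out flag in the schema, so it is absent from this map.
CHANNEL_FLAG = {"push": "push_enabled", "email": "email_enabled", "sms": "sms_enabled"}


def route(channels, prefs, priority, has_valid_device):
    """Return {channel: (status, skip_reason)} for one notification request."""
    decisions = {}
    for channel in channels:
        flag = CHANNEL_FLAG.get(channel)
        if priority == "marketing" and not prefs.marketing_enabled:
            decisions[channel] = ("skipped", "opted_out")
        elif flag is not None and not getattr(prefs, flag):
            decisions[channel] = ("skipped", "opted_out")
        elif channel == "push" and not has_valid_device:
            decisions[channel] = ("skipped", "no_device")
        else:
            decisions[channel] = ("pending", None)
    return decisions


print(route(["push", "email", "sms", "in_app"], Prefs(), "normal", has_valid_device=False))
```

Recording skipped channels with a reason, instead of silently dropping them, keeps the audit trail complete when a user asks why a message never arrived.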

sequenceDiagram
  participant P as Producer service
  participant API as Notification API
  participant K as Kafka
  participant R as Router
  participant W as Push worker
  participant F as FCM
  participant U as User device
  P->>API: POST notification idempotency_key
  API->>K: publish event
  API-->>P: 202 accepted notification_id
  K->>R: consume
  R->>K: enqueue push.job
  K->>W: consume push.job
  W->>F: send multicast
  F-->>U: push delivered
  W->>W: update status sent

4. Database design

4.1 Core tables

CREATE TABLE notification_requests (
  id                UUID PRIMARY KEY,
  idempotency_key   VARCHAR(128) NOT NULL,
  source_service    TEXT NOT NULL,
  template_key      TEXT NOT NULL,
  user_id           UUID NOT NULL,
  payload           JSONB NOT NULL,
  priority          TEXT NOT NULL DEFAULT 'normal',  -- critical | normal | marketing
  scheduled_at      TIMESTAMPTZ,
  status            TEXT NOT NULL DEFAULT 'accepted',
  created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (source_service, idempotency_key)
);

CREATE TABLE notification_deliveries (
  id                UUID PRIMARY KEY,
  request_id        UUID NOT NULL REFERENCES notification_requests(id),
  user_id           UUID NOT NULL,
  channel           TEXT NOT NULL,  -- push | email | sms | in_app
  rendered_subject  TEXT,
  rendered_body     TEXT,
  status            TEXT NOT NULL,  -- pending | sent | delivered | failed | skipped
  skip_reason       TEXT,           -- opted_out | quiet_hours | no_device
  provider          TEXT,
  provider_msg_id   TEXT,
  attempt_count     INT NOT NULL DEFAULT 0,
  last_error        TEXT,
  created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (request_id, channel)  -- dedup fan-out on router retry (see 4.3)
);

CREATE TABLE user_preferences (
  user_id           UUID PRIMARY KEY,
  push_enabled      BOOLEAN NOT NULL DEFAULT TRUE,
  email_enabled     BOOLEAN NOT NULL DEFAULT TRUE,
  sms_enabled       BOOLEAN NOT NULL DEFAULT FALSE,
  marketing_enabled BOOLEAN NOT NULL DEFAULT FALSE,
  quiet_hours_start TIME,
  quiet_hours_end   TIME,
  timezone          TEXT NOT NULL DEFAULT 'UTC',
  locale            TEXT NOT NULL DEFAULT 'en'
);

CREATE TABLE device_tokens (
  id          UUID PRIMARY KEY,
  user_id     UUID NOT NULL,
  platform    TEXT NOT NULL,  -- ios | android | web
  token       TEXT NOT NULL,
  is_valid    BOOLEAN NOT NULL DEFAULT TRUE,
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (token)
);

CREATE TABLE templates (
  id          UUID PRIMARY KEY,
  template_key TEXT NOT NULL,
  channel     TEXT NOT NULL,
  locale      TEXT NOT NULL DEFAULT 'en',
  version     INT NOT NULL,
  subject_tpl TEXT,
  body_tpl    TEXT NOT NULL,
  UNIQUE (template_key, channel, locale, version)
);

4.2 In-app feed (high volume)

For read-heavy notification centers, store feed items in Cassandra/DynamoDB partitioned by user_id and sorted by created_at DESC. Keep the OLTP notification_deliveries table for audit, and write a separate lean feed row for the UI.

4.3 Idempotency

A UNIQUE (source_service, idempotency_key) constraint on requests deduplicates producer retries. For fan-out, a UNIQUE (request_id, channel) constraint on deliveries prevents duplicate channel sends when the router retries.
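
A minimal sketch of how the (request_id, channel) constraint makes router retries harmless. It uses an in-memory SQLite table with a reduced column set so that it runs anywhere; on PostgreSQL the equivalent statement would be INSERT ... ON CONFLICT DO NOTHING. The enqueue_delivery helper is hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE notification_deliveries (
        id         TEXT PRIMARY KEY,
        request_id TEXT NOT NULL,
        channel    TEXT NOT NULL,
        status     TEXT NOT NULL,
        UNIQUE (request_id, channel)
    )
""")


def enqueue_delivery(conn, request_id: str, channel: str) -> bool:
    """Insert a delivery row once; return True only for the first insert."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO notification_deliveries (id, request_id, channel, status) "
        "VALUES (?, ?, ?, 'pending')",
        (str(uuid.uuid4()), request_id, channel),
    )
    conn.commit()
    # rowcount is 0 when the unique constraint caused the insert to be ignored.
    return cur.rowcount == 1


first = enqueue_delivery(conn, "req_1", "push")   # new row created
second = enqueue_delivery(conn, "req_1", "push")  # router retry, ignored
print(first, second)
```

The router would enqueue the channel job only when the insert reports a new row, so a redelivered Kafka message does not produce a second send.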

5. API design

5.1 Send notification

POST /v1/notifications

Headers: Idempotency-Key: ord_ship_991

{
  "user_id": "usr_abc",
  "template_key": "order.shipped",
  "payload": {
    "order_id": "ord_991",
    "tracking_url": "https://track.example/991"
  },
  "channels": ["push", "email", "in_app"],
  "priority": "normal",
  "scheduled_at": null
}

202 Accepted

{
  "notification_id": "ntf_7k2",
  "status": "accepted"
}

5.2 Get user feed (in-app)

GET /v1/users/{user_id}/notifications?cursor=...

Returns paginated feed items with read/unread flag.

5.3 Update preferences

PATCH /v1/users/{user_id}/notification-preferences

{
  "email_enabled": true,
  "marketing_enabled": false,
  "quiet_hours_start": "22:00",
  "quiet_hours_end": "08:00",
  "timezone": "Asia/Kolkata"
}

5.4 Register device token

PUT /v1/users/{user_id}/devices

{
  "platform": "android",
  "token": "fcm_device_token_xyz"
}

6. Diving deep into key components

6.1 Priority queues

Do not let marketing blasts starve OTP emails.

Priority	Examples	Queue treatment
critical	OTP, password reset, fraud alert	Dedicated topic + workers; no marketing backpressure
normal	Order updates, mentions	Standard topic; fair scheduling
marketing	Promotions, newsletters	Low-priority topic; rate-limited; defer during overload
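
One way to express the table above in code is a fixed priority-to-topic mapping plus a shedding rule that only ever defers marketing traffic. The topic names and the 60-second lag threshold are hypothetical values chosen for the sketch.

```python
# Hypothetical topic names; one topic per priority class.
TOPIC_BY_PRIORITY = {
    "critical": "notifications.critical",
    "normal": "notifications.normal",
    "marketing": "notifications.marketing",
}


def topic_for(priority: str) -> str:
    """Return the topic for a priority, rejecting unknown values early."""
    try:
        return TOPIC_BY_PRIORITY[priority]
    except KeyError:
        raise ValueError(f"unknown priority: {priority!r}") from None


def should_defer(priority: str, queue_lag_seconds: float,
                 max_lag_seconds: float = 60.0) -> bool:
    """Defer only marketing traffic when the pipeline is overloaded."""
    return priority == "marketing" and queue_lag_seconds > max_lag_seconds


print(topic_for("critical"), should_defer("marketing", queue_lag_seconds=120.0))
```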

6.2 Fan-out strategies

Pattern	When	Tradeoff
Per-user on ingest	Single user notification	Simple; one Kafka message
Fan-out worker	Notify 10M users in segment	Stream segment IDs; avoid 10M API calls from producer
Hybrid	Large broadcast	Batch enqueue with rate limiter token bucket
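
The token bucket mentioned for the hybrid pattern can be sketched as follows. This is a single-process illustration with an injectable clock for testing; a real deployment would likely need a shared limiter (for example in Redis) so that all fan-out workers draw from the same budget.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        """Take n tokens if available; return False without blocking otherwise."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


bucket = TokenBucket(rate_per_sec=1000, capacity=500)
print(bucket.try_acquire(500), bucket.try_acquire(1))
```

A fan-out job would call try_acquire with the batch size before each enqueue and sleep briefly when it returns False.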

6.3 Push delivery (FCM / APNs)

FCM — multicast up to 500 tokens per call; on an invalid-token response, mark the device invalid.
APNs — HTTP/2; one device token per iOS device; set apns-collapse-id to replace stale notifications.
Payload size — keep under 4 KB; put the deep link in the data payload.
Silent push — background refresh; uses different priority headers (on APNs, apns-push-type: background with apns-priority: 5).
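
To respect the 500-token multicast limit, a push worker can split a user's or segment's tokens into batches before calling the provider. The helper below is a small sketch; the constant reflects the limit stated in this guide, and the actual provider call is left out.

```python
from typing import Iterator, List, Sequence

FCM_MAX_TOKENS_PER_MULTICAST = 500  # limit as stated in this guide


def chunk_tokens(tokens: Sequence[str],
                 size: int = FCM_MAX_TOKENS_PER_MULTICAST) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(tokens), size):
        yield list(tokens[start:start + size])


batches = list(chunk_tokens([f"token_{i}" for i in range(1201)]))
print([len(b) for b in batches])
```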

6.4 Email delivery

Use ESP (SES, SendGrid) with DKIM/SPF configured.
Separate transactional vs marketing IP pools (reputation isolation).
Include List-Unsubscribe header for marketing.
Process bounces/complaints via webhook → suppress email for user.

6.5 SMS delivery

Expensive—reserve for critical alerts and markets that need SMS.
Strict opt-in; template pre-registration in some countries.
Low TPS per sender ID—queue with dedicated rate limiter.

6.6 Scheduling and quiet hours

If scheduled_at is in the future, write the job to a persistent delayed queue (a Redis sorted set scored by timestamp, or delay topics on Kafka, which has no native delayed delivery). The router re-checks quiet hours at fire time, so a marketing push for a user in Mumbai may be deferred until 9am local time.
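
The quiet-hours check at fire time has to handle windows that wrap past midnight (for example 22:00 to 08:00) and must be evaluated in the user's timezone. The sketch below takes a tzinfo object; in practice it would probably be built from the stored IANA name with zoneinfo.ZoneInfo, while the example uses a fixed +05:30 offset so that it runs without timezone data. Both function names are illustrative, and DST transitions are not handled specially.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, start: time, end: time, tz: tzinfo) -> bool:
    """True if the user's local time falls inside [start, end). now_utc must be tz-aware."""
    local = now_utc.astimezone(tz).time()
    if start <= end:
        return start <= local < end
    # Window wraps past midnight, e.g. 22:00 to 08:00.
    return local >= start or local < end


def next_allowed_send(now_utc: datetime, start: time, end: time, tz: tzinfo) -> datetime:
    """Return now_utc if sending is allowed, else the end of the quiet window in UTC."""
    if not in_quiet_hours(now_utc, start, end, tz):
        return now_utc
    local_now = now_utc.astimezone(tz)
    candidate = local_now.replace(hour=end.hour, minute=end.minute,
                                  second=0, microsecond=0)
    if candidate <= local_now:
        candidate += timedelta(days=1)  # window ends tomorrow in local time
    return candidate.astimezone(timezone.utc)


IST = timezone(timedelta(hours=5, minutes=30))  # fixed offset standing in for Asia/Kolkata
now = datetime(2024, 1, 1, 20, 30, tzinfo=timezone.utc)  # 02:00 local on Jan 2
print(in_quiet_hours(now, time(22, 0), time(8, 0), IST))
print(next_allowed_send(now, time(22, 0), time(8, 0), IST))
```

Critical notifications would skip this check entirely; only non-critical jobs are rescheduled to the returned time.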

6.7 Retries and dead-letter queue (DLQ)

Transient failure (provider 5xx, 429, timeout) → retry with exponential backoff (max 5 attempts).
Permanent failure (invalid token, hard bounce) → mark failed; do not retry.
After max retries → DLQ for manual replay, or automatic replay once the provider is healthy.
Idempotent retry: reuse the same delivery_id; use the provider's idempotency mechanism if supported.
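
The retry rules above can be sketched as two small functions: one computing the backoff delay and one deciding what to do after a failed attempt. The jitter, the 1-second base, and the 60-second cap are assumptions added for the example; only the limit of 5 attempts comes from the text.

```python
import random

MAX_ATTEMPTS = 5  # from the retry policy above


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Full-jitter exponential backoff: a random delay below min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def next_action(attempt_count: int, error_kind: str) -> str:
    """Return 'retry', 'fail', or 'dlq' after a failed delivery attempt."""
    if error_kind == "permanent":      # invalid token, hard bounce
        return "fail"
    if attempt_count >= MAX_ATTEMPTS:  # transient, but retries are exhausted
        return "dlq"
    return "retry"


print(next_action(attempt_count=2, error_kind="transient"))
print(next_action(attempt_count=5, error_kind="transient"))
```

Jitter spreads retries out so that thousands of deliveries that failed together do not all hit the provider again at the same instant.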

6.8 Template rendering

Store Handlebars/Mustache templates per template_key + channel + locale. Validate the payload against the template's schema (JSON Schema) at ingest so that a missing field fails fast there, not at send time; the router then renders with the validated payload.
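
To illustrate failing fast on missing fields, the sketch below extracts {{name}} placeholders from a template and compares them with the payload keys. It is a simplified Mustache-like stand-in, not a real Handlebars/Mustache engine or JSON Schema validator: it supports only flat variable substitution and does no HTML escaping.

```python
import re

# Matches simple {{ name }} placeholders only (no sections, partials, or helpers).
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def required_fields(template: str) -> set:
    """Names of all placeholders used by the template."""
    return set(PLACEHOLDER.findall(template))


def missing_fields(template: str, payload: dict) -> set:
    """Placeholders that the payload does not provide; check this at ingest."""
    return required_fields(template) - set(payload)


def render(template: str, payload: dict) -> str:
    """Substitute placeholders, refusing to render if any field is missing."""
    missing = missing_fields(template, payload)
    if missing:
        raise KeyError(f"missing template fields: {sorted(missing)}")
    return PLACEHOLDER.sub(lambda m: str(payload[m.group(1)]), template)


body_tpl = "Order {{order_id}} shipped: {{ tracking_url }}"
print(missing_fields(body_tpl, {"order_id": "ord_991"}))
print(render(body_tpl, {"order_id": "ord_991",
                        "tracking_url": "https://track.example/991"}))
```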

7. Failure points

Failure points are architectural locations where a fault creates lost messages, duplicates, wrong channel delivery, or unbounded backlog. Design assuming every point fails independently.

#	Failure point	What breaks	Detection	Mitigation design
FP1	Producer → Notification API	Timeout; producer retries with new key	Duplicate order-shipped emails	Idempotency-Key per business event
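FP2	API → DB persist	DB write fails or times out	Producer gets 5xx; event is lost if 202 was returned before commit	Return 202 only after commit; transactional outbox (request row + outbox row in one DB transaction, relay publishes to Kafka)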
FP3	API → Kafka publish	Message not in bus	Request row stuck `accepted` never routed	Outbox poller; reconciliation job scans stuck requests
FP4	Router consumer	Crash after enqueue push but before commit offset	Duplicate push on redelivery	Unique (request_id, channel); idempotent router
FP5	Preference / template lookup	Stale cache shows opted-in	User complaint; compliance risk	Short TTL + version on prefs; critical path reads primary
FP6	Channel worker → provider	FCM timeout; unknown if sent	Status stuck `pending`	Query provider receipt API; mark ambiguous; safe retry rules
FP7	Provider → device	APNs token invalid	High failure rate for user	Invalidate token; fallback to email if policy allows
FP8	Provider webhook ingress	Bounce webhook lost	Send to dead addresses repeatedly	Webhook retries; sync suppression list job
FP9	Scheduler / delayed queue	Clock skew; lost delayed jobs	Reminder never sent	Persistent scheduled table + scanner; not only memory queue
FP10	Broadcast fan-out	Segment job overwhelms cluster	Critical OTP lag hours	Isolate marketing topic; rate limit; autoscale workers

flowchart LR
  P[Producer] -->|FP1| API[Notification API]
  API -->|FP2| DB[(DB)]
  API -->|FP3| K[Kafka]
  K -->|FP4| R[Router]
  R -->|FP5| PREF[Preferences]
  R --> W[Channel workers]
  W -->|FP6| PR[FCM / SES / SMS]
  PR -->|FP7| DEV[User device]
  PR -->|FP8| WH[Provider webhooks]
  SCH[Scheduler] -->|FP9| K
  MKT[Marketing fan-out] -->|FP10| K

8. Failure modes

Failure modes describe recurring failure patterns across those points—what operators see, why it happens, and the safe system response.

8.1 Duplicate notification (at-least-once everywhere)

Symptom: User receives three “order shipped” pushes.

Cause: Producer retry without idempotency; router redelivery; worker retry without dedup key.

Safe response: Idempotency on request; unique delivery row per channel; FCM collapse key for same logical event.

8.2 Lost notification (silent drop)

Symptom: User never got OTP; logs show accepted then nothing.

Cause: FP2/FP3 — not persisted or not published to Kafka.

Safe response: Outbox pattern; alert on requests that stay in accepted for more than 5 min without a delivery row; reconciliation scanner.
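
A minimal sketch of the outbox pattern, using in-memory SQLite and a reduced schema so that it runs standalone. The request row and the outbox row are written in one database transaction, and a separate relay publishes unpublished rows to the bus. The table layout, function names, and the publish callback are illustrative assumptions.

```python
import json
import sqlite3


def init(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE notification_requests (
            id TEXT PRIMARY KEY, payload TEXT NOT NULL, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            request_id TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0);
    """)


def accept(conn: sqlite3.Connection, request_id: str, payload: dict) -> None:
    """Persist the request and its outbox row atomically."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute("INSERT INTO notification_requests VALUES (?, ?, 'accepted')",
                     (request_id, json.dumps(payload)))
        conn.execute("INSERT INTO outbox (request_id) VALUES (?)", (request_id,))


def relay_once(conn: sqlite3.Connection, publish) -> int:
    """Publish pending outbox rows; mark each row only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, request_id FROM outbox WHERE published = 0 ORDER BY id").fetchall()
    for outbox_id, request_id in rows:
        publish(request_id)  # if this raises, the row stays pending and is retried later
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (outbox_id,))
    return len(rows)


db = sqlite3.connect(":memory:")
init(db)
accept(db, "req_1", {"order_id": "ord_991"})
sent = []
print(relay_once(db, sent.append), sent)
```

Because a crash between publish and the UPDATE causes a republish, the relay is at-least-once, which is why the router still needs the dedup constraints from section 4.3.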

8.3 Poison message (bad payload)

Symptom: Consumer stuck retrying same message; lag grows.

Cause: Invalid template variable; schema drift.

Safe response: Validate at API; DLQ after N failures; schema registry for templates.

8.4 Provider throttle (rate limit)

Symptom: Mass 429 from SES; delivery delay hours.

Cause: FP10 marketing blast exceeds TPS quota.

Safe response: Token bucket per provider; shed marketing first; request quota increase; shard across multiple ESP sub-accounts carefully (compliance).

8.5 Priority inversion

Symptom: OTP delayed while a marketing campaign is sending.

Cause: Shared worker pool saturated by campaign.

Safe response: Separate topics and worker deployments per priority; weighted fair queuing (WFQ) where a pool must be shared.

8.6 Stale device token

Symptom: Push always fails for user; no fallback.

Cause: FP7 — user reinstalled app; old token not updated.

Safe response: Invalidate on permanent error; prompt re-register token; optional email fallback for critical.

8.7 Quiet hours violation

Symptom: Marketing push at 2am local.

Cause: FP5 — cached prefs; scheduled job used UTC not user TZ.

Safe response: Re-evaluate quiet hours at send time in user timezone; defer non-critical.

8.8 Ambiguous provider timeout

Symptom: Unknown if email sent; retry may duplicate.

Cause: FP6 — HTTP timeout to SendGrid.

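Safe response: Mark status unknown; before retrying, query the provider by your own delivery_id where it supports custom arguments or tags (no provider message id came back on a timeout); use an idempotent send API if available.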

8.9 In-app feed out of sync

Symptom: Push received but inbox empty.

Cause: In-app worker failed while push succeeded.

Safe response: Independent delivery rows per channel; partial success is valid state; UI merges channels.

8.10 DLQ pile-up (cascading backlog)

Symptom: Millions in DLQ after provider outage.

Cause: All workers failed together; replay floods on recovery.

Safe response: Gradual DLQ replay with rate limit; priority order; extend retention on Kafka.
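
Gradual, priority-ordered replay could be sketched as below: DLQ items are sorted so critical messages go first, then handed out in fixed-size batches. The item shape (a dict with a priority field) and the function name are assumptions; the caller is expected to pause or consult a rate limiter between batches.

```python
PRIORITY_ORDER = {"critical": 0, "normal": 1, "marketing": 2}


def replay_batches(dlq_items, batch_size: int):
    """Yield DLQ items in priority order, at most batch_size per batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # sorted() is stable, so items keep their original order within a priority.
    ordered = sorted(dlq_items, key=lambda item: PRIORITY_ORDER[item["priority"]])
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


dlq = [
    {"delivery_id": "d1", "priority": "marketing"},
    {"delivery_id": "d2", "priority": "critical"},
    {"delivery_id": "d3", "priority": "normal"},
    {"delivery_id": "d4", "priority": "critical"},
]
for batch in replay_batches(dlq, batch_size=2):
    print([item["delivery_id"] for item in batch])
```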

Failure mode	Primary failure points	User impact	Core mitigation
Duplicate notification	FP1, FP4, FP6	Spam	Idempotency + dedup keys
Lost notification	FP2, FP3	Missed alert	Outbox + reconciliation
Poison message	FP4	Backlog for all	DLQ + schema validation
Provider throttle	FP6, FP10	Delay	Rate limits + priority queues
Priority inversion	FP10	OTP late	Isolated critical path
Stale device token	FP7	No push	Token hygiene + fallback
Quiet hours violation	FP5, FP9	Trust loss	TZ-aware scheduler
Ambiguous provider timeout	FP6	Duplicate or miss	Status probe before retry
In-app feed out of sync	FP4, FP7	Confusing UX	Per-channel status
DLQ pile-up	FP6	Delayed catch-up	Throttled replay

9. Scalability, availability, and security

9.1 Scalability

Horizontally scale stateless API and channel workers.
Partition Kafka by user_id for per-user ordering (all channels for one user serialized if needed).
Shard notification store by user_id or time.
Batch push/email API calls to respect provider limits.
Precompute segments for marketing offline (warehouse → object store → fan-out job).

9.2 Availability

Multi-AZ Kafka and DB; cross-region read replicas for feed API.
Degrade marketing under load; never drop critical topic consumers.
Provider failover: secondary ESP for email if primary region down (advanced).
Graceful API response: return 202 Accepted once the request is durably persisted; delivery is async, so don't block the producer on FCM latency.

9.3 Security and compliance

mTLS or signed internal tokens on Notification API.
No PII in Kafka payloads if possible: send user_id and resolve the email address or phone number inside the channel worker.
Encrypt device tokens at rest.
Audit log for preference changes and marketing sends.
Honor unsubscribe and GDPR erasure—delete feed + suppress address.

10. Tradeoffs recap

Decision	Common choice	Why
Pull vs push model	Push to bus (async)	Decouple producer latency from FCM/SES
Consistency	Eventual delivery OK	Not a financial ledger; favor availability
Dedup	Business idempotency key	Cheaper than exactly-once Kafka
Feed storage	Cassandra for inbox	High write/read per user
Exactly-once vs at-least-once	At-least-once + idempotent consumers	Simpler ops

11. How to present this in 45 minutes

5 min — requirements; channels; transactional vs marketing priority.
7 min — capacity: 200M/day, ~11.5k peak/sec, channel split.
8 min — diagram: API → Kafka → router → workers → providers.
8 min — schema, templates, preferences, idempotency.
10 min — failure points + top failure modes (duplicate, lost, throttle, priority inversion).
7 min — retries, DLQ, scheduling, tradeoffs.

The one line to remember

A notification system is an async fan-out pipeline: accept events durably, route with preferences and templates, deliver per channel with retries—and use idempotency everywhere so at-least-once infrastructure does not become at-least-three-times spam for the user.