sharpbyte.dev

Design a notification system

A notification system delivers timely messages to users across channels (push, email, SMS, and in-app) when something happens in your product: an order ships, a password is reset, a friend mentions you, or a marketing campaign launches. Unlike a simple “send email” script, a production system must handle millions of recipients, respect user preferences, survive provider outages, and avoid duplicate or lost messages when everything retries at once.

This guide walks through the full interview arc (requirements, capacity, architecture, storage, APIs, channel adapters) and dedicates sections to failure points and failure modes, at the same depth as the URL shortener and payment system guides on this site.

Design prompt

Design a notification platform that accepts events from product services and delivers messages to users on their preferred channels.

Support high throughput, retries, scheduling, and at-least-once ingestion without duplicate user-visible spam.

What you should be able to do after reading:

1. Requirements gathering

1.1 Functional requirements

Usually out of scope unless asked: building your own SMTP server, in-app chat, full campaign A/B analytics platform, WhatsApp Business API nuances.

1.2 Non-functional requirements

Assumptions for capacity math: 10M daily active users (DAU); 20 notifications per user per day average; 30% push, 50% email, 15% in-app, 5% SMS; peak burst 5× average during campaigns; 3 channels max per logical event after routing.

2. Capacity estimation

2.1 Notification volume

DAU = 10,000,000
Notifications per user per day = 20
Total notifications per day = 10M × 20 = 200,000,000

Average per second = 200M / 86,400 ≈ 2,300/sec
Peak (5×) ≈ 11,500/sec

Each logical event may fan out to multiple channel jobs—plan queue consumers for peak channel messages, not just ingest events.

2.2 Channel breakdown (daily)

Push:   200M × 0.30 = 60M/day  ≈ 694/sec avg
Email:  200M × 0.50 = 100M/day ≈ 1,157/sec avg
In-app: 200M × 0.15 = 30M/day  ≈ 347/sec avg
SMS:    200M × 0.05 = 10M/day  ≈ 116/sec avg

Email and push workers scale independently. SMS is low volume but high cost—strict rate limits.
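The arithmetic in 2.1 and 2.2 can be reproduced in a few lines. This is a back-of-envelope sketch only: the inputs are the assumptions from section 1.2, and the variable names are illustrative.

```python
# Back-of-envelope volume math using the assumptions from section 1.2.
DAU = 10_000_000
PER_USER_PER_DAY = 20
PEAK_FACTOR = 5
SECONDS_PER_DAY = 86_400
CHANNEL_SHARE = {"push": 0.30, "email": 0.50, "in_app": 0.15, "sms": 0.05}

total_per_day = DAU * PER_USER_PER_DAY
avg_per_sec = total_per_day / SECONDS_PER_DAY
peak_per_sec = avg_per_sec * PEAK_FACTOR

# Daily volume and average rate for each channel.
channel_per_day = {ch: total_per_day * share for ch, share in CHANNEL_SHARE.items()}
channel_per_sec = {ch: n / SECONDS_PER_DAY for ch, n in channel_per_day.items()}

for ch in CHANNEL_SHARE:
    print(f"{ch:7s} {channel_per_day[ch]:>13,.0f}/day  {channel_per_sec[ch]:>7,.0f}/sec avg")
print(f"total   {total_per_day:>13,}/day  {avg_per_sec:>7,.0f}/sec avg  {peak_per_sec:,.0f}/sec peak")
```

Changing PEAK_FACTOR or a channel share shows quickly which worker pool a given campaign would stress first.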

2.3 Storage

Per notification record (metadata + status history):

| Field group | Size (approx.) |
| --- | --- |
| IDs, user, template, channel, status | ~200 bytes |
| Payload / rendered body (truncated in DB) | ~500 bytes |
| Timestamps, provider message id | ~100 bytes |

Per notification ≈ 800 bytes
200M/day × 365 × 800 B ≈ 58 TB/year raw (order of magnitude)

Retention policy: keep 90 days hot in OLTP (about 200M × 90 × 800 B ≈ 14.4 TB), then archive older records to object storage.
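As a quick check on the storage figures, the sketch below recomputes the raw yearly volume and the size of a 90-day hot window. It assumes decimal terabytes and the ~800-byte record from the estimate above; indexes, replication, and compression are deliberately ignored.

```python
# Storage estimate from the per-record sizes above (decimal TB, no indexes or replicas).
BYTES_PER_RECORD = 200 + 500 + 100   # metadata + truncated body + timestamps/provider id
RECORDS_PER_DAY = 200_000_000
TB = 10**12

def storage_tb(days: int) -> float:
    """Raw storage in TB for the given number of days of notifications."""
    return RECORDS_PER_DAY * days * BYTES_PER_RECORD / TB

yearly_tb = storage_tb(365)  # raw volume for one year
hot_tb = storage_tb(90)      # 90-day hot window kept in OLTP

print(f"yearly: {yearly_tb:.1f} TB, hot (90 days): {hot_tb:.1f} TB")
```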

2.4 Queue and bandwidth

2.5 Infrastructure sizing (starting point)

| Component | Initial sizing |
| --- | --- |
| Ingest API | 6–10 instances; stateless |
| Kafka cluster | Partition by priority + shard; 50+ partitions for parallelism |
| Push workers | Pool sized to FCM/APNs batch limits (~500–1000 tokens per batch) |
| Email workers | Pool sized to SES/SendGrid TPS quota |
| Scheduler | Cron + delayed queue (Redis ZSET or Kafka scheduled topics) |
| Metadata DB | PostgreSQL sharded by user_id or time partition |

3. High-level design

flowchart TB
  subgraph producers [Product services]
    O[Order service]
    A[Auth service]
    M[Marketing]
  end
  subgraph platform [Notification platform]
    API[Notification API]
    BUS[(Event bus)]
    ORCH[Router / orchestrator]
    TPL[Templates]
    PREF[Preferences]
    DEV[Device registry]
    WP[Push worker]
    WE[Email worker]
    WS[SMS worker]
    WI[In-app worker]
    SCH[Scheduler]
    DB[(Notification store)]
  end
  subgraph providers [External providers]
    FCM[FCM / APNs]
    SES[Email provider]
    TW[SMS provider]
  end
  O --> API
  A --> API
  M --> API
  API --> BUS
  BUS --> ORCH
  ORCH --> TPL
  ORCH --> PREF
  ORCH --> DEV
  ORCH --> WP
  ORCH --> WE
  ORCH --> WS
  ORCH --> WI
  SCH --> BUS
  WP --> FCM
  WE --> SES
  WS --> TW
  WP --> DB
  WE --> DB
  WS --> DB
  WI --> DB
    

End-to-end flow

  1. Order service calls POST /v1/notifications with user_id, template_key, payload, and an Idempotency-Key header.
  2. API validates, persists notification_request, publishes to Kafka topic by priority.
  3. Router consumes: load preferences → skip opted-out channels → render templates → enqueue channel-specific messages.
  4. Push worker batches device tokens → FCM/APNs → update status sent / failed.
  5. In-app worker writes to user’s notification feed (DB or Cassandra).
  6. Provider webhooks (email bounce, push failure) update device registry and status.
sequenceDiagram
  participant P as Producer service
  participant API as Notification API
  participant K as Kafka
  participant R as Router
  participant W as Push worker
  participant F as FCM
  participant U as User device
  P->>API: POST notification idempotency_key
  API->>K: publish event
  API-->>P: 202 accepted notification_id
  K->>R: consume
  R->>K: enqueue push.job
  K->>W: consume push.job
  W->>F: send multicast
  F-->>U: push delivered
  W->>W: update status sent
    

4. Database design

4.1 Core tables

CREATE TABLE notification_requests (
  id                UUID PRIMARY KEY,
  idempotency_key   VARCHAR(128) NOT NULL,
  source_service    TEXT NOT NULL,
  template_key      TEXT NOT NULL,
  user_id           UUID NOT NULL,
  payload           JSONB NOT NULL,
  priority          TEXT NOT NULL DEFAULT 'normal',  -- critical | normal | marketing
  scheduled_at      TIMESTAMPTZ,
  status            TEXT NOT NULL DEFAULT 'accepted',
  created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (source_service, idempotency_key)
);

CREATE TABLE notification_deliveries (
  id                UUID PRIMARY KEY,
  request_id        UUID NOT NULL REFERENCES notification_requests(id),
  user_id           UUID NOT NULL,
  channel           TEXT NOT NULL,  -- push | email | sms | in_app
  rendered_subject  TEXT,
  rendered_body     TEXT,
  status            TEXT NOT NULL,  -- pending | sent | delivered | failed | skipped
  skip_reason       TEXT,           -- opted_out | quiet_hours | no_device
  provider          TEXT,
  provider_msg_id   TEXT,
  attempt_count     INT NOT NULL DEFAULT 0,
  last_error        TEXT,
  created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (request_id, channel)  -- dedup guard for router retries (see 4.3)
);

CREATE TABLE user_preferences (
  user_id           UUID PRIMARY KEY,
  push_enabled      BOOLEAN NOT NULL DEFAULT TRUE,
  email_enabled     BOOLEAN NOT NULL DEFAULT TRUE,
  sms_enabled       BOOLEAN NOT NULL DEFAULT FALSE,
  marketing_enabled BOOLEAN NOT NULL DEFAULT FALSE,
  quiet_hours_start TIME,
  quiet_hours_end   TIME,
  timezone          TEXT NOT NULL DEFAULT 'UTC',
  locale            TEXT NOT NULL DEFAULT 'en'
);

CREATE TABLE device_tokens (
  id          UUID PRIMARY KEY,
  user_id     UUID NOT NULL,
  platform    TEXT NOT NULL,  -- ios | android | web
  token       TEXT NOT NULL,
  is_valid    BOOLEAN NOT NULL DEFAULT TRUE,
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (token)
);

CREATE TABLE templates (
  id          UUID PRIMARY KEY,
  template_key TEXT NOT NULL,
  channel     TEXT NOT NULL,
  locale      TEXT NOT NULL DEFAULT 'en',
  version     INT NOT NULL,
  subject_tpl TEXT,
  body_tpl    TEXT NOT NULL,
  UNIQUE (template_key, channel, locale, version)
);

4.2 In-app feed (high volume)

For read-heavy notification centers, store feed items in Cassandra/DynamoDB, partitioned by user_id and sorted by created_at DESC. Keep the OLTP notification_deliveries table for audit, and write a lean duplicate feed row for the UI.

4.3 Idempotency

A unique constraint on (source_service, idempotency_key) deduplicates producer retries at the request level. For fan-out, a unique constraint on (request_id, channel) in notification_deliveries prevents duplicate channel sends when the router retries.
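To make the two uniqueness rules concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for PostgreSQL. The helper names (open_db, accept_request, create_delivery) are hypothetical, and the tables are trimmed to the columns that matter for dedup. SQLite's INSERT OR IGNORE plays the role that INSERT ... ON CONFLICT DO NOTHING would play in PostgreSQL.

```python
import sqlite3
import uuid

def open_db() -> sqlite3.Connection:
    """In-memory stand-in with only the columns needed to show deduplication."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE notification_requests (
            id TEXT PRIMARY KEY,
            source_service TEXT NOT NULL,
            idempotency_key TEXT NOT NULL,
            UNIQUE (source_service, idempotency_key)
        );
        CREATE TABLE notification_deliveries (
            id TEXT PRIMARY KEY,
            request_id TEXT NOT NULL,
            channel TEXT NOT NULL,
            UNIQUE (request_id, channel)
        );
    """)
    return db

def accept_request(db: sqlite3.Connection, source_service: str, idempotency_key: str) -> str:
    """Return the request id; a producer retry with the same key gets the original id."""
    with db:
        db.execute(
            "INSERT OR IGNORE INTO notification_requests VALUES (?, ?, ?)",
            (str(uuid.uuid4()), source_service, idempotency_key),
        )
    row = db.execute(
        "SELECT id FROM notification_requests"
        " WHERE source_service = ? AND idempotency_key = ?",
        (source_service, idempotency_key),
    ).fetchone()
    return row[0]

def create_delivery(db: sqlite3.Connection, request_id: str, channel: str) -> bool:
    """Return True only for the first (request_id, channel) insert."""
    with db:
        cur = db.execute(
            "INSERT OR IGNORE INTO notification_deliveries VALUES (?, ?, ?)",
            (str(uuid.uuid4()), request_id, channel),
        )
    return cur.rowcount == 1
```

A router that only enqueues a channel job when create_delivery returns True can be redelivered the same event without sending twice.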

5. API design

5.1 Send notification

POST /v1/notifications

Headers: Idempotency-Key: ord_ship_991

{
  "user_id": "usr_abc",
  "template_key": "order.shipped",
  "payload": {
    "order_id": "ord_991",
    "tracking_url": "https://track.example/991"
  },
  "channels": ["push", "email", "in_app"],
  "priority": "normal",
  "scheduled_at": null
}

202 Accepted

{
  "notification_id": "ntf_7k2",
  "status": "accepted"
}

5.2 Get user feed (in-app)

GET /v1/users/{user_id}/notifications?cursor=...

Returns paginated feed items with read/unread flag.

5.3 Update preferences

PATCH /v1/users/{user_id}/notification-preferences

{
  "email_enabled": true,
  "marketing_enabled": false,
  "quiet_hours_start": "22:00",
  "quiet_hours_end": "08:00",
  "timezone": "Asia/Kolkata"
}

5.4 Register device token

PUT /v1/users/{user_id}/devices

{
  "platform": "android",
  "token": "fcm_device_token_xyz"
}

6. Diving deep into key components

6.1 Priority queues

Do not let marketing blasts starve OTP emails.

| Priority | Examples | Queue treatment |
| --- | --- | --- |
| critical | OTP, password reset, fraud alert | Dedicated topic + workers; no marketing backpressure |
| normal | Order updates, mentions | Standard topic; fair scheduling |
| marketing | Promotions, newsletters | Low-priority topic; rate-limited; defer during overload |
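The isolation above comes down to two small decisions at publish time: which topic a request goes to, and whether it may be deferred. A minimal sketch follows; the topic names are hypothetical, since the design only requires that each priority gets its own topic.

```python
# Hypothetical topic names: the design only requires one topic per priority.
TOPIC_BY_PRIORITY = {
    "critical": "notifications.critical",
    "normal": "notifications.normal",
    "marketing": "notifications.marketing",
}

def topic_for(priority: str) -> str:
    """Map a request priority to its dedicated topic; reject unknown priorities."""
    try:
        return TOPIC_BY_PRIORITY[priority]
    except KeyError:
        raise ValueError(f"unknown priority: {priority!r}") from None

def should_defer(priority: str, overloaded: bool) -> bool:
    """Only marketing traffic is deferred when the platform is overloaded."""
    return overloaded and priority == "marketing"
```

Because critical traffic never shares a topic or a consumer group with marketing, a campaign backlog cannot sit in front of an OTP.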

6.2 Fan-out strategies

| Pattern | When | Tradeoff |
| --- | --- | --- |
| Per-user on ingest | Single user notification | Simple; one Kafka message |
| Fan-out worker | Notify 10M users in segment | Stream segment IDs; avoid 10M API calls from producer |
| Hybrid | Large broadcast | Batch enqueue with rate limiter token bucket |
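For the fan-out worker and hybrid patterns, the core move is to stream a segment's user IDs and enqueue them in batches rather than one message per user. A minimal sketch, assuming a hypothetical enqueue callback and a batch size of 1,000 (both are illustrative choices):

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

def batched(user_ids: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches without loading the whole segment into memory."""
    it = iter(user_ids)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def fan_out(user_ids: Iterable[str],
            enqueue: Callable[[List[str]], None],
            batch_size: int = 1000) -> int:
    """Enqueue one message per batch; return the number of messages enqueued."""
    count = 0
    for batch in batched(user_ids, batch_size):
        enqueue(batch)  # in a real system: publish to the marketing topic
        count += 1
    return count
```

A rate limiter can be placed around the enqueue call so that a large broadcast drains at a pace the downstream workers and providers can absorb.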

6.3 Push delivery (FCM / APNs)

6.4 Email delivery

6.5 SMS delivery

6.6 Scheduling and quiet hours

If scheduled_at is in the future, write the job to a delayed queue (a Redis sorted set scored by timestamp, or Kafka with timestamp indexing). The router re-checks quiet hours at fire time: for example, a marketing push for a user in Mumbai may be deferred until 9am local time.
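The quiet-hours check is easy to get wrong when the window wraps past midnight (22:00 to 08:00) or when the comparison is done in UTC. The sketch below shows one way to evaluate it in the user's timezone. The function names are illustrative, and it assumes the router applies this only to non-critical messages. In a real service the tz argument would typically be built from the stored preference, for example zoneinfo.ZoneInfo("Asia/Kolkata").

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

def in_quiet_hours(now_utc: datetime, start: time, end: time, tz: tzinfo) -> bool:
    """True if the user's local wall-clock time falls inside [start, end)."""
    local = now_utc.astimezone(tz).time()
    if start <= end:
        return start <= local < end
    # Window wraps past midnight, e.g. 22:00 to 08:00.
    return local >= start or local < end

def next_send_time(now_utc: datetime, start: time, end: time, tz: tzinfo) -> datetime:
    """Return now_utc if sending is allowed, else the end of the quiet window in UTC."""
    if not in_quiet_hours(now_utc, start, end, tz):
        return now_utc
    local_now = now_utc.astimezone(tz)
    candidate = local_now.replace(hour=end.hour, minute=end.minute,
                                  second=0, microsecond=0)
    if candidate <= local_now:
        # Quiet window ends tomorrow (we are in the evening part of the window).
        candidate += timedelta(days=1)
    return candidate.astimezone(timezone.utc)
```

With quiet hours of 22:00 to 08:00 in a UTC+05:30 zone, a job firing at 20:30 UTC (02:00 local) would be deferred to 02:30 UTC the next day, which is 08:00 local.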

6.7 Retries and dead-letter queue (DLQ)

  1. Transient provider 5xx → retry with exponential backoff (max 5 attempts).
  2. Permanent failure (invalid token, hard bounce) → mark failed; no infinite retry.
  3. After max retries → DLQ for manual replay or auto replay when provider healthy.
  4. Idempotent retry: same delivery_id; check provider idempotency if supported.
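The retry rules above can be sketched as two small functions: one that picks a delay, one that decides the next action. The error codes, the 1-second base, and the 60-second cap are assumptions for illustration; only the limit of 5 attempts comes from the policy above. Full jitter is used so that workers recovering together do not retry in lockstep.

```python
import random

MAX_ATTEMPTS = 5  # from the retry policy above
PERMANENT_ERRORS = {"invalid_token", "hard_bounce"}  # illustrative error codes

def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter; attempt numbering starts at 1."""
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))

def next_action(error_code: str, attempt_count: int) -> str:
    """Decide what to do after a failed attempt: 'fail', 'retry', or 'dlq'."""
    if error_code in PERMANENT_ERRORS:
        return "fail"   # permanent failure: never retry
    if attempt_count >= MAX_ATTEMPTS:
        return "dlq"    # retries exhausted: park for replay
    return "retry"
```

The worker would keep the same delivery_id across every retry so that the provider, where it supports idempotency, can drop repeats.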

6.8 Template rendering

Store Handlebars/Mustache templates per template_key + channel + locale. Render in router with validated payload schema (JSON Schema). Missing field → fail fast at ingest, not at send time.
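The fail-fast rule can be illustrated with a small sketch. Python's string.Template is used here purely as a stand-in for Handlebars/Mustache, and a set of required field names stands in for a JSON Schema. The registry contents and function names are hypothetical.

```python
from string import Template
from typing import Mapping

# Hypothetical registry keyed by (template_key, channel, locale).
TEMPLATES = {
    ("order.shipped", "push", "en"): {
        "required": {"order_id", "tracking_url"},
        "body": Template("Order $order_id has shipped. Track it: $tracking_url"),
    },
}

def validate_payload(template_key: str, channel: str, locale: str,
                     payload: Mapping[str, str]) -> None:
    """Raise ValueError if a required field is missing (intended to run at ingest)."""
    required = TEMPLATES[(template_key, channel, locale)]["required"]
    missing = sorted(required - set(payload))
    if missing:
        raise ValueError(f"missing payload fields: {missing}")

def render(template_key: str, channel: str, locale: str,
           payload: Mapping[str, str]) -> str:
    """Render the body for one channel and locale after validating the payload."""
    validate_payload(template_key, channel, locale, payload)
    return TEMPLATES[(template_key, channel, locale)]["body"].substitute(payload)
```

Calling validate_payload in the API handler means a bad payload is rejected to the producer immediately instead of surfacing later as a poison message in the router.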

7. Failure points

Failure points are architectural locations where a fault creates lost messages, duplicates, wrong channel delivery, or unbounded backlog. Design assuming every point fails independently.

| # | Failure point | What breaks | Detection | Mitigation design |
| --- | --- | --- | --- | --- |
| FP1 | Producer → Notification API | Timeout; producer retries with new key | Duplicate order-shipped emails | Idempotency-Key per business event |
| FP2 | API → DB persist | DB down after accept | Producer got 500; event lost | Transactional outbox: DB + Kafka in one TX or outbox relay |
| FP3 | API → Kafka publish | Message not in bus | Request row stuck accepted, never routed | Outbox poller; reconciliation job scans stuck requests |
| FP4 | Router consumer | Crash after enqueue push but before commit offset | Duplicate push on redelivery | Unique (request_id, channel); idempotent router |
| FP5 | Preference / template lookup | Stale cache shows opted-in | User complaint; compliance risk | Short TTL + version on prefs; critical path reads primary |
| FP6 | Channel worker → provider | FCM timeout; unknown if sent | Status stuck pending | Query provider receipt API; mark ambiguous; safe retry rules |
| FP7 | Provider → device | APNs token invalid | High failure rate for user | Invalidate token; fallback to email if policy allows |
| FP8 | Provider webhook ingress | Bounce webhook lost | Send to dead addresses repeatedly | Webhook retries; sync suppression list job |
| FP9 | Scheduler / delayed queue | Clock skew; lost delayed jobs | Reminder never sent | Persistent scheduled table + scanner; not only memory queue |
| FP10 | Broadcast fan-out | Segment job overwhelms cluster | Critical OTP lags by hours | Isolate marketing topic; rate limit; autoscale workers |
flowchart LR
  P[Producer] -->|FP1| API[Notification API]
  API -->|FP2| DB[(DB)]
  API -->|FP3| K[Kafka]
  K -->|FP4| R[Router]
  R -->|FP5| PREF[Preferences]
  R --> W[Channel workers]
  W -->|FP6| PR[FCM / SES / SMS]
  PR -->|FP7| DEV[User device]
  PR -->|FP8| WH[Provider webhooks]
  SCH[Scheduler] -->|FP9| K
  MKT[Marketing fan-out] -->|FP10| K
    

8. Failure modes

Failure modes describe recurring failure patterns across those points—what operators see, why it happens, and the safe system response.

8.1 Duplicate notification (at-least-once everywhere)

Symptom: User receives three “order shipped” pushes.

Cause: Producer retry without idempotency; router redelivery; worker retry without dedup key.

Safe response: Idempotency on request; unique delivery row per channel; FCM collapse key for same logical event.

8.2 Lost notification (silent drop)

Symptom: User never got OTP; logs show accepted then nothing.

Cause: FP2/FP3 — not persisted or not published to Kafka.

Safe response: Outbox pattern; alert on requests accepted > 5 min without delivery row; reconciliation scanner.
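A minimal sketch of the outbox pattern, again with in-memory SQLite standing in for the metadata DB. The table layout and function names are assumptions. The request row and its outbox row commit in one transaction, and a separate relay publishes pending rows and marks them afterwards.

```python
import json
import sqlite3
from typing import Callable

def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE notification_requests (id TEXT PRIMARY KEY, payload TEXT NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            request_id TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)
    return db

def accept(db: sqlite3.Connection, request_id: str, payload: dict) -> None:
    """Persist the request and its outbox row in one transaction."""
    with db:  # commits both inserts together, or rolls both back on error
        db.execute("INSERT INTO notification_requests VALUES (?, ?)",
                   (request_id, json.dumps(payload)))
        db.execute("INSERT INTO outbox (request_id) VALUES (?)", (request_id,))

def relay_once(db: sqlite3.Connection, publish: Callable[[str], None]) -> int:
    """Publish pending outbox rows in order; return how many were published."""
    rows = db.execute(
        "SELECT seq, request_id FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, request_id in rows:
        publish(request_id)  # may raise; the row then stays pending for the next run
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

If the relay crashes after publish but before the UPDATE, the row is published again on the next run. That is at-least-once delivery into the bus, which is why the router still needs to be idempotent.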

8.3 Poison message (bad payload)

Symptom: Consumer stuck retrying same message; lag grows.

Cause: Invalid template variable; schema drift.

Safe response: Validate at API; DLQ after N failures; schema registry for templates.

8.4 Provider throttle (rate limit)

Symptom: Mass 429 responses from SES; deliveries delayed by hours.

Cause: FP10 marketing blast exceeds TPS quota.

Safe response: Token bucket per provider; shed marketing first; request quota increase; shard across multiple ESP sub-accounts carefully (compliance).
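A per-provider token bucket can be sketched as follows. The class and parameter names are illustrative, and the injectable clock is there only to keep the example testable. A worker would call try_acquire before each provider request and requeue or delay the message when it returns False.

```python
import time
from typing import Callable

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity       # start full
        self.updated = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        """Take n tokens if available; never blocks."""
        now = self.clock()
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

Keeping one bucket per provider (and, if needed, a smaller one for marketing traffic) lets the platform shed promotional sends first while transactional mail keeps flowing.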

8.5 Priority inversion

Symptom: OTP delayed; marketing emails went out fine earlier.

Cause: Shared worker pool saturated by campaign.

Safe response: Separate topics and worker deployments per priority; weighted fair queuing (WFQ) weights.

8.6 Stale device token

Symptom: Push always fails for user; no fallback.

Cause: FP7 — user reinstalled app; old token not updated.

Safe response: Invalidate on permanent error; prompt re-register token; optional email fallback for critical.

8.7 Quiet hours violation

Symptom: Marketing push at 2am local.

Cause: FP5 — cached prefs; scheduled job used UTC not user TZ.

Safe response: Re-evaluate quiet hours at send time in user timezone; defer non-critical.

8.8 Ambiguous provider timeout

Symptom: Unknown if email sent; retry may duplicate.

Cause: FP6 — HTTP timeout to SendGrid.

Safe response: Status unknown; query provider by message id before retry; idempotent send API if available.
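The probe-before-retry rule might look like the sketch below. The Provider interface is entirely hypothetical: it assumes the provider lets you tag a send with your own delivery_id and look that reference up later, which not every provider supports.

```python
from typing import Optional, Protocol

class Provider(Protocol):
    """Hypothetical provider client; real providers differ in what they expose."""

    def send(self, delivery_id: str, body: str) -> str:
        """Send a message tagged with delivery_id; return the provider message id."""
        ...

    def lookup(self, delivery_id: str) -> Optional[str]:
        """Return the provider message id if a send with this delivery_id landed."""
        ...

def send_with_probe(provider: Provider, delivery_id: str, body: str) -> str:
    """Send only if no earlier attempt for this delivery_id reached the provider."""
    existing = provider.lookup(delivery_id)
    if existing is not None:
        return existing  # an earlier attempt landed; do not send again
    return provider.send(delivery_id, body)
```

When the provider offers no lookup at all, the fallback is to record the status as unknown and apply a per-channel policy, for example retrying a critical OTP but not a marketing email.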

8.9 In-app feed out of sync

Symptom: Push received but inbox empty.

Cause: In-app worker failed while push succeeded.

Safe response: Independent delivery rows per channel; partial success is valid state; UI merges channels.

8.10 DLQ pile-up (cascading backlog)

Symptom: Millions in DLQ after provider outage.

Cause: All workers failed together; replay floods on recovery.

Safe response: Gradual DLQ replay with rate limit; priority order; extend retention on Kafka.
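A gradual, priority-ordered replay can be sketched in a few lines. This version sorts an in-memory list, which only works for illustration; with millions of messages the same ordering would come from replaying per-priority DLQ topics one after another. The message shape (a dict with a "priority" field) is an assumption.

```python
from typing import Dict, Iterator, List

PRIORITY_ORDER = {"critical": 0, "normal": 1, "marketing": 2}

def replay_batches(dlq_items: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield DLQ items in priority order, batch_size at a time."""
    # sorted() is stable, so items keep their original order within a priority.
    ordered = sorted(dlq_items, key=lambda m: PRIORITY_ORDER[m["priority"]])
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]
```

The caller would pause or consult a rate limiter between batches so that the replay does not recreate the overload that filled the DLQ.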

| Failure mode | Primary failure points | User impact | Core mitigation |
| --- | --- | --- | --- |
| Duplicate notification | FP1, FP4, FP6 | Spam | Idempotency + dedup keys |
| Lost notification | FP2, FP3 | Missed alert | Outbox + reconciliation |
| Poison message | FP4 | Backlog for all | DLQ + schema validation |
| Provider throttle | FP6, FP10 | Delay | Rate limits + priority queues |
| Priority inversion | FP10 | OTP late | Isolated critical path |
| Stale device token | FP7 | No push | Token hygiene + fallback |
| Quiet hours violation | FP5, FP9 | Trust loss | TZ-aware scheduler |
| Ambiguous timeout | FP6 | Duplicate or miss | Status probe before retry |
| Partial channel success | FP4, FP7 | Confusing UX | Per-channel status |
| DLQ pile-up | FP6 | Delayed catch-up | Throttled replay |

9. Scalability, availability, and security

9.1 Scalability

9.2 Availability

9.3 Security and compliance

10. Tradeoffs recap

| Decision | Common choice | Why |
| --- | --- | --- |
| Pull vs push model | Push to bus (async) | Decouple producer latency from FCM/SES |
| Consistency | Eventual delivery OK | Not financial ledger; favor availability |
| Dedup | Business idempotency key | Cheaper than exactly-once Kafka |
| Feed storage | Cassandra for inbox | High write/read per user |
| Exactly-once vs at-least-once | At-least-once + idempotent consumers | Simpler ops |

11. How to present this in 45 minutes

  1. 5 min — requirements; channels; transactional vs marketing priority.
  2. 7 min — capacity: 200M/day, ~11.5k peak/sec, channel split.
  3. 8 min — diagram: API → Kafka → router → workers → providers.
  4. 8 min — schema, templates, preferences, idempotency.
  5. 10 min — failure points + top failure modes (duplicate, lost, throttle, priority inversion).
  6. 7 min — retries, DLQ, scheduling, tradeoffs.

The one line to remember

A notification system is an async fan-out pipeline: accept events durably, route with preferences and templates, deliver per channel with retries—and use idempotency everywhere so at-least-once infrastructure does not become at-least-three-times spam for the user.