Design a notification system
A notification system delivers timely messages to users across channels—push, email, SMS, and in-app—when something happens in your product: order shipped, password reset, friend mentioned you, or marketing campaign. Unlike a simple “send email” script, production systems must handle millions of recipients, respect user preferences, survive provider outages, and avoid duplicate or lost messages when everything retries at once.
This guide walks the full interview arc (requirements, capacity, architecture, storage, APIs, channel adapters) and dedicates sections to failure points and failure modes, at the same depth as the URL shortener and payment system guides on this site.
Design prompt
Design a notification platform that accepts events from product services and delivers messages to users on their preferred channels.
Support high throughput, retries, scheduling, and at-least-once ingestion without duplicate user-visible spam.
What you should be able to do after reading:
- Separate event ingestion, fan-out, channel routing, and delivery workers.
- Size queue throughput for marketing bursts vs steady transactional traffic.
- Model templates, user preferences, device tokens, and delivery status.
- Explain push (FCM/APNs), email, and SMS provider constraints.
- Map failure points and failure modes with idempotency and DLQ recovery.
1. Requirements gathering
1.1 Functional requirements
- Send notification — product teams publish an event (e.g. `order.shipped`) with payload and target user(s).
- Multi-channel delivery — push, email, SMS, in-app inbox (and optionally webhook to merchant).
- User preferences — per-channel opt-in/out, quiet hours, locale, category toggles (marketing vs transactional).
- Templates — parameterized subject/body per channel; i18n variants.
- Scheduling — send now or at a future time (reminders, digest at 9am).
- Priority — transactional (OTP, security) bypasses marketing caps; rate limits differ.
- Delivery tracking — queued, sent, delivered, failed, opened (where provider supports).
- Bulk / broadcast (optional) — send to segment or all users (marketing blast).
Usually out of scope unless asked: building your own SMTP server, in-app chat, full campaign A/B analytics platform, WhatsApp Business API nuances.
1.2 Non-functional requirements
- Scalability — handle spikes (Black Friday, viral post) without dropping transactional messages.
- Availability — ingestion API highly available; delivery can lag briefly but must complete.
- Latency — transactional push/email < 30–60 s p95; marketing can be minutes.
- Durability — accepted events are not lost before persistence.
- Idempotency / deduplication — same logical event must not notify the user five times.
- At-least-once delivery to providers with dedup on consumer side.
- Compliance — CAN-SPAM, GDPR consent, SMS opt-in, unsubscribe links for email.
- Observability — per-channel success rate, queue lag, provider error codes.
Assumptions for capacity math: 10M daily active users (DAU); 20 notifications per user per day average; 30% push, 50% email, 15% in-app, 5% SMS; peak burst 5× average during campaigns; 3 channels max per logical event after routing.
2. Capacity estimation
2.1 Notification volume
```text
DAU = 10,000,000
Notifications per user per day = 20
Total notifications per day = 10M × 20 = 200,000,000
Average per second = 200M / 86,400 ≈ 2,300/sec
Peak (5×) ≈ 11,500/sec
```
Each logical event may fan out to multiple channel jobs—plan queue consumers for peak channel messages, not just ingest events.
2.2 Channel breakdown (daily)
```text
Push:   200M × 0.30 = 60M/day  ≈ 694/sec avg
Email:  200M × 0.50 = 100M/day ≈ 1,157/sec avg
In-app: 200M × 0.15 = 30M/day  ≈ 347/sec avg
SMS:    200M × 0.05 = 10M/day  ≈ 116/sec avg
```
Email and push workers scale independently. SMS is low volume but high cost—strict rate limits.
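The arithmetic in 2.1 and 2.2 can be reproduced in a few lines. This is a sketch that uses only the assumptions stated in 1.2; the variable names are illustrative.

```python
# Capacity math from sections 2.1 and 2.2, using the stated assumptions.
DAU = 10_000_000
PER_USER_PER_DAY = 20
PEAK_MULTIPLIER = 5
SECONDS_PER_DAY = 86_400
CHANNEL_SHARE = {"push": 0.30, "email": 0.50, "in_app": 0.15, "sms": 0.05}

total_per_day = DAU * PER_USER_PER_DAY          # 200,000,000
avg_per_sec = total_per_day / SECONDS_PER_DAY    # roughly 2,300
peak_per_sec = avg_per_sec * PEAK_MULTIPLIER     # roughly 11,500

# Average messages per second for each channel.
channel_avg_per_sec = {
    channel: total_per_day * share / SECONDS_PER_DAY
    for channel, share in CHANNEL_SHARE.items()
}

print(f"avg {avg_per_sec:,.0f}/sec, peak {peak_per_sec:,.0f}/sec")
for channel, rate in channel_avg_per_sec.items():
    print(f"{channel}: {rate:,.0f}/sec avg")
```

Changing `PEAK_MULTIPLIER` or a channel share shows quickly which worker pool the assumption stresses most.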
2.3 Storage
Per notification record (metadata + status history):
| Field group | Size (approx.) |
|---|---|
| IDs, user, template, channel, status | ~200 bytes |
| Payload / rendered body (truncated in DB) | ~500 bytes |
| Timestamps, provider message id | ~100 bytes |
```text
Per notification ≈ 800 bytes
200M/day × 365 × 800 B ≈ 58 TB/year raw (order of magnitude)
Retention policy: keep 90 days hot in OLTP; archive to object storage
```
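As a quick check of the storage figures, the sketch below recomputes the yearly raw volume and also derives the size of the 90-day hot tier. It counts raw row bytes only; indexes and replication are not included, and the constant names are illustrative.

```python
# Storage estimate from section 2.3 (raw rows only, no indexes or replicas).
BYTES_PER_RECORD = 200 + 500 + 100   # the three field groups in the table
RECORDS_PER_DAY = 200_000_000
HOT_RETENTION_DAYS = 90               # retention policy stated above
TB = 10**12                           # decimal terabyte

raw_per_year_tb = RECORDS_PER_DAY * 365 * BYTES_PER_RECORD / TB
hot_tier_tb = RECORDS_PER_DAY * HOT_RETENTION_DAYS * BYTES_PER_RECORD / TB

print(f"raw per year: {raw_per_year_tb:.1f} TB")
print(f"90-day hot tier: {hot_tier_tb:.1f} TB")
```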
2.4 Queue and bandwidth
- Kafka/SQS messages ~1–2 KB each → ~12–23 MB/s at the 11,500/sec peak if uncompressed JSON.
- Compress payloads; partition by `user_id` hash for ordering per user.
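Partitioning by a `user_id` hash might look like the sketch below. The function name is hypothetical, and SHA-256 is used only because it is stable across processes (Python's built-in `hash()` is salted per process). In practice a Kafka client applies its own partitioner when you pass `user_id` as the message key.

```python
import hashlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user_id to a stable partition index in [0, num_partitions)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a partition.
    return int.from_bytes(digest[:8], "big") % num_partitions


# The same user always lands on the same partition, which preserves
# per-user ordering as long as the partition count does not change.
print(partition_for("usr_abc", 50) == partition_for("usr_abc", 50))
```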
2.5 Infrastructure sizing (starting point)
| Component | Initial sizing |
|---|---|
| Ingest API | 6–10 instances; stateless |
| Kafka cluster | Partition by priority + shard; 50+ partitions for parallelism |
| Push workers | Pool sized to provider limits (FCM multicast up to 500 tokens per call; APNs one notification per HTTP/2 request) |
| Email workers | Pool sized to SES/SendGrid TPS quota |
| Scheduler | Cron + delayed queue (Redis ZSET, or a persistent scheduled table with a scanner) |
| Metadata DB | PostgreSQL sharded by user_id or time partition |
3. High-level design
- Notification API — accepts send requests from internal services (authenticated).
- Event bus — durable queue (Kafka) for decoupling producers from delivery.
- Router / orchestrator — resolves template, preferences, locale; emits per-channel jobs.
- Channel workers — push, email, SMS, in-app writers; each talks to external providers.
- Template service — stores versions; renders with payload.
- Preference store — user channel settings, quiet hours, opt-outs.
- Device registry — FCM/APNs tokens per device; invalidate on bounce.
- Scheduler — delayed and recurring notifications.
- Status & analytics — delivery callbacks; dashboards.
```mermaid
flowchart TB
  subgraph producers [Product services]
    O[Order service]
    A[Auth service]
    M[Marketing]
  end
  subgraph platform [Notification platform]
    API[Notification API]
    BUS[(Event bus)]
    ORCH[Router / orchestrator]
    TPL[Templates]
    PREF[Preferences]
    DEV[Device registry]
    WP[Push worker]
    WE[Email worker]
    WS[SMS worker]
    WI[In-app worker]
    SCH[Scheduler]
    DB[(Notification store)]
  end
  subgraph providers [External providers]
    FCM[FCM / APNs]
    SES[Email provider]
    TW[SMS provider]
  end
  O --> API
  A --> API
  M --> API
  API --> BUS
  BUS --> ORCH
  ORCH --> TPL
  ORCH --> PREF
  ORCH --> DEV
  ORCH --> WP
  ORCH --> WE
  ORCH --> WS
  ORCH --> WI
  SCH --> BUS
  WP --> FCM
  WE --> SES
  WS --> TW
  WP --> DB
  WE --> DB
  WI --> DB
```
End-to-end flow
- Order service calls `POST /v1/notifications` with `user_id`, `template_key`, payload, and an `Idempotency-Key` header.
- API validates, persists a `notification_requests` row, publishes to Kafka topic by priority.
- Router consumes: load preferences → skip opted-out channels → render templates → enqueue channel-specific messages.
- Push worker batches device tokens → FCM/APNs → update status `sent`/`failed`.
- In-app worker writes to user’s notification feed (DB or Cassandra).
- Provider webhooks (email bounce, push failure) update device registry and status.
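The router step in the flow above can be sketched as a pure function from a request plus preferences to per-channel jobs. This is a minimal illustration, not a full router: template rendering and quiet hours are left out, the class and function names are hypothetical, and treating the in-app inbox as always enabled is an assumption (the preference fields in this guide have no in-app flag).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Preferences:
    push_enabled: bool = True
    email_enabled: bool = True
    sms_enabled: bool = False
    marketing_enabled: bool = False


@dataclass
class ChannelJob:
    request_id: str
    channel: str
    status: str                      # "pending" or "skipped"
    skip_reason: Optional[str] = None


def route(request_id: str, channels: List[str], priority: str,
          prefs: Preferences, has_device: bool) -> List[ChannelJob]:
    """Turn one accepted request into one job per requested channel."""
    enabled = {
        "push": prefs.push_enabled,
        "email": prefs.email_enabled,
        "sms": prefs.sms_enabled,
        "in_app": True,  # assumption: the in-app inbox has no opt-out flag
    }
    jobs = []
    for channel in channels:
        if priority == "marketing" and not prefs.marketing_enabled:
            jobs.append(ChannelJob(request_id, channel, "skipped", "opted_out"))
        elif not enabled.get(channel, False):
            jobs.append(ChannelJob(request_id, channel, "skipped", "opted_out"))
        elif channel == "push" and not has_device:
            jobs.append(ChannelJob(request_id, channel, "skipped", "no_device"))
        else:
            jobs.append(ChannelJob(request_id, channel, "pending"))
    return jobs


jobs = route("req_1", ["push", "email", "in_app"], "normal",
             Preferences(email_enabled=False), has_device=True)
for job in jobs:
    print(job.channel, job.status, job.skip_reason)
```

Recording skipped channels as rows, rather than dropping them silently, keeps an audit trail for "why did I not get this email" support questions.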
```mermaid
sequenceDiagram
  participant P as Producer service
  participant API as Notification API
  participant K as Kafka
  participant R as Router
  participant W as Push worker
  participant F as FCM
  participant U as User device
  P->>API: POST notification idempotency_key
  API->>K: publish event
  API-->>P: 202 accepted notification_id
  K->>R: consume
  R->>K: enqueue push.job
  K->>W: consume push.job
  W->>F: send multicast
  F-->>U: push delivered
  W->>W: update status delivered
```
4. Database design
4.1 Core tables
```sql
CREATE TABLE notification_requests (
  id               UUID PRIMARY KEY,
  idempotency_key  VARCHAR(128) NOT NULL,
  source_service   TEXT NOT NULL,
  template_key     TEXT NOT NULL,
  user_id          UUID NOT NULL,
  payload          JSONB NOT NULL,
  priority         TEXT NOT NULL DEFAULT 'normal',   -- critical | normal | marketing
  scheduled_at     TIMESTAMPTZ,
  status           TEXT NOT NULL DEFAULT 'accepted',
  created_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (source_service, idempotency_key)
);

CREATE TABLE notification_deliveries (
  id               UUID PRIMARY KEY,
  request_id       UUID NOT NULL REFERENCES notification_requests(id),
  user_id          UUID NOT NULL,
  channel          TEXT NOT NULL,                    -- push | email | sms | in_app
  rendered_subject TEXT,
  rendered_body    TEXT,
  status           TEXT NOT NULL,                    -- pending | sent | delivered | failed | skipped
  skip_reason      TEXT,                             -- opted_out | quiet_hours | no_device
  provider         TEXT,
  provider_msg_id  TEXT,
  attempt_count    INT NOT NULL DEFAULT 0,
  last_error       TEXT,
  created_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (request_id, channel)                       -- dedup on router retry (see 4.3)
);

CREATE TABLE user_preferences (
  user_id           UUID PRIMARY KEY,
  push_enabled      BOOLEAN NOT NULL DEFAULT TRUE,
  email_enabled     BOOLEAN NOT NULL DEFAULT TRUE,
  sms_enabled       BOOLEAN NOT NULL DEFAULT FALSE,
  marketing_enabled BOOLEAN NOT NULL DEFAULT FALSE,
  quiet_hours_start TIME,
  quiet_hours_end   TIME,
  timezone          TEXT NOT NULL DEFAULT 'UTC',
  locale            TEXT NOT NULL DEFAULT 'en'
);

CREATE TABLE device_tokens (
  id         UUID PRIMARY KEY,
  user_id    UUID NOT NULL,
  platform   TEXT NOT NULL,                          -- ios | android | web
  token      TEXT NOT NULL,
  is_valid   BOOLEAN NOT NULL DEFAULT TRUE,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (token)
);

CREATE TABLE templates (
  id           UUID PRIMARY KEY,
  template_key TEXT NOT NULL,
  channel      TEXT NOT NULL,
  locale       TEXT NOT NULL DEFAULT 'en',
  version      INT NOT NULL,
  subject_tpl  TEXT,
  body_tpl     TEXT NOT NULL,
  UNIQUE (template_key, channel, locale, version)
);
```
4.2 In-app feed (high volume)
For read-heavy notification centers, store feed items in Cassandra/DynamoDB partitioned by user_id, sorted by created_at DESC. Keep OLTP notification_deliveries for audit; duplicate lean feed row for UI.
4.3 Idempotency
Unique (source_service, idempotency_key) on request. For fan-out, unique (request_id, channel) on deliveries prevents duplicate channel sends on router retry.
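To see how the two unique constraints absorb retries, here is a minimal sketch using an in-memory SQLite database as a stand-in for PostgreSQL. The tables are trimmed to the columns that matter, the function names are hypothetical, and `INSERT OR IGNORE` is the SQLite spelling of what would be `INSERT ... ON CONFLICT DO NOTHING` in PostgreSQL.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE notification_requests (
    id TEXT PRIMARY KEY,
    source_service TEXT NOT NULL,
    idempotency_key TEXT NOT NULL,
    UNIQUE (source_service, idempotency_key)
);
CREATE TABLE notification_deliveries (
    id TEXT PRIMARY KEY,
    request_id TEXT NOT NULL REFERENCES notification_requests(id),
    channel TEXT NOT NULL,
    UNIQUE (request_id, channel)
);
""")


def accept_request(source_service: str, idempotency_key: str) -> str:
    """Insert the request once; on a retry, return the id stored the first time."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT OR IGNORE INTO notification_requests VALUES (?, ?, ?)",
            (str(uuid.uuid4()), source_service, idempotency_key),
        )
    row = conn.execute(
        "SELECT id FROM notification_requests "
        "WHERE source_service = ? AND idempotency_key = ?",
        (source_service, idempotency_key),
    ).fetchone()
    return row[0]


def create_delivery(request_id: str, channel: str) -> bool:
    """Return True if this call created the row, False if it already existed."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO notification_deliveries VALUES (?, ?, ?)",
            (str(uuid.uuid4()), request_id, channel),
        )
    return cur.rowcount == 1


first = accept_request("orders", "ord_ship_991")
retry = accept_request("orders", "ord_ship_991")
print(first == retry)                      # the retry maps to the same request
print(create_delivery(first, "push"))      # first router pass creates the row
print(create_delivery(first, "push"))      # router redelivery is a no-op
```

A router that only enqueues a channel job when `create_delivery` returns `True` will not send twice after a consumer redelivery.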
5. API design
5.1 Send notification
POST /v1/notifications
Headers: Idempotency-Key: ord_ship_991
```json
{
  "user_id": "usr_abc",
  "template_key": "order.shipped",
  "payload": {
    "order_id": "ord_991",
    "tracking_url": "https://track.example/991"
  },
  "channels": ["push", "email", "in_app"],
  "priority": "normal",
  "scheduled_at": null
}
```
202 Accepted
```json
{
  "notification_id": "ntf_7k2",
  "status": "accepted"
}
```
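On the producer side, the key matters more than the transport: it should be derived from the business event so that a retry reuses it. The sketch below builds the request for the example above without sending it. The helper names and the `event:id` key format are assumptions for illustration, not part of the API.

```python
import json
from typing import Dict, List, Tuple


def idempotency_key(event_type: str, business_id: str) -> str:
    """Same business event gives the same key, however often it is retried."""
    return f"{event_type}:{business_id}"


def build_send_request(user_id: str, template_key: str, payload: dict,
                       channels: List[str], key: str) -> Tuple[Dict[str, str], str]:
    """Return (headers, JSON body) for POST /v1/notifications."""
    headers = {"Idempotency-Key": key, "Content-Type": "application/json"}
    body = json.dumps({
        "user_id": user_id,
        "template_key": template_key,
        "payload": payload,
        "channels": channels,
        "priority": "normal",
        "scheduled_at": None,
    })
    return headers, body


key = idempotency_key("order.shipped", "ord_991")
headers, body = build_send_request(
    "usr_abc", "order.shipped",
    {"order_id": "ord_991", "tracking_url": "https://track.example/991"},
    ["push", "email", "in_app"], key,
)
print(headers["Idempotency-Key"])
```

Generating a fresh random key on each attempt would defeat the unique constraint on the server, because every retry would look like a new event.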
5.2 Get user feed (in-app)
GET /v1/users/{user_id}/notifications?cursor=...
Returns paginated feed items with read/unread flag.
5.3 Update preferences
PATCH /v1/users/{user_id}/notification-preferences
```json
{
  "email_enabled": true,
  "marketing_enabled": false,
  "quiet_hours_start": "22:00",
  "quiet_hours_end": "08:00",
  "timezone": "Asia/Kolkata"
}
```
5.4 Register device token
PUT /v1/users/{user_id}/devices
```json
{
  "platform": "android",
  "token": "fcm_device_token_xyz"
}
```
6. Diving deep into key components
6.1 Priority queues
Do not let marketing blasts starve OTP emails.
| Priority | Examples | Queue treatment |
|---|---|---|
| critical | OTP, password reset, fraud alert | Dedicated topic + workers; no marketing backpressure |
| normal | Order updates, mentions | Standard topic; fair scheduling |
| marketing | Promotions, newsletters | Low-priority topic; rate-limited; defer during overload |
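One way to express the table above in code is a priority-to-topic map plus a shedding rule that only ever defers marketing traffic. The topic names and the 60-second lag threshold are assumptions chosen for the example.

```python
TOPIC_BY_PRIORITY = {
    "critical": "notifications.critical",
    "normal": "notifications.normal",
    "marketing": "notifications.marketing",
}


def topic_for(priority: str) -> str:
    """Pick the topic for a priority; reject unknown values at ingest."""
    try:
        return TOPIC_BY_PRIORITY[priority]
    except KeyError:
        raise ValueError(f"unknown priority: {priority!r}") from None


def should_defer(priority: str, queue_lag_seconds: float,
                 max_lag_seconds: float = 60.0) -> bool:
    """Defer only marketing traffic when the pipeline is overloaded."""
    return priority == "marketing" and queue_lag_seconds > max_lag_seconds


print(topic_for("critical"))
print(should_defer("marketing", queue_lag_seconds=300))
print(should_defer("critical", queue_lag_seconds=300))
```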
6.2 Fan-out strategies
| Pattern | When | Tradeoff |
|---|---|---|
| Per-user on ingest | Single user notification | Simple; one Kafka message |
| Fan-out worker | Notify 10M users in segment | Stream segment IDs; avoid 10M API calls from producer |
| Hybrid | Large broadcast | Batch enqueue with rate limiter token bucket |
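For the fan-out worker row, the key idea is to stream the segment and publish one message per batch instead of one per user. A minimal sketch follows; `publish` is a hypothetical callback standing in for a queue producer, and the batch size is arbitrary.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` ids without loading the whole segment."""
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fan_out(segment_ids: Iterable[str], template_key: str, batch_size: int,
            publish: Callable[[dict], None]) -> int:
    """Publish one message per batch of users; return the number of batches."""
    count = 0
    for batch in batched(segment_ids, batch_size):
        publish({"template_key": template_key, "user_ids": batch})
        count += 1
    return count


published = []
segment = (f"usr_{i}" for i in range(2500))   # a generator, as if streamed
n = fan_out(segment, "promo.spring", 1000, published.append)
print(n, [len(m["user_ids"]) for m in published])
```

Because the segment is consumed lazily, memory use stays proportional to the batch size rather than the segment size.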
6.3 Push delivery (FCM / APNs)
- FCM — multicast up to 500 tokens; handle invalid token → mark device invalid.
- APNs — HTTP/2; device token per iOS device; collapse id for replacing stale notifications.
- Payload size — keep under 4 KB; deep link in data payload.
- Silent push — background refresh; different priority headers.
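The batching and token-hygiene rules above can be sketched independently of any SDK. Here `send_batch` and `mark_invalid` are hypothetical callbacks: the first stands in for a provider client and returns one result string per token, the second for a device-registry update. Real provider responses use their own error codes.

```python
from typing import Callable, Iterator, List

FCM_MULTICAST_LIMIT = 500  # multicast limit quoted in the list above


def chunk(tokens: List[str], size: int = FCM_MULTICAST_LIMIT) -> Iterator[List[str]]:
    """Split the token list into provider-sized batches."""
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]


def send_push(tokens: List[str], message: dict,
              send_batch: Callable[[List[str], dict], List[str]],
              mark_invalid: Callable[[str], None]) -> List[str]:
    """Send in batches; invalidate dead tokens; return tokens worth retrying.

    send_batch returns one of "ok", "invalid", or "retry" per token.
    """
    to_retry = []
    for batch in chunk(tokens):
        results = send_batch(batch, message)
        for token, result in zip(batch, results):
            if result == "invalid":
                mark_invalid(token)       # permanent: never retry this token
            elif result == "retry":
                to_retry.append(token)    # transient: hand to the retry path
    return to_retry


# Fake provider for demonstration: one dead token, one transient failure.
batch_sizes, invalid = [], []


def fake_send(batch, message):
    batch_sizes.append(len(batch))
    return ["invalid" if t == "t7" else "retry" if t == "t900" else "ok"
            for t in batch]


retry = send_push([f"t{i}" for i in range(1201)], {"title": "Shipped"},
                  fake_send, invalid.append)
print(batch_sizes, invalid, retry)
```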
6.4 Email delivery
- Use ESP (SES, SendGrid) with DKIM/SPF configured.
- Separate transactional vs marketing IP pools (reputation isolation).
- Include List-Unsubscribe header for marketing.
- Process bounces/complaints via webhook → suppress email for user.
6.5 SMS delivery
- Expensive—reserve for critical alerts and markets that need SMS.
- Strict opt-in; template pre-registration in some countries.
- Low TPS per sender ID—queue with dedicated rate limiter.
6.6 Scheduling and quiet hours
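If scheduled_at is in the future, write the job to a delayed queue, for example a Redis sorted set scored by fire timestamp. Kafka has no native delayed delivery, so a Kafka-based variant needs time-bucketed topics or a scanner that republishes due jobs. The router re-checks quiet hours at fire time: a user in Mumbai may have a marketing push deferred until 9am local.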
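The quiet-hours re-check at fire time can be sketched as below. The function names are hypothetical, and the fixed +05:30 offset is a stand-in for `Asia/Kolkata`; a real router would more likely build the timezone from the stored preference with `zoneinfo.ZoneInfo`. Letting critical traffic bypass quiet hours is an assumption that follows the priority rules in 1.1.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(local_t: time, start: time, end: time) -> bool:
    """True if local_t is in [start, end), including windows that wrap midnight."""
    if start <= end:
        return start <= local_t < end
    return local_t >= start or local_t < end


def next_send_time(now_utc: datetime, tz: tzinfo, start: time, end: time,
                   priority: str) -> datetime:
    """Return now, or the end of the current quiet window converted to UTC."""
    if priority == "critical":
        return now_utc                      # assumption: OTP etc. are never deferred
    local = now_utc.astimezone(tz)
    if not in_quiet_hours(local.time(), start, end):
        return now_utc
    resume = local.replace(hour=end.hour, minute=end.minute,
                           second=0, microsecond=0)
    if resume <= local:                     # window ends tomorrow, local time
        resume += timedelta(days=1)
    return resume.astimezone(timezone.utc)


IST = timezone(timedelta(hours=5, minutes=30))            # stand-in for Asia/Kolkata
now = datetime(2024, 1, 10, 20, 30, tzinfo=timezone.utc)  # 02:00 local, next day
print(next_send_time(now, IST, time(22, 0), time(8, 0), "marketing"))
```

Evaluating the window in the user's local time, rather than UTC, is what prevents the 2am push described later under failure modes.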
6.7 Retries and dead-letter queue (DLQ)
- Transient provider 5xx → retry with exponential backoff (max 5 attempts).
- Permanent failure (invalid token, hard bounce) → mark failed; no infinite retry.
- After max retries → DLQ for manual replay or auto replay when provider healthy.
- Idempotent retry: same `delivery_id`; check provider idempotency if supported.
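The retry rules above might be sketched as follows. The exception classes and function names are hypothetical, `send` stands in for a provider call, and the jitter on the backoff is an addition of this example (the text only asks for exponential backoff).

```python
import random
from typing import Callable, List

MAX_ATTEMPTS = 5  # from the retry policy above


class TransientError(Exception):
    """Provider 5xx, timeout, or throttle: worth retrying."""


class PermanentError(Exception):
    """Invalid token or hard bounce: retrying cannot help."""


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter; attempt numbering starts at 1."""
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))


def deliver_with_retry(delivery_id: str, send: Callable[[str], None],
                       dlq: List[str], sleep: Callable[[float], None]) -> str:
    """Return "sent", "failed" (permanent), or "dlq" (retries exhausted)."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            send(delivery_id)            # same delivery_id on every attempt
            return "sent"
        except PermanentError:
            return "failed"              # mark failed; no further attempts
        except TransientError:
            if attempt < MAX_ATTEMPTS:
                sleep(backoff_seconds(attempt))
    dlq.append(delivery_id)              # park for manual or automatic replay
    return "dlq"
```

Passing `sleep` as a parameter keeps the function testable; a worker would pass `time.sleep` or, more likely, re-enqueue the job with a delay instead of blocking.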
6.8 Template rendering
Store Handlebars/Mustache templates per template_key + channel + locale. Render in router with validated payload schema (JSON Schema). Missing field → fail fast at ingest, not at send time.
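A minimal sketch of validate-then-render follows. It swaps in Python's standard `string.Template` for Handlebars/Mustache and a plain set of required field names for a JSON Schema, so it illustrates the fail-fast idea rather than the exact tooling; the `TEMPLATES` registry and function names are hypothetical.

```python
from string import Template
from typing import Tuple

# Keyed by (template_key, channel, locale), mirroring the storage key above.
TEMPLATES = {
    ("order.shipped", "email", "en"): {
        "subject": Template("Your order $order_id has shipped"),
        "body": Template("Track it here: $tracking_url"),
        "required": {"order_id", "tracking_url"},
    },
}


def validate_payload(template_key: str, channel: str, locale: str,
                     payload: dict) -> None:
    """Raise ValueError at ingest if the payload lacks a required field."""
    tpl = TEMPLATES[(template_key, channel, locale)]
    missing = tpl["required"] - set(payload)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")


def render(template_key: str, channel: str, locale: str,
           payload: dict) -> Tuple[str, str]:
    """Return (subject, body) for a payload that passes validation."""
    validate_payload(template_key, channel, locale, payload)
    tpl = TEMPLATES[(template_key, channel, locale)]
    return tpl["subject"].substitute(payload), tpl["body"].substitute(payload)


subject, body = render("order.shipped", "email", "en", {
    "order_id": "ord_991",
    "tracking_url": "https://track.example/991",
})
print(subject)
print(body)
```

Calling `validate_payload` in the API handler means a bad payload is rejected with a 4xx to the producer instead of becoming a poison message downstream.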
7. Failure points
Failure points are architectural locations where a fault creates lost messages, duplicates, wrong channel delivery, or unbounded backlog. Design assuming every point fails independently.
| # | Failure point | What breaks | Observed symptom | Mitigation design |
|---|---|---|---|---|
| FP1 | Producer → Notification API | Timeout; producer retries with new key | Duplicate order-shipped emails | Idempotency-Key per business event |
| FP2 | API → DB persist | DB down after accept | Producer got 500; event lost | Transactional outbox: request row + outbox row in one DB transaction; relay publishes to Kafka |
| FP3 | API → Kafka publish | Message not in bus | Request row stuck in `accepted`, never routed | Outbox poller; reconciliation job scans stuck requests |
| FP4 | Router consumer | Crash after enqueue push but before commit offset | Duplicate push on redelivery | Unique (request_id, channel); idempotent router |
| FP5 | Preference / template lookup | Stale cache shows opted-in | User complaint; compliance risk | Short TTL + version on prefs; critical path reads primary |
| FP6 | Channel worker → provider | FCM timeout; unknown if sent | Status stuck `pending` | Query provider receipt API; mark ambiguous; safe retry rules |
| FP7 | Provider → device | APNs token invalid | High failure rate for user | Invalidate token; fallback to email if policy allows |
| FP8 | Provider webhook ingress | Bounce webhook lost | Send to dead addresses repeatedly | Webhook retries; sync suppression list job |
| FP9 | Scheduler / delayed queue | Clock skew; lost delayed jobs | Reminder never sent | Persistent scheduled table + scanner; not only memory queue |
| FP10 | Broadcast fan-out | Segment job overwhelms cluster | Critical OTP lag hours | Isolate marketing topic; rate limit; autoscale workers |
```mermaid
flowchart LR
  P[Producer] -->|FP1| API[Notification API]
  API -->|FP2| DB[(DB)]
  API -->|FP3| K[Kafka]
  K -->|FP4| R[Router]
  R -->|FP5| PREF[Preferences]
  R --> W[Channel workers]
  W -->|FP6| PR[FCM / SES / SMS]
  PR -->|FP7| DEV[User device]
  PR -->|FP8| WH[Provider webhooks]
  SCH[Scheduler] -->|FP9| K
  MKT[Marketing fan-out] -->|FP10| K
```
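The outbox mitigation for FP2 and FP3 can be sketched with an in-memory SQLite database standing in for the metadata DB. The `outbox` table, its columns, and the function names are assumptions for illustration; `publish` is a hypothetical callback standing in for a Kafka producer.

```python
import json
import sqlite3
from typing import Callable

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE notification_requests (id TEXT PRIMARY KEY, payload TEXT NOT NULL);
CREATE TABLE outbox (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    request_id TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
""")


def accept(request_id: str, payload: dict) -> None:
    """Write the request and its outbox row in one DB transaction."""
    with conn:  # commits both inserts together, or rolls both back
        conn.execute("INSERT INTO notification_requests VALUES (?, ?)",
                     (request_id, json.dumps(payload)))
        conn.execute("INSERT INTO outbox (request_id) VALUES (?)",
                     (request_id,))


def relay_once(publish: Callable[[str], None]) -> int:
    """Publish pending rows in order; mark each only after publish succeeds."""
    rows = conn.execute(
        "SELECT seq, request_id FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, request_id in rows:
        publish(request_id)   # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


accept("req_1", {"template_key": "order.shipped"})
accept("req_2", {"template_key": "auth.otp"})
bus = []
print(relay_once(bus.append), bus)
```

A crash between `publish` and the `UPDATE` causes the same row to be published again on the next pass, so the relay is at-least-once and relies on the idempotent router described under FP4.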
8. Failure modes
Failure modes describe recurring failure patterns across those points—what operators see, why it happens, and the safe system response.
8.1 Duplicate notification (at-least-once everywhere)
Symptom: User receives three “order shipped” pushes.
Cause: Producer retry without idempotency; router redelivery; worker retry without dedup key.
Safe response: Idempotency on request; unique delivery row per channel; FCM collapse key for same logical event.
8.2 Lost notification (silent drop)
Symptom: User never got OTP; logs show accepted then nothing.
Cause: FP2/FP3 — not persisted or not published to Kafka.
Safe response: Outbox pattern; alert on requests accepted > 5 min without delivery row; reconciliation scanner.
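The reconciliation scanner reduces to one filter: accepted, older than the threshold, and no delivery row. A minimal sketch over in-memory data follows; in practice this would be a SQL query against the request and delivery tables, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Set, Tuple

STUCK_AFTER = timedelta(minutes=5)   # alert threshold quoted above


def find_stuck_requests(requests: Iterable[Tuple[str, str, datetime]],
                        delivered_request_ids: Set[str],
                        now: datetime) -> List[str]:
    """requests holds (request_id, status, created_at) tuples."""
    return [
        request_id
        for request_id, status, created_at in requests
        if status == "accepted"
        and request_id not in delivered_request_ids
        and now - created_at > STUCK_AFTER
    ]


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
requests = [
    ("req_old_lost", "accepted", now - timedelta(minutes=12)),
    ("req_old_ok", "accepted", now - timedelta(minutes=12)),
    ("req_fresh", "accepted", now - timedelta(minutes=1)),
]
print(find_stuck_requests(requests, {"req_old_ok"}, now))
```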
8.3 Poison message (bad payload)
Symptom: Consumer stuck retrying same message; lag grows.
Cause: Invalid template variable; schema drift.
Safe response: Validate at API; DLQ after N failures; schema registry for templates.
8.4 Provider throttle (rate limit)
Symptom: Mass 429 from SES; delivery delay hours.
Cause: FP10 marketing blast exceeds TPS quota.
Safe response: Token bucket per provider; shed marketing first; request quota increase; shard across multiple ESP sub-accounts carefully (compliance).
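A token bucket is small enough to sketch in full. The class below takes the current time as an argument so it can be tested deterministically; a worker would pass `time.monotonic()`. The class name and the rates in the example are illustrative, and one instance per provider (or per sender ID) is assumed.

```python
class TokenBucket:
    """Allow `rate_per_sec` sends on average, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity      # start full
        self.updated = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_sec=2, capacity=2, now=0.0)
burst = [bucket.try_acquire(now=0.0) for _ in range(3)]
print(burst)                          # two sends fit in the burst, the third waits
print(bucket.try_acquire(now=0.5))    # half a second later one token is back
```

When `try_acquire` returns `False`, a worker can requeue marketing jobs with a delay while still letting critical jobs draw from their own bucket.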
8.5 Priority inversion
Symptom: OTP delayed; marketing emails went out fine earlier.
Cause: Shared worker pool saturated by campaign.
Safe response: Separate topics and worker deployments per priority; weighted fair queuing (WFQ) weights where a pool must be shared.
8.6 Stale device token
Symptom: Push always fails for user; no fallback.
Cause: FP7 — user reinstalled app; old token not updated.
Safe response: Invalidate on permanent error; prompt re-register token; optional email fallback for critical.
8.7 Quiet hours violation
Symptom: Marketing push at 2am local.
Cause: FP5 — cached prefs; scheduled job used UTC not user TZ.
Safe response: Re-evaluate quiet hours at send time in user timezone; defer non-critical.
8.8 Ambiguous provider timeout
Symptom: Unknown if email sent; retry may duplicate.
Cause: FP6 — HTTP timeout to SendGrid.
Safe response: Status unknown; query provider by message id before retry; idempotent send API if available.
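The probe-before-retry rule can be written as a small decision function. Both callbacks are hypothetical: `lookup_status` stands in for whatever receipt or message-lookup facility a given provider offers (not all do), and `resend` for the normal send path.

```python
from typing import Callable


def resolve_ambiguous(delivery_id: str,
                      lookup_status: Callable[[str], str],
                      resend: Callable[[str], None]) -> str:
    """Decide what to do after a send whose outcome is unknown.

    lookup_status returns "sent", "not_found", or "unknown".
    """
    status = lookup_status(delivery_id)
    if status == "sent":
        return "sent"        # provider has it: do not send again
    if status == "not_found":
        resend(delivery_id)  # provider never saw it: safe to retry
        return "resent"
    return "unknown"         # cannot tell yet: leave for a later probe


resent = []
print(resolve_ambiguous("dlv_1", lambda d: "sent", resent.append))
print(resolve_ambiguous("dlv_2", lambda d: "not_found", resent.append))
print(resent)
```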
8.9 In-app feed out of sync
Symptom: Push received but inbox empty.
Cause: In-app worker failed while push succeeded.
Safe response: Independent delivery rows per channel; partial success is valid state; UI merges channels.
8.10 DLQ pile-up (cascading backlog)
Symptom: Millions in DLQ after provider outage.
Cause: All workers failed together; replay floods on recovery.
Safe response: Gradual DLQ replay with rate limit; priority order; extend retention on Kafka.
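Throttled, priority-ordered replay might be sketched like this. The function processes one bounded batch per call, so a scheduler can invoke it at whatever rate the recovered provider tolerates. The message shape and names are assumptions, and `republish` is a hypothetical callback.

```python
from typing import Callable, List

PRIORITY_ORDER = {"critical": 0, "normal": 1, "marketing": 2}


def replay_batch(dlq: List[dict], max_messages: int,
                 republish: Callable[[dict], None]) -> List[dict]:
    """Replay up to max_messages, most urgent first; return what is left."""
    # sorted() is stable, so messages keep their arrival order within a priority.
    ordered = sorted(dlq, key=lambda m: PRIORITY_ORDER[m["priority"]])
    for message in ordered[:max_messages]:
        republish(message)
    return ordered[max_messages:]


dlq = [
    {"id": "m1", "priority": "marketing"},
    {"id": "m2", "priority": "critical"},
    {"id": "m3", "priority": "normal"},
    {"id": "m4", "priority": "critical"},
]
replayed = []
remaining = replay_batch(dlq, 2, replayed.append)
print([m["id"] for m in replayed], [m["id"] for m in remaining])
```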
| Failure mode | Primary failure points | User impact | Core mitigation |
|---|---|---|---|
| Duplicate notification | FP1, FP4, FP6 | Spam | Idempotency + dedup keys |
| Lost notification | FP2, FP3 | Missed alert | Outbox + reconciliation |
| Poison message | FP4 | Backlog for all | DLQ + schema validation |
| Provider throttle | FP6, FP10 | Delay | Rate limits + priority queues |
| Priority inversion | FP10 | OTP late | Isolated critical path |
| Stale device token | FP7 | No push | Token hygiene + fallback |
| Quiet hours violation | FP5, FP9 | Trust loss | TZ-aware scheduler |
| Ambiguous timeout | FP6 | Duplicate or miss | Status probe before retry |
| In-app feed out of sync (partial channel success) | FP4, FP7 | Confusing UX | Per-channel status |
| DLQ pile-up | FP6 | Delayed catch-up | Throttled replay |
9. Scalability, availability, and security
9.1 Scalability
- Horizontally scale stateless API and channel workers.
- Partition Kafka by `user_id` for per-user ordering (all channels for one user serialized if needed).
- Shard notification store by `user_id` or time.
- Batch push/email API calls to respect provider limits.
- Precompute segments for marketing offline (warehouse → object store → fan-out job).
9.2 Availability
- Multi-AZ Kafka and DB; cross-region read replicas for feed API.
- Degrade marketing under load; never drop critical topic consumers.
- Provider failover: secondary ESP for email if primary region down (advanced).
- Graceful API response: `202 Accepted` even if delivery is async; don’t block the producer on FCM latency.
9.3 Security and compliance
- mTLS or signed internal tokens on Notification API.
- No PII in Kafka payloads if possible: send `user_id` and resolve email server-side.
- Encrypt device tokens at rest.
- Audit log for preference changes and marketing sends.
- Honor unsubscribe and GDPR erasure—delete feed + suppress address.
10. Tradeoffs recap
| Decision | Common choice | Why |
|---|---|---|
| Pull vs push model | Push to bus (async) | Decouple producer latency from FCM/SES |
| Consistency | Eventual delivery OK | Not financial ledger; favor availability |
| Dedup | Business idempotency key | Cheaper than exactly-once Kafka |
| Feed storage | Cassandra for inbox | High write/read per user |
| Exactly-once vs at-least-once | At-least-once + idempotent consumers | Simpler ops |
11. How to present this in 45 minutes
- 5 min — requirements; channels; transactional vs marketing priority.
- 7 min — capacity: 200M/day, ~11.5k peak/sec, channel split.
- 8 min — diagram: API → Kafka → router → workers → providers.
- 8 min — schema, templates, preferences, idempotency.
- 10 min — failure points + top failure modes (duplicate, lost, throttle, priority inversion).
- 7 min — retries, DLQ, scheduling, tradeoffs.
The one line to remember
A notification system is an async fan-out pipeline: accept events durably, route with preferences and templates, deliver per channel with retries—and use idempotency everywhere so at-least-once infrastructure does not become at-least-three-times spam for the user.