How WhatsApp-style chat works at scale

A messenger at billion-user scale is not one database with a chat bubble on top. It is three problems glued together: a live path for tiny updates (text, receipts, typing), a fat path for photos and voice notes, and a durable path for data you cannot afford to lose.

We build the design in order—requirements first, numbers second, architecture third, APIs last. This uses a WhatsApp-class mental model: wire formats, ids, stores, caches, queues, encryption, and calls.

What you should be able to do after reading:

Separate the three paths—live, fat, durable—and assign the right store to each.
List functional and non-functional requirements and tie them to WebSockets, writers, and object storage.
Walk one message online and offline, naming client_msg_id, server_seq, and receipts.
Explain E2EE building blocks (X3DH, Double Ratchet) and what metadata the server still sees.
Read the technical section: HTTP, WebSocket frames, SQL/Cassandra, Kafka partition keys.

Step 0 — How we will work through the problem

Ordered thinking beats memorizing a diagram. Use this sequence when you design a global messenger:

Clarify scope. Text only today, or calls? Desktop/web, or mobile only? E2EE in scope or transport-only TLS?
Write requirements. Functional = what users see. Non-functional = latency, durability, privacy, scale.
Do napkin math. Messages per day, connection count, bytes per row.
Draw three paths (live, fat, durable) before naming databases.
Tell one story—send with recipient online, then offline, then duplicate retry.

flowchart LR
  subgraph live [Live path]
    WS[WebSocket wss]
    CH[Chat hosts]
  end
  subgraph fat [Fat path]
    UP[Multipart upload]
    CDN[CDN + object store]
  end
  subgraph durable [Durable path]
    WR[Message writer]
    DB[(History store)]
  end
  WS --> CH --> WR --> DB
  UP --> CDN

Step 1 — Functional requirements (what users need)

Area	Requirement	Why scale makes it hard
Chat	1:1 and group conversations	Groups fan out one send to N deliveries
Content	Text, files, photos, voice notes, optional video	Media must not ride the same socket as text
Receipts	Sent, delivered, read on the live path	High rate of tiny control frames
Offline	Server holds mail until device wakes	Inbox + sync on reconnect
Identity	Phone signup, devices, profile	Strong consistency on accounts
Calls (optional)	Voice/video via signaling + RTP	Separate from chat tier capacity

Functional details easy to skip (but worth stating clearly)

Order is per thread. server_seq is meaningful only inside one thread_id.

Dedupe is mandatory. client_msg_id prevents double-tap and retry duplicates.

Receipts are separate events on the live wire—delivered is not the same as read.

Step 2 — Non-functional requirements (engineering promises)

Category	Target (typical)	How we meet it	If we miss it
Latency — send	p99 < 500 ms send→ack	Write-before-ack, regional hosts	Feels sluggish
Latency — live push	Sub-second when online	WebSocket + connection registry	Late receipts
Availability	99.9%+ monthly (often cited)	Redundant fronts, health checks	Outages make news
Durability	Very low message loss	Quorum writes, backups	Permanent trust loss
Order	Total order per thread	Sequencer / single writer lane	Scrambled history
Privacy	TLS; optional E2EE	Signal-style protocols	Legal and user risk
Bandwidth	Tiny control frames; CDN media	Client compression, edge cache	Churn on metered plans

Key idea: Profiles want correctness, history wants write throughput, and Redis can restart empty as long as disk is the source of truth.

Step 3 — Napkin math (why one server is not enough)

Round numbers. Multiply in the open—you are showing magnitude, not precision accounting.

~2 billion monthly active users on large messengers.
~100 billion events/day (messages, receipts, pings).
80B metadata rows/day × 160 bytes ≈ 13 TB/day before media bytes.
1.5B peak sessions ÷ 5M sessions/host ≈ hundreds of connection machines (before regions and headroom).

Honest ranges beat fake precision—the fleet size moves with heartbeat size and framing choices.
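The multiplication above can be written down in a few lines. This is only a sketch of the same napkin math: every constant is one of the round-number assumptions from the bullets, and the average events-per-second figure is derived here rather than quoted from the text.

```python
# Napkin math from the bullets above; every input is a round-number assumption.
ROWS_PER_DAY = 80e9          # metadata rows per day
BYTES_PER_ROW = 160          # average row size in bytes
PEAK_SESSIONS = 1.5e9        # concurrent sockets at peak
SESSIONS_PER_HOST = 5e6      # optimistic per-host socket count
EVENTS_PER_DAY = 100e9       # messages + receipts + pings
SECONDS_PER_DAY = 86_400

metadata_tb_per_day = ROWS_PER_DAY * BYTES_PER_ROW / 1e12
connection_hosts = PEAK_SESSIONS / SESSIONS_PER_HOST
avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY   # derived, not from the text

print(f"metadata: {metadata_tb_per_day:.1f} TB/day")              # metadata: 12.8 TB/day
print(f"connection hosts: {connection_hosts:.0f}")                # connection hosts: 300
print(f"average events/s: {avg_events_per_sec / 1e6:.2f} million")  # 1.16 million
```

Changing SESSIONS_PER_HOST to a more conservative value is the quickest way to see how far the fleet size swings.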

Step 4 — Architecture: the big picture

Phones on the left. DNS and load balancer in the middle. Chat hosts hold long-lived WebSockets. Behind them: SQL for profiles, wide-column or sharded store for history, Redis for presence, object storage + CDN for media. Skinny control on the live wire; fat bytes on the side door.

flowchart TB
  subgraph clients["Your apps"]
    MOB[Phones]
    WEB[Web / desktop]
  end
  subgraph entry["Front door"]
    DNS[DNS / geo routing]
    LB[Load balancer]
  end
  subgraph chat["Chat tier"]
    C1[Chat host]
    C2[Chat host]
    C3[Chat host]
  end
  subgraph data["Data"]
    UDB[(Profiles + login)]
    MDB[(Message history)]
    RED[(Fast cache)]
    S3[(File bucket)]
    CDN[Edge file cache]
  end
  MOB --> DNS
  WEB --> DNS
  DNS --> LB
  LB --> C1
  LB --> C2
  LB --> C3
  C1 --> UDB
  C1 --> MDB
  C1 --> RED
  MOB -->|upload / download| CDN
  CDN --> S3
  C1 -.->|small pointer| S3

Step 5 — What each component does

Clients. The app draws bubbles, keeps a little local copy so scrolling feels smooth, and in many designs encrypts text before it leaves the pocket. The server is often a mail carrier, not a proofreader.

Load balancer. Spreads people across chat hosts. Often wants “sticky” routing so one user keeps landing on the same host until a controlled move happens. Health checks make sure dead machines stop receiving new friends.

Chat hosts. Hold the live wire, parse small packets, talk to databases and cache, push events to whoever is online. Some stacks use languages famous for many light threads; some use anything else that the team can run well. Focus on responsibilities, not language wars.

User database. Phone number, profile, device list, maybe privacy flags. Things you want strong rules around—classic SQL territory in many answers.

Message store. Huge write load, time order inside a chat matters. People often draw a wide-column store here, but you can say “partitioned log + workers” if that is what you know better.

Redis or similar. “Who is online right now,” “which host holds this socket,” “last fifty lines hot in RAM,” typing dots with a short expiry. If Redis wakes up empty, the world still survives because the real truth is on disk.

Object storage + CDN. Big encrypted blobs, thumbnails, later downloads from an edge city near the user.

Step 5 (continued) — Transport, stickiness, and heartbeats

WebSocket (wss:// over TLS) is the usual production choice: one TCP connection, full-duplex frames, small binary or JSON payloads for MESSAGE, ACK, PING/PONG. Alternatives you should name: HTTP/2 long poll or gRPC streaming for companies that hate holding sockets in the load balancer; custom binary framing on TLS when you want minimal overhead. Tradeoff: anything stateful needs session affinity or a shared connection registry (Redis hash: conn:{user_id} → {chat_host_id, channel_id} with TTL 60–120s refreshed on each client ping).

Typical client ping interval: 15–60 seconds; server closes idle sockets after 2–5× the ping window if no application traffic. Frame size caps (for example 64 KB) stop one evil client from starving others on the same process.
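The connection registry and its ping-refreshed TTL can be sketched without Redis. The class below is an in-memory stand-in, not a production design: the class name, method names, and the injectable clock are all assumptions made for illustration, and the 90-second TTL is just one value inside the 60–120s range mentioned above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

class ConnectionRegistry:
    """In-memory stand-in for the conn:{user_id} hash with a TTL (hypothetical API)."""

    def __init__(self, ttl_seconds: float = 90.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # user_id -> (chat_host_id, channel_id, expires_at)
        self._entries: Dict[int, Tuple[str, str, float]] = {}

    def register(self, user_id: int, chat_host_id: str, channel_id: str) -> None:
        self._entries[user_id] = (chat_host_id, channel_id, self._clock() + self._ttl)

    def lookup(self, user_id: int) -> Optional[Tuple[str, str]]:
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        host, channel, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[user_id]   # lazy expiry, like a TTL firing
            return None
        return host, channel

    def on_ping(self, user_id: int) -> bool:
        """Refresh the TTL; returns False if the entry already expired."""
        entry = self.lookup(user_id)
        if entry is None:
            return False
        self.register(user_id, entry[0], entry[1])
        return True


now = [0.0]                                   # fake clock so the example is deterministic
registry = ConnectionRegistry(ttl_seconds=90, clock=lambda: now[0])
registry.register(42, "c-7", "ch-9912a")
now[0] = 60.0
registry.on_ping(42)                          # ping at t=60 moves expiry to t=150
now[0] = 140.0
print(registry.lookup(42))                    # ('c-7', 'ch-9912a')
now[0] = 151.0
print(registry.lookup(42))                    # None: treat the user as offline
```

A missed ping window simply makes the user look offline, which pushes delivery onto the inbox path instead of losing anything.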

Load balancer algorithms: round-robin for stateless HTTP; consistent hash on user id (or subchannel id) for WebSocket so the same user lands on the same chat host until you drain. Health checks: TCP + optional HTTP /healthz returning 200 with build id and dependency pings (DB, Redis) with short timeouts.

Step 6 — Walk one message end to end

I tell the story in plain steps. I use fake names because “Alice and Bob” are old friends from crypto class.

Alice’s phone builds a random client_msg_id so double taps and bad Wi‑Fi do not become twin messages.
If the product uses end-to-end encryption, the phone locks the text before it rides the wire. The server sees a blob, not a love letter.
The blob rides a long-lived connection—think “open window” instead of “knock on the door every second.”
The chat host checks auth, picks the right conversation, hands work to a writer that assigns the next slot number and saves to disk before saying “saved” back to Alice.
A small directory says where Bob’s socket lives today. If Bob is awake, push now. If Bob is asleep, write an inbox row and maybe ask Apple or Google to ping the phone later.
Bob’s phone unlocks the blob locally, draws the bubble, sends “delivered.” If you support reads, another tiny message when the UI actually shows the line.

The drawing under this paragraph is the comic-book version. Read top to bottom. Notice again: writer talks to disk before it promises Alice anything permanent.

sequenceDiagram
  participant A as Alice phone
  participant E as Front session
  participant W as Message writer
  participant S as Storage
  participant B as Bob phone
  A->>E: Blob + ids
  E->>W: Trusted command
  W->>S: Save with next slot
  W->>A: Saved
  alt Bob online
    W->>B: Push blob
    B->>E: Delivered
    E->>W: Mark state
    W->>A: Bob got it
  else Bob offline
    W->>S: Queue for Bob
  end

flowchart LR
  subgraph phones["Devices"]
    PA[Alice]
    PB[Bob]
  end
  subgraph front["Cloud front"]
    LB[Balancer]
    SESS[Live sessions]
  end
  subgraph back["Cloud back"]
    WR[Writers]
    DB[(History)]
    OBJ[(Files)]
  end
  PA <-->|live line| LB
  PB <-->|live line| LB
  LB --> SESS
  SESS --> WR
  WR --> DB
  PA -->|heavy upload| OBJ
  PB -->|heavy download| OBJ
  WR -.->|pointer only| DB
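The writer's part of the story (dedupe on client_msg_id, assign the next slot, save, then ack) fits in a short sketch. Everything here is an in-memory illustration under stated assumptions: the class and field names are hypothetical, a Python list stands in for disk, and there is no concurrency control.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class StoredMessage:
    thread_id: str
    server_seq: int
    sender_id: int
    client_msg_id: str
    ciphertext: bytes

@dataclass
class MessageWriter:
    """Single-writer lane: dedupe, assign the next slot, save, then ack."""
    log: List[StoredMessage] = field(default_factory=list)      # stands in for disk
    next_seq: Dict[str, int] = field(default_factory=dict)      # per-thread counter
    seen: Dict[Tuple[int, str], StoredMessage] = field(default_factory=dict)

    def handle_send(self, thread_id: str, sender_id: int,
                    client_msg_id: str, ciphertext: bytes) -> dict:
        key = (sender_id, client_msg_id)
        if key in self.seen:                       # retry or double tap
            original = self.seen[key]
            return {"status": "SAVED", "server_seq": original.server_seq, "duplicate": True}
        seq = self.next_seq.get(thread_id, 0) + 1  # slot numbers are per thread
        msg = StoredMessage(thread_id, seq, sender_id, client_msg_id, ciphertext)
        self.log.append(msg)                       # "save to disk" happens first
        self.next_seq[thread_id] = seq
        self.seen[key] = msg
        return {"status": "SAVED", "server_seq": seq, "duplicate": False}  # ack after save


writer = MessageWriter()
first = writer.handle_send("t1", 1, "m-001", b"blob")
retry = writer.handle_send("t1", 1, "m-001", b"blob")    # same id after bad Wi-Fi
second = writer.handle_send("t1", 1, "m-002", b"blob2")
print(first["server_seq"], retry["server_seq"], second["server_seq"], len(writer.log))
# 1 1 2 2
```

The retry gets the original slot number back, so Alice's phone cannot tell a fresh save from a replayed one, which is the point.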

Step 7 — Online vs offline delivery

Bob awake. This is the happy path everyone loves. Push goes out now. Receipts fly back. Latency story is short and sweet.

Bob asleep. The message lands in durable storage with a clear owner. A push notification wakes the OS later. When Bob opens the app, the client pulls anything newer than its last slot number. You say “TTL” if you want to sound grown-up—that just means “we do not keep undownloaded server copies forever unless the law says we must.”

Most days are a mix. Your design should not break when the mix changes hour by hour.
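The "pull anything newer than the last slot number" step is a cursor loop. The sketch below assumes a hypothetical in-memory history keyed by thread; fetch_after and sync_thread are illustrative names, not an API from any real service.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory history: thread_id -> list of (server_seq, ciphertext),
# kept in ascending server_seq order by the writer.
History = Dict[str, List[Tuple[int, bytes]]]

def fetch_after(history: History, thread_id: str,
                after_seq: int, limit: int = 50) -> List[Tuple[int, bytes]]:
    """Return up to `limit` rows with server_seq greater than after_seq."""
    rows = history.get(thread_id, [])
    return [row for row in rows if row[0] > after_seq][:limit]

def sync_thread(history: History, thread_id: str, last_seen_seq: int,
                page_size: int = 50) -> List[Tuple[int, bytes]]:
    """Page through everything the device missed while it was asleep."""
    received: List[Tuple[int, bytes]] = []
    cursor = last_seen_seq
    while True:
        page = fetch_after(history, thread_id, cursor, page_size)
        if not page:
            return received
        received.extend(page)
        cursor = page[-1][0]          # advance to the highest slot we now hold


history: History = {"t1": [(seq, b"blob") for seq in range(1, 121)]}
missed = sync_thread(history, "t1", last_seen_seq=100, page_size=8)
print(len(missed), missed[0][0], missed[-1][0])   # 20 101 120
```

Because the cursor is a server slot number rather than a timestamp, a phone with a drifting clock still resumes in the right place.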

Step 8 — Storage: match data to the right store

One tool rarely wins every job. Sort the data into four buckets; each is explained in words a friend outside tech could repeat at dinner.

What you store	What you want	What people often say on the board
Accounts, contacts, settings	Strong rules, careful joins, “do not lose this”	SQL family (Postgres, MySQL, …)
Huge message history	Insane write rate, time order inside a chat, cheap growth sideways	Wide-column or sharded stores (Cassandra-style thinking, or your favorite)
“Who is online,” hot previews	Sub-millisecond reads, okay if it vanishes on restart	Redis-style memory store
Photos and video bytes	Cheap disk, big pipes, CDN friendliness	Object store + CDN

If you only remember one sentence: match the tool to the shape of the pain. Profiles hurt if they are wrong; chat history hurts if it is slow; a cache that goes missing is fine.

Step 9 — Schemas, keys, and indexes

Relational core (Postgres-style)

Strong typing, foreign keys, and UNIQUE constraints give you correctness for money-adjacent data. Example sketch (types simplified):

CREATE TABLE users (
  user_id         BIGSERIAL PRIMARY KEY,
  phone_e164      TEXT NOT NULL UNIQUE,
  display_name    TEXT,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE devices (
  device_id       UUID PRIMARY KEY,
  user_id         BIGINT NOT NULL REFERENCES users(user_id),
  push_token      TEXT,
  last_seen_at    TIMESTAMPTZ
);

CREATE TABLE threads (
  thread_id       UUID PRIMARY KEY,
  kind            TEXT NOT NULL CHECK (kind IN ('direct','group')),
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE thread_members (
  thread_id       UUID NOT NULL REFERENCES threads(thread_id),
  user_id         BIGINT NOT NULL REFERENCES users(user_id),
  role            TEXT NOT NULL DEFAULT 'member',
  member_version  BIGINT NOT NULL DEFAULT 1,
  PRIMARY KEY (thread_id, user_id)
);

CREATE TABLE messages (
  thread_id       UUID NOT NULL REFERENCES threads(thread_id),
  server_seq      BIGINT NOT NULL,
  sender_id       BIGINT NOT NULL,
  client_msg_id   UUID NOT NULL,
  ciphertext      BYTEA,
  attachment_id   UUID,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (thread_id, server_seq),   -- total order inside a thread
  UNIQUE (sender_id, client_msg_id)      -- idempotency key for retries
);
-- No extra index is needed for "latest first" reads: Postgres can scan the
-- primary-key btree on (thread_id, server_seq) backward.

Wide-column / log-style message store (Cassandra-style)

Partition by thread_id so all rows for one conversation sit together; cluster by message_id (a time-UUID generated on the server, never on the handset) descending for fast “latest first” scans. Example CQL shape:

CREATE TABLE messages_by_thread (
  thread_id       UUID,
  message_id      TIMEUUID,
  sender_id       BIGINT,
  ciphertext      BLOB,
  attachment_id   UUID,
  PRIMARY KEY ((thread_id), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

TTL on message cells (for example 30 days on server-held ciphertext) is a product plus legal choice, not a physics law—state what you assume.

Redis keys (examples)

HSET session:{device_id} user_id 123 chat_host "c-7" conn_id "ch-9912a"
EXPIRE session:{device_id} 86400
SETEX presence:{user_id} 45 "online"
LPUSH recent:{thread_id} "{...}"
LTRIM recent:{thread_id} 0 49

Use pipelines or Lua scripts when you must read presence and push in one round trip to cut tail latency.

Step 10 — Redis and the hot scratchpad

Redis (or anything like it) is the sticky note on the fridge. Online status, typing flashes, “which server holds Bob today,” maybe the last screen of messages for speed. The fridge note can fall on the floor; the wedding contract still lives in the filing cabinet (disk). Remember: do not store the only copy of a bank transfer inside Redis with no backup story.

Step 11 — End-to-end encryption (building blocks)

Modern messengers often follow the Signal protocol family. You do not derive every equation in a design review; you name the moving parts and what the server never sees.

X3DH (first message to a new device)

Each device publishes a long-term identity key, a medium-term signed pre-key, and a pool of one-time pre-keys. The sender fetches a bundle from the server (public keys only), runs three Elliptic-Curve Diffie-Hellman exchanges (four when a one-time pre-key is available), mixes the outputs through a KDF to get a shared secret, then sends the first ciphertext plus its own identity and ephemeral public keys and identifiers for the pre-keys it used, so the receiver can recompute the same secret. Server stores public material only.

Double Ratchet (after the session exists)

Two ratchets: a symmetric chain advances per message for forward secrecy inside the session; a DH ratchet steps each time a message arrives carrying a new ratchet public key from the other side, injecting fresh entropy from a new ECDH result (“self-healing” after compromise). Keys are deleted after use where the protocol specifies it. Mention that explicitly; it is why old messages resist later key theft.
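The symmetric half of the ratchet is small enough to sketch with the standard library. This is an illustration only, not usable cryptography: it covers just the per-message chain (no DH ratchet, no AEAD encryption, no out-of-order handling), the class name is hypothetical, and the shared secret is a hard-coded demo value. The single-byte HMAC inputs 0x01 and 0x02 follow the suggestion in the published Double Ratchet specification.

```python
import hashlib
import hmac
from typing import Tuple

def kdf_chain_step(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Derive (next_chain_key, message_key) from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key

class SymmetricChain:
    """One sending or receiving chain; the old chain key is replaced each step."""

    def __init__(self, initial_chain_key: bytes) -> None:
        self._chain_key = initial_chain_key
        self.index = 0

    def next_message_key(self) -> bytes:
        # Overwrite the chain key so earlier message keys cannot be re-derived
        # from current state (Python cannot guarantee the old bytes are wiped).
        self._chain_key, message_key = kdf_chain_step(self._chain_key)
        self.index += 1
        return message_key


# Assumption: both sides already share this 32-byte secret from the handshake.
shared = hashlib.sha256(b"demo shared secret, not a real key").digest()
alice_sending = SymmetricChain(shared)
bob_receiving = SymmetricChain(shared)

keys_alice = [alice_sending.next_message_key() for _ in range(3)]
keys_bob = [bob_receiving.next_message_key() for _ in range(3)]
print(keys_alice == keys_bob, len(set(keys_alice)))   # True 3
```

Both sides land on the same key for message n without ever sending it, and each key is different from the last.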

What the server can still see (metadata)

Usually ciphertext / opaque to server	Still visible for routing & ops
Message body bytes, attachment bytes when client-encrypted	Sender id, receiver id or thread id, timestamps, frame sizes, delivery state, IP / ASN for abuse, push tokens

That split drives legal retention, abuse tooling, and “can we search message text server-side?” (usually no, for strict E2EE products).

Step 12 — Groups: fan-out and tradeoffs

In the strongest privacy story, Alice’s phone may encrypt one copy per member, because each member has different keys. The server then does mail delivery: many small envelopes, same dinner reservation inside each scrambled shell.

That pattern is beautiful for privacy and painful for CPU and battery on huge groups. So real products compress, batch, and sometimes cap group size or degrade gracefully. Name that tradeoff plainly—large groups are where fan-out cost shows up in CPU, battery, and ops.

Complexity sketch: for N members, the sender device performs O(N) encryptions and uploads O(N) ciphertext frames (or one batched RPC that expands server-side). Server CPU is mostly routing and fan-out to sockets, not decrypting. For N > 256, move to background fan-out workers fed from a queue keyed by thread_id so the HTTP or WS request returns fast with 202 Accepted and an optional job_id.

flowchart TD
  POST[Alice posts to group]
  POST --> MEM[Load members]
  MEM --> EACH{Each member}
  EACH -->|online| LIVE[Push now]
  EACH -->|offline| SAVE[Save for later]
  EACH -->|no app open| PUSH[OS push ping]
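The fan-out decision in the flowchart, plus the N > 256 cutoff from the complexity sketch, might look like this on the server. Treat it as a sketch under assumptions: the function names, the response dictionaries, and the 200 status for the inline case are hypothetical; only the 202 for queued work and the threshold come from the text.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Set

ASYNC_FANOUT_THRESHOLD = 256   # the N > 256 cutoff from the text

def plan_fanout(members: Iterable[int], online: Set[int],
                sender_id: int) -> Dict[str, List[int]]:
    """Split recipients into push-now and save-for-later buckets."""
    plan: Dict[str, List[int]] = {"push_now": [], "save_and_notify": []}
    for user_id in members:
        if user_id == sender_id:
            continue                      # the sender already has the message
        bucket = "push_now" if user_id in online else "save_and_notify"
        plan[bucket].append(user_id)
    return plan

def submit_group_send(thread_id: str, members: List[int], online: Set[int],
                      sender_id: int, queue: Deque[dict]) -> dict:
    """Fan out inline for small groups; enqueue a job for large ones."""
    if len(members) > ASYNC_FANOUT_THRESHOLD:
        queue.append({"thread_id": thread_id, "sender_id": sender_id})
        return {"status": 202, "mode": "queued"}
    return {"status": 200, "mode": "inline",
            "plan": plan_fanout(members, online, sender_id)}


jobs: Deque[dict] = deque()
small = submit_group_send("t-small", [1, 2, 3, 4], {2, 3}, sender_id=1, queue=jobs)
large = submit_group_send("t-large", list(range(1000)), set(), sender_id=0, queue=jobs)
print(small["plan"], large["status"], len(jobs))
# {'push_now': [2, 3], 'save_and_notify': [4]} 202 1
```

The queued job carries only thread_id and sender_id, so a worker can reload the roster at run time instead of trusting a stale member list.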

Step 13 — Media: the side door

Compress on the phone when you can—people pay for data by the drop. Encrypt if that is your product promise. Ask the server for permission to upload (signed URL is the usual trick). Put bytes in the bucket. Send a skinny chat row that only carries a pointer, a thumbnail preview, and keys wrapped for the receiver.

Download often hits the CDN first so Mumbai and Montreal do not both yank the same file from one lonely Oregon disk. Say “origin vs edge” once; you sound like you shipped something before.

flowchart TB
  UI[Chat app]
  CHAT[Chat service]
  META[(Small metadata)]
  CDN[Nearby edge cache]
  BUCKET[(Central file store)]
  UI -->|1 ask for upload ticket| CHAT
  CHAT --> META
  UI -->|2 put bytes| CDN
  CDN --> BUCKET
  CHAT -.->|pointer + key ref| META

Upload mechanics (S3-style)

Large objects use multipart upload: typical part size 5–32 MB, parallel parts capped by CPU and memory on the phone. Server returns uploadId plus presigned PUT URLs per part; client sends Content-MD5 or checksum header if you want integrity without re-reading the whole file server-side. Complete with ordered ETag list; abort stale uploads after 24–48h.

Chat row stores attachment_id, content_type, byte_length, thumbnail_key, and a wrapped media key (key material encrypted with the per-chat symmetric ratchet, not sent in plain to object storage). CDN serves GET with short-lived signed cookies or query strings. Set Cache-Control: public, max-age=… only for blobs you are willing to serve without auth; for user-private media you usually keep CDN auth on.
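Before asking for presigned part URLs, the client has to decide how to slice the file. A minimal planning sketch follows, assuming S3-style limits (5 MiB minimum for every part except the last, at most 10,000 parts); plan_parts and its default part size are hypothetical choices, and no network call is made.

```python
from typing import List, Tuple

MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB      # S3 minimum for every part except the last
MAX_PARTS = 10_000           # S3 maximum number of parts per upload

def plan_parts(byte_length: int, part_size: int = 8 * MIB) -> List[Tuple[int, int, int]]:
    """Return (part_number, start_offset, end_offset_exclusive) for each part."""
    if byte_length <= 0:
        raise ValueError("byte_length must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    part_count = -(-byte_length // part_size)          # ceiling division
    if part_count > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return [
        (number, (number - 1) * part_size, min(number * part_size, byte_length))
        for number in range(1, part_count + 1)         # part numbers start at 1
    ]


parts = plan_parts(20 * MIB + 123, part_size=8 * MIB)
print(len(parts), parts[-1][2] - parts[-1][1])   # 3 4194427
```

The part numbers from this plan are the same ones that pair with ETags in the ordered list sent to complete the upload.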

Step 14 — Voice and video calls

Text rides small packets. A call is a river of RTP-style media packets with jitter buffers (20–80 ms target depth) and packet loss concealment (PLC) for audio gaps.

Signaling path (over your existing TLS socket or a small HTTPS service): exchange SDP offer/answer (session description: codecs, ICE ufrag/pwd, fingerprint for DTLS-SRTP), trickle ICE candidates (host, srflx from STUN, relay from TURN). TURN allocates relay ports; credentials time out.

Media path: prefer UDP direct P2P; fallback to TURN over UDP then TCP. Opus for audio (16–32 kb/s voice), H.264 or VP8/VP9 for video with simulcast layers if you want adaptive bitrate without full renegotiation every second.

Your chat tier carries signaling + call state; it should not become a media relay except when TURN is in play—capacity plan TURN bandwidth separately (often $$$).

Step 15 — Growing the fleet

Geo DNS + anycast steer clients to a nearby front; health-checked records drop bad POPs.

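Consistent hashing: map both nodes and keys onto a ring (0..2³²−1 or 0..2¹⁶⁰−1). Each physical node owns V virtual nodes (often 100–200) to smooth hot spots. Key k routes to the first node clockwise at or after hash(k), wrapping around at the end of the ring. Adding one node to N existing nodes moves about keys / (N + 1) in expectation, all of it onto the new node; each of its V virtual nodes takes roughly keys / ((N + 1)·V), so more virtual nodes lower the variance, not the total movement. State that formula out loud.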
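A small ring makes the key-movement claim easy to check by experiment. The sketch below is one possible implementation, not the way any particular product does it: it uses the first 4 bytes of SHA-256 as the ring position, 150 virtual nodes per host, and hypothetical host and key names.

```python
import bisect
import hashlib
from typing import Dict, List

def _point(label: str) -> int:
    """Map a string onto the ring 0..2**32-1."""
    return int.from_bytes(hashlib.sha256(label.encode()).digest()[:4], "big")

class HashRing:
    def __init__(self, virtual_nodes: int = 150) -> None:
        self._virtual_nodes = virtual_nodes
        self._points: List[int] = []          # sorted ring positions
        self._owner: Dict[int, str] = {}      # ring position -> physical node

    def add_node(self, node: str) -> None:
        for i in range(self._virtual_nodes):
            point = _point(f"{node}#{i}")
            if point in self._owner:
                continue                      # ignore the rare 32-bit collision
            bisect.insort(self._points, point)
            self._owner[point] = node

    def route(self, key: str) -> str:
        """First virtual node clockwise at or after hash(key), wrapping at the end."""
        index = bisect.bisect_left(self._points, _point(key))
        return self._owner[self._points[index % len(self._points)]]


ring = HashRing()
for name in ("chat-1", "chat-2", "chat-3", "chat-4"):
    ring.add_node(name)

keys = [f"user-{i}" for i in range(20_000)]
before = {k: ring.route(k) for k in keys}
ring.add_node("chat-5")
after = {k: ring.route(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
moved_fraction = len(moved) / len(keys)
only_to_new = all(after[k] == "chat-5" for k in moved)
print(f"moved {moved_fraction:.1%}; all moved keys went to chat-5: {only_to_new}")
```

With four hosts growing to five, the moved fraction should come out near one fifth, and no key moves between the old hosts.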

Replication: message store often uses quorum writes (for example replication factor 3 with W=2, R=2 so that W + R > 3, or LOCAL_QUORUM in one region) with RPO measured in seconds of unreplicated tail if you async-replicate cross region; RTO minutes if you automate failover. Say numbers you can defend, not magic zeros.

Cache layers: L1 SQLite on device (pages of history), L2 Redis hot set (<50 ms reads), L3 CDN for immutable ciphertext blobs with auth at edge.

Step 16 — Security engineering

Edge: TLS 1.3 only, strong cipher suites, OCSP stapling, optional certificate pinning in mobile binaries. mTLS between internal services (chat → writer, writer → DB) so a stolen east-west packet is not enough.

AuthN: short-lived JWT access tokens (5–15 min) + rotating refresh tokens bound to device id; store only hashes server-side. OAuth2 device flow if you have headless clients.

AuthZ: check thread_id membership on every write; use RBAC for admin APIs. Return 403 vs 404 deliberately so existence does not leak when privacy matters.

Rate limits: token bucket per user_id and per ip on login; sliding window on POST /messages (for example 60/min for bursts, 300/hour sustained). Return 429 with Retry-After header.
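A token bucket is only a counter and a timestamp. The version below is a single-process sketch with an injectable clock so it can be tested; the class name and the idea of returning the wait time as a Retry-After hint are assumptions, and a real deployment would keep the state somewhere shared.

```python
import time
from typing import Callable

class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.refill_per_second)
        self._last = now

    def try_acquire(self) -> float:
        """Return 0.0 if allowed, else seconds to wait (a Retry-After hint)."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return 0.0
        return (1.0 - self._tokens) / self.refill_per_second


now = [0.0]                                          # fake clock for a deterministic demo
bucket = TokenBucket(capacity=5, refill_per_second=1.0, clock=lambda: now[0])
results = [bucket.try_acquire() for _ in range(6)]   # sixth call is over the limit
now[0] = 2.0                                         # two seconds later
later = bucket.try_acquire()
print(results, later)   # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0] 0.0
```

A nonzero return value is what the handler would round up and send back alongside the 429.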

Payload limits: max JSON body 64 KB for control messages; reject oversize frames at the WebSocket layer before JSON parse to avoid parser bombs.

Step 17 — Failure modes and idempotency

Duplicates. Networks lie. Build idempotent handlers—same id twice should not create two rows.

Clocks. Phones drift. Order by server slot numbers, not by “whatever time the handset felt like.”

Membership races. Version your roster so old packets cannot bring back someone you kicked.

Delete for real. Deletion runs as queued jobs that walk replicas and CDN edges; say “eventual” instead of fairy tales.

Idempotent write pattern (SQL)

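-- server_seq chosen in same txn via SELECT … FOR UPDATE on a thread row
-- or from a dedicated sequencer service; client_msg_id is the idempotency key
INSERT INTO messages (thread_id, server_seq, sender_id, client_msg_id, ciphertext)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (sender_id, client_msg_id) DO NOTHING
RETURNING server_seq;
-- On a duplicate, DO NOTHING returns zero rows. Look up the original so the
-- retry gets the same answer as the first attempt:
-- SELECT server_seq FROM messages WHERE sender_id = $3 AND client_msg_id = $4;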

Pair with an outbox table (message_id, publish_state) if downstream search indexes must never miss a row: transactionally insert message + outbox row, worker publishes to Kafka, marks published.
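The outbox pairing can be tried end to end with SQLite from the standard library. This is a sketch with loud assumptions: SQLite stands in for Postgres, the outbox is keyed by (thread_id, server_seq) instead of a separate message_id, server_seq comes from a MAX()+1 query that is only safe with a single writer, and the publish callback is a placeholder for a real Kafka producer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (
  thread_id TEXT NOT NULL, server_seq INTEGER NOT NULL,
  sender_id INTEGER NOT NULL, client_msg_id TEXT NOT NULL, ciphertext BLOB,
  PRIMARY KEY (thread_id, server_seq),
  UNIQUE (sender_id, client_msg_id)
);
CREATE TABLE outbox (
  thread_id TEXT NOT NULL, server_seq INTEGER NOT NULL,
  publish_state TEXT NOT NULL DEFAULT 'pending',
  PRIMARY KEY (thread_id, server_seq)
);
""")

def save_message(thread_id: str, sender_id: int, client_msg_id: str, ciphertext: bytes) -> int:
    """Insert message + outbox row in one transaction; a retry returns the original seq."""
    with conn:  # commits on success, rolls back on exception
        row = conn.execute(
            "SELECT server_seq FROM messages WHERE sender_id = ? AND client_msg_id = ?",
            (sender_id, client_msg_id)).fetchone()
        if row:
            return row[0]
        seq = conn.execute(
            "SELECT COALESCE(MAX(server_seq), 0) + 1 FROM messages WHERE thread_id = ?",
            (thread_id,)).fetchone()[0]
        conn.execute("INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
                     (thread_id, seq, sender_id, client_msg_id, ciphertext))
        conn.execute("INSERT INTO outbox (thread_id, server_seq) VALUES (?, ?)",
                     (thread_id, seq))
        return seq

def publish_pending(publish) -> int:
    """Worker pass: publish each pending row, then mark it published."""
    rows = conn.execute(
        "SELECT thread_id, server_seq FROM outbox WHERE publish_state = 'pending' "
        "ORDER BY thread_id, server_seq").fetchall()
    for thread_id, seq in rows:
        publish(thread_id, seq)          # placeholder for a producer send
        with conn:
            conn.execute("UPDATE outbox SET publish_state = 'published' "
                         "WHERE thread_id = ? AND server_seq = ?", (thread_id, seq))
    return len(rows)


sent = []
a = save_message("t1", 1, "m-001", b"blob")
b = save_message("t1", 1, "m-001", b"blob")      # retry: same seq, no new rows
c = save_message("t1", 2, "m-002", b"blob")
first_run = publish_pending(lambda t, s: sent.append((t, s)))
second_run = publish_pending(lambda t, s: sent.append((t, s)))
print(a, b, c, first_run, second_run, sent)
# 1 1 2 2 0 [('t1', 1), ('t1', 2)]
```

If the worker dies between publishing and marking a row, the next pass publishes it again, so downstream consumers should expect at-least-once delivery and dedupe on (thread_id, server_seq).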

Step 18 — Kafka ordering and partitions

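For strict per-thread order, assign Kafka partition key = thread_id so every event for a thread lands in one partition. Many threads share a partition, and each still keeps its own order. If one celebrity thread would saturate a single broker, you can spread it over a small fixed set of partitions, but that thread then loses strict order unless consumers re-sort by server_seq. Within a consumer group, one consumer per partition preserves order; scale consumers with partition count.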

Producer settings: acks=all, enable.idempotence=true, max.in.flight.requests.per.connection ≤ 5 (the limit under which the idempotent producer still preserves order on retry). Consumer: process batch, commit offset only after durable write to Cassandra/Postgres.

If partition order alone is not enough (for example you need gap-free server_seq values, or several producers race on one thread), use a sequencer service (per-thread single writer with compare-and-swap on expected_seq) and accept that as a central bottleneck you must shard by geography or thread popularity.
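Keying by thread_id can be simulated without a broker. In the sketch below the partition count is hypothetical and CRC32 is a stand-in hash; Kafka's default partitioner for keyed records uses murmur2, so real partition numbers would differ while the ordering property stays the same.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_PARTITIONS = 8   # hypothetical topic size

def partition_for(thread_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable key -> partition mapping (CRC32 here as a stand-in hash)."""
    return zlib.crc32(thread_id.encode("utf-8")) % num_partitions

def produce(events: List[Tuple[str, int]]) -> Dict[int, List[Tuple[str, int]]]:
    """Append each (thread_id, server_seq) event to its partition in arrival order."""
    partitions: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for thread_id, server_seq in events:
        partitions[partition_for(thread_id)].append((thread_id, server_seq))
    return partitions


# Interleaved sends from three threads.
events = [("t-a", 1), ("t-b", 1), ("t-a", 2), ("t-c", 1), ("t-b", 2), ("t-a", 3)]
partitions = produce(events)

per_thread = {}
for thread_id in ("t-a", "t-b", "t-c"):
    log = partitions[partition_for(thread_id)]
    per_thread[thread_id] = [seq for tid, seq in log if tid == thread_id]
    print(thread_id, "-> partition", partition_for(thread_id), "seqs", per_thread[thread_id])
```

Each thread's sequence numbers come back in send order even when several threads share a partition, which is all a per-partition consumer needs.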

Step 19 — HTTP and WebSocket contracts

Control-plane REST complements the duplex socket. Typical verbs and status codes:

Operation	HTTP	Success	Common errors
Create thread	`POST /v1/threads`	`201` + `Location`	`400` bad body, `409` duplicate direct thread
List messages	`GET /v1/threads/{id}/messages?after_seq=&limit=50`	`200` + cursor header `X-Next-Seq`	`403` not a member, `416` if seq out of range (optional)
Send message (HTTP fallback)	`POST /v1/threads/{id}/messages`	`201`	`409` duplicate `client_msg_id`, `429` rate limit
Begin upload	`POST /v1/attachments:initiateMultipart`	`200` JSON with `uploadId`, part URLs	`413` quota exceeded

Example request body for send (E2EE product stores ciphertext only):

POST /v1/threads/7c9e…/messages
Authorization: Bearer eyJ…
Content-Type: application/json

{
  "client_msg_id": "2f5c1b8e-…",
  "kind": "text",
  "ciphertext": "base64-or-binary-in-json",
  "reply_to_seq": 1204
}

WebSocket binary frame example (conceptual; the leading byte is an application-level type, not the WebSocket protocol opcode): 0x01 envelope = {type:"MSG", thread_id, server_seq, ciphertext}; 0x02 = {type:"ACK", client_msg_id, status:"DELIVERED"}. Version your schema (proto v3) so old apps do not explode.
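One way to picture the envelope is a tiny encoder and decoder. The layout here (one version byte, one type byte, then UTF-8 JSON) is an assumption made for the sketch, as are the function names; only the 0x01/0x02 type values, the field names, and the 64 KB cap come from the text.

```python
import json
from typing import Tuple

MAX_FRAME_BYTES = 64 * 1024          # the 64 KB cap mentioned for control frames
TYPE_MSG, TYPE_ACK = 0x01, 0x02      # application-level type bytes from the text
SCHEMA_VERSION = 1                   # hypothetical version byte

def encode_frame(frame_type: int, payload: dict) -> bytes:
    """Assumed layout: 1 version byte, 1 type byte, then UTF-8 JSON."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    frame = bytes([SCHEMA_VERSION, frame_type]) + body
    if len(frame) > MAX_FRAME_BYTES:
        raise ValueError("frame exceeds size cap")
    return frame

def decode_frame(frame: bytes) -> Tuple[int, dict]:
    """Check size and header before parsing any JSON."""
    if len(frame) > MAX_FRAME_BYTES:
        raise ValueError("frame exceeds size cap")
    if len(frame) < 2 or frame[0] != SCHEMA_VERSION:
        raise ValueError("unsupported schema version")
    if frame[1] not in (TYPE_MSG, TYPE_ACK):
        raise ValueError("unknown frame type")
    return frame[1], json.loads(frame[2:].decode("utf-8"))


ack = encode_frame(TYPE_ACK, {"type": "ACK", "client_msg_id": "m-001", "status": "DELIVERED"})
frame_type, payload = decode_frame(ack)
print(frame_type, payload["status"])   # 2 DELIVERED
```

Rejecting on length and header bytes first means an oversize or unknown frame never reaches the JSON parser.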

Step 20 — Data model invariants

Tables you still draw: users, devices, threads, thread_members, messages, deliveries (optional per device), attachments, outbox_for_indexer (optional). Invariants worth saying aloud:

(sender_id, client_msg_id) is globally unique → dedupe retries.
(thread_id, server_seq) is unique → total order inside a thread.
Foreign keys from messages.thread_id and thread_members.thread_id to threads.
member_version (or epoch) on roster rows increases monotonically on admin edits → stale fan-out jobs abort.
Attachment row references object key + sha256 for dedupe across forwards.

Hot thread mitigation: shard by hash(thread_id) to N physical keyspaces; pin ultra-hot threads to dedicated partitions so they cannot evict cold neighbors from page cache.

Step 21 — Observability and SLOs

Golden signals: request rate, error rate (HTTP 5xx + WS abnormal close), latency histogram (p50/p95/p99) for send→ack and send→delivered, saturation (CPU, event loop lag, Kafka consumer lag, DB thread pool queue depth).

Example SLO targets for the control plane (tune to product): p99 send latency < 500 ms, availability 99.9% monthly, at most 0.001% durable message loss backed by quorum + backups. Alert on SLO burn rate over short windows, not only raw CPU.
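Burn rate is just the observed error ratio divided by the ratio the SLO allows. A minimal sketch, assuming a 30-day window and a made-up five-minute sample; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window."""
    return (1.0 - slo) * window_days * 24 * 60

def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error ratio over allowed ratio; 1.0 spends the budget exactly on schedule."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - slo)


SLO = 0.999
print(f"budget: {error_budget_minutes(SLO):.1f} min per 30 days")   # budget: 43.2 min per 30 days
# Hypothetical five-minute window: 1,200 failed sends out of 100,000.
print(f"burn rate: {burn_rate(1_200, 100_000, SLO):.1f}x")          # burn rate: 12.0x
```

At 12x the monthly budget would be gone in about two and a half days, which is the kind of number worth paging on even when CPU looks calm.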

Tracing: propagate trace_id from edge through writer to DB commit span; tag with thread_id, shard_id. Logs: structured JSON, never log ciphertext or raw tokens.

Step 22 — Goals → knobs (quick reference)

Goal in normal words	Knob you can actually turn
Feels instant	Regional hosts, small packets on the live wire, CDN for media, warm caches
Hard to lose	Write before you ack, backups, drills where you pretend a data center died
Stays readable in one chat	Rising slot per thread, single writer lane per hot shard, dedupe on client ids
Stays up on a bad day	Spare capacity, health checks, failover with honest RPO/RTO words
Does not leak like a sieve	TLS, rate limits, device-bound tokens, E2EE story if in scope

Step 23 — Close the loop (what to practice)

On a whiteboard: three paths, one message story (online + offline), label SQL vs Cassandra vs Redis vs S3.

Out loud: five functional requirements and which non-functional target applies to live vs durable paths.

With the technical section: trace POST /v1/threads/{id}/messages and the WebSocket MSG frame.

The one line to remember

A messenger is three systems: a live wire for control, a side door for media, and a durable ledger for history. Respect those boundaries and billion-user chat stays understandable.