How Slack works at scale
Team chat looks like typing in a box. At scale it is a workspace-scoped event system: messages must land in the right channel or DM, show up in milliseconds for online teammates, stay searchable for years, and respect enterprise permissions—while bots and integrations fire in parallel.
We work through the design in order—requirements first, numbers second, architecture third, APIs last—using a Slack-class product as the mental model, not any one company’s private implementation.
What you should be able to do after reading:
- Separate the three loops—messaging, search, and workspace/access—and map services to each.
- List functional and non-functional requirements for members, admins, and apps.
- Walk one message: post → persist → fan-out → WebSocket push → mobile notification.
- Explain channels vs threads vs DMs, client_msg_id dedupe, and why search is async.
- Read the technical section: Web API methods, Events API, and Socket Mode payloads.
Step 0 — How we will work through the problem
Ordered thinking beats memorizing a purple screenshot. Use this sequence when you design team messaging:
- Clarify scope. Chat only, or calls/huddles? Enterprise Grid? Slack Connect shared channels? E2EE?
- Write requirements. Functional = post, thread, notify. Non-functional = delivery latency, retention, compliance export.
- Do napkin math. Messages per workspace per day, WebSocket connections, index growth.
- Draw three loops before naming Vitess or Elasticsearch.
- Tell one story—@mention in #engineering—then edit, delete, and offline mobile push.
flowchart LR
subgraph msg [Messaging loop]
POST[chat.postMessage] --> STORE[(Message store)]
STORE --> FAN[Fan-out]
FAN --> WS[WebSocket gateway]
end
subgraph search [Search loop]
STORE --> IDX[Index pipeline]
IDX --> ES[(Search cluster)]
end
subgraph access [Workspace loop]
ACL[Permissions] --> POST
ACL --> WS
APP[Apps / bots] --> POST
end
Step 1 — Functional requirements (members, admins, apps)
| Actor | Requirement | Why scale makes it hard |
|---|---|---|
| Member | Channels (public/private), DMs, group DMs | Membership lists drive fan-out |
| Member | Threads, reactions, emoji, formatting (Block Kit) | Extra rows + render graph per message |
| Member | @user, @channel, @here mentions | Notification storms in large channels |
| Member | Edit/delete within policy window | Tombstones + index updates |
| Member | Files, snippets, link unfurls | Async preview fetchers; virus scan |
| Member | Search messages and files | Inverted index lag behind write |
| Admin | Roles, retention, legal hold, export | Compliance vs delete-for-everyone |
| Admin | SSO, SCIM provisioning, audit logs | Enterprise Grid federation |
| App | Bots, slash commands, interactive buttons | Outbound webhooks with retries |
| External | Slack Connect / shared channels | Cross-org ACL and data residency |
Functional details worth stating clearly
Conversation id is the shard key. Channel C123, DM D456, thread parent ts—ordering is per conversation.
Events vs REST. Clients mutate via Web API; realtime updates arrive as typed events over WebSocket.
Out of scope today (say it aloud). Building Zoom from scratch inside chat, or full E2EE for all workspaces—state if excluded.
Step 2 — Non-functional requirements (engineering promises)
| Category | Target (typical) | How we meet it | If we miss it |
|---|---|---|---|
| Latency — message visible | p95 < 500 ms online | WebSocket push after durable write | “Slack feels slow” |
| Latency — search | Seconds acceptable | Async indexer; near-real-time | Users cannot find decisions |
| Durability | No lost acknowledged posts | Write-before-ACK; replicated DB | Lost incident timelines |
| Order | Per-conversation total order | ts timestamps with tie-break | Scrambled thread reads |
| Availability | 99.9%+ chat monthly | Regional gateways, DB failover | Work stops |
| Scale — large channel | #general with 10k members | Mention limits, fan-out batching, read-only modes | Notification + WS meltdown |
| Security | Token scopes, workspace isolation | OAuth bot scopes; Grid boundaries | Data leak across teams |
Key idea: Realtime delivery and searchable history pull in opposite directions—optimize the hot path for WebSockets; accept seconds of search lag with clear UX.
Step 3 — Napkin math (messages, sockets, and index)
- ~750k+ organizations on Slack-scale products (order of magnitude); many workspaces 50–5000 users.
- Active user sends 30–80 messages/day in heavy teams → millions of messages/hour globally.
- Each message ~0.5–2 KB metadata + text; files are MB objects referenced by id.
- WebSockets: 500k concurrent connections might need hundreds of gateway nodes with sticky routing.
- Search index size ≈ message text + extracted file content—often comparable to primary store over years.
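The bullets above can be turned into a quick calculation. This is a minimal sketch: the daily active user count and the per-node socket capacity are illustrative assumptions, not published figures, and the other inputs are picked from the ranges in the list.

```python
# Napkin math sketch. Every input below is an illustrative assumption.
DAILY_ACTIVE_USERS = 10_000_000    # assumed, not a published figure
MSGS_PER_USER_PER_DAY = 50         # inside the 30-80 range for heavy teams
BYTES_PER_MESSAGE = 1_000          # inside the 0.5-2 KB range
CONCURRENT_SOCKETS = 500_000       # from the WebSocket bullet
SOCKETS_PER_GATEWAY_NODE = 2_500   # assumed per-node capacity

msgs_per_day = DAILY_ACTIVE_USERS * MSGS_PER_USER_PER_DAY
msgs_per_hour = msgs_per_day // 24
msgs_per_second = msgs_per_day / 86_400
storage_gb_per_day = msgs_per_day * BYTES_PER_MESSAGE / 1e9
# Ceiling division: a partially filled node still counts as a node.
gateway_nodes = -(-CONCURRENT_SOCKETS // SOCKETS_PER_GATEWAY_NODE)

print(f"{msgs_per_hour:,} messages/hour")
print(f"{msgs_per_second:,.0f} messages/second on average")
print(f"{storage_gb_per_day:,.0f} GB/day of message rows (files excluded)")
print(f"{gateway_nodes} gateway nodes")
```

Under these assumptions the result lands in the "millions of messages/hour" and "hundreds of gateway nodes" ranges quoted above; peak traffic would be a multiple of the average.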
Step 4 — Architecture: three loops
Edge API authenticates OAuth tokens. Chat service validates ACL, assigns ts, writes message row.
Fan-out service determines recipients (channel members minus mutes), pushes to gateway sessions and notification queue.
Search pipeline consumes events into Elasticsearch/OpenSearch. File service handles uploads to object storage.
flowchart TB
subgraph clients [Clients]
DESK[Desktop / web]
MOB[Mobile]
end
subgraph edge [Edge]
LB[Load balancer]
API[Web API]
GW[Realtime gateway]
end
subgraph core [Core]
CHAT[Chat / messages]
GRAPH[Membership]
NOTIF[Notifications]
FILES[Files]
end
subgraph data [Data]
DB[(Sharded SQL / NoSQL)]
R[("Redis presence")]
OBJ[(Object store)]
SRCH[(Search index)]
end
DESK --> LB
MOB --> LB
LB --> API --> CHAT
CHAT --> DB
CHAT --> GRAPH
CHAT --> GW
CHAT --> NOTIF
API --> FILES --> OBJ
CHAT -.->|message.created| SRCH
GW --> R
Step 5 — Walk one message end to end
User posts “Deploying v2.3” in #releases with @here.
- Client generates client_msg_id; calls chat.postMessage with channel, blocks, text.
- API checks bot/user token scopes, channel membership, posting permissions.
- Chat service allocates ts (a sortable string with microsecond-style resolution), inserts the row, returns ok: true.
- Fan-out loads channel member ids; filters muted/absent; enqueues gateway pushes and mobile push jobs.
- Gateway emits a message event to online sessions subscribed to workspace + channel.
- Indexer asynchronously indexes text for search; link unfurl worker fetches OG tags in the background.
- For @here, the notification service applies policy (only active members? rate limit?) before APNs/FCM.
sequenceDiagram
participant C as Client
participant A as Web API
participant M as Chat service
participant G as Gateway
participant N as Notify
C->>A: chat.postMessage
A->>M: ACL + insert
M-->>A: ts + ok
A-->>C: 200 response
M->>G: message event
M->>N: push jobs
G-->>C: WebSocket message
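The walk above can be sketched as a toy in-memory service. Everything here is hypothetical (the `ChatService` class, its fields, the error string), and the fan-out runs inline for readability, whereas the design above would enqueue it after the ACK.

```python
from dataclasses import dataclass, field

@dataclass
class ChatService:
    """Hypothetical stand-in for chat service, gateway, and notifier."""
    members: dict                      # channel_id -> set of member user ids
    online: set                        # users with a live WebSocket
    store: list = field(default_factory=list)
    socket_events: list = field(default_factory=list)
    push_jobs: list = field(default_factory=list)
    _seq: int = 0

    def post_message(self, user: str, channel: str, text: str) -> dict:
        # 1. ACL: the sender must belong to the channel.
        if user not in self.members.get(channel, set()):
            return {"ok": False, "error": "not_in_channel"}
        # 2. Durable write before ACK (here: append to a list).
        self._seq += 1
        msg = {"channel": channel, "user": user, "text": text,
               "ts": f"{self._seq:010d}.000000"}
        self.store.append(msg)
        # 3. Fan-out: socket event for online members, push job otherwise.
        for member in sorted(self.members[channel] - {user}):
            target = self.socket_events if member in self.online else self.push_jobs
            target.append((member, msg))
        return {"ok": True, "ts": msg["ts"]}

svc = ChatService(members={"C1": {"alice", "bob", "carol"}},
                  online={"alice", "bob"})
resp = svc.post_message("alice", "C1", "Deploying v2.3")
```

In this run bob receives a socket event and carol, who is offline, gets a push job. The sketch skips the sender's own echo, mutes, and the @here policy check.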
Step 6 — Conversations: channels, threads, and DMs
- Channel: many members; history visible per join policy; public vs private discovery rules.
- Thread: child messages share thread_ts parent; reply count badge on parent.
- DM / MPIM: 2–9 users; membership changes rewrite ACL; no @channel storms.
- Shared channels: bridge two workspaces; dual ACL evaluation.
messages(
  team_id,
  channel_id,
  ts,              -- primary key with channel_id
  user_id,
  text,
  blocks_json,
  thread_ts,       -- null if top-level
  client_msg_id,
  deleted
)
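One way to read the column list above is as a relational table. The sketch below uses SQLite purely for illustration; the column types, the sample rows, and the `thread_replies` helper are assumptions, not a statement about any production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        team_id       TEXT NOT NULL,
        channel_id    TEXT NOT NULL,
        ts            TEXT NOT NULL,
        user_id       TEXT NOT NULL,
        text          TEXT,
        blocks_json   TEXT,
        thread_ts     TEXT,                 -- NULL if top-level
        client_msg_id TEXT,
        deleted       INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (channel_id, ts)
    )
""")
rows = [
    ("T1", "C1", "1716123456.000100", "U1", "Deploying v2.3", None),
    ("T1", "C1", "1716123457.000200", "U2", "On it", "1716123456.000100"),
    ("T1", "C1", "1716123458.000300", "U3", "Done", "1716123456.000100"),
]
conn.executemany(
    "INSERT INTO messages (team_id, channel_id, ts, user_id, text, thread_ts) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

def thread_replies(channel_id: str, parent_ts: str) -> list:
    """Return (ts, text) for live replies under one parent, oldest first."""
    cur = conn.execute(
        "SELECT ts, text FROM messages "
        "WHERE channel_id = ? AND thread_ts = ? AND deleted = 0 ORDER BY ts",
        (channel_id, parent_ts),
    )
    return cur.fetchall()
```

Sorting `ts` as text works here only because every value has the same width; a real store would need to guarantee that or use a numeric key.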
Step 7 — Realtime gateway: WebSockets and events
Clients open a WebSocket to a URL issued for their token. In the public API, apps obtain that URL from apps.connections.open (Socket Mode) or from the legacy RTM method rtm.connect.
Server pushes JSON events: message, message_changed, user_typing, reaction_added.
- Sticky routing — connection tied to gateway node; presence in Redis.
- Backpressure — slow clients drop typing first, not acked messages.
- Socket Mode — apps receive events over WebSocket instead of public HTTP callbacks.
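The backpressure rule above (drop typing first, never acked messages) could look like the following bounded queue. This is a sketch under assumptions: `SessionQueue`, `DROPPABLE`, and the eviction order are all hypothetical choices.

```python
from collections import deque

DROPPABLE = {"user_typing"}  # assumed set of best-effort event types

class SessionQueue:
    """Hypothetical bounded outbound queue for one WebSocket session."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.events = deque()

    def enqueue(self, event: dict) -> bool:
        if len(self.events) < self.capacity:
            self.events.append(event)
            return True
        if event["type"] in DROPPABLE:
            return False  # queue full: drop the new best-effort event
        # Queue full and the event is durable: evict the oldest droppable one.
        for i, queued in enumerate(self.events):
            if queued["type"] in DROPPABLE:
                del self.events[i]
                self.events.append(event)
                return True
        # Only durable events are queued; the caller should disconnect the
        # session so the client reconnects and gap-fills from history.
        return False
```

Returning `False` for a durable event is a signal to shed the connection rather than silently lose a message the server already acknowledged.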
Step 8 — Storage, ordering, and deduplication
Within a channel, ts is the sort key: conversations.history pages through messages newest first.
client_msg_id unique per user prevents double-post on retry.
Edits create message_changed with new blocks; deletes set deleted tombstone.
Sharding: by team_id or channel_id to keep hot channels isolated.
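A minimal sketch of the three ideas in this step: a monotonic `ts` with a tie-break, retry dedupe on `(user_id, client_msg_id)`, and hash-based shard routing. The class, the microsecond clock input, and the SHA-256 routing are assumptions chosen for illustration.

```python
import hashlib

class ConversationLog:
    """Hypothetical per-conversation log with ordered ts and retry dedupe."""

    def __init__(self):
        self.last_micros = 0
        self.by_client_id = {}   # (user_id, client_msg_id) -> ts
        self.rows = []

    def allocate_ts(self, now_micros: int) -> str:
        # Tie-break: never reuse or go below the last issued value.
        self.last_micros = max(now_micros, self.last_micros + 1)
        seconds, micros = divmod(self.last_micros, 1_000_000)
        return f"{seconds}.{micros:06d}"

    def post(self, user_id: str, client_msg_id: str, text: str, now_micros: int) -> str:
        key = (user_id, client_msg_id)
        if key in self.by_client_id:
            return self.by_client_id[key]   # retry: return the original ts
        ts = self.allocate_ts(now_micros)
        self.by_client_id[key] = ts
        self.rows.append((ts, user_id, text))
        return ts

def shard_for(channel_id: str, shard_count: int) -> int:
    """Stable shard index for a channel (assumed hash-mod routing)."""
    digest = hashlib.sha256(channel_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Two posts arriving in the same microsecond get consecutive `ts` values, and a retried post returns the first `ts` without adding a row. Hash-mod routing is the simplest option; it does not by itself isolate a hot channel.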
Step 9 — Fan-out, mentions, and notifications
| Trigger | Who gets notified | Guardrail |
|---|---|---|
| @user | That user (if in channel) | Respect DND schedule |
| @channel | All members | Confirm prompt in large channels |
| @here | Active members only | Rate limit per sender |
| Thread reply | Thread participants + channel watchers | Configurable |
Mobile push goes through platform gateways; desktop may use OS notifications while the app is backgrounded. Badge counts sync from an unread-state service (aggregated per channel and DM).
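The trigger table above reduces to set arithmetic over channel membership. The function below is a sketch; its name and parameters are illustrative, and the confirm prompt and per-sender rate limit are left out.

```python
def notify_targets(mention, channel_members, active, dnd, sender, mentioned_user=None):
    """Return the sorted user ids to notify for one mention.

    mention: "user", "channel", or "here" (assumed labels for the table rows).
    channel_members, active, dnd: sets of user ids.
    """
    if mention == "user":
        # Only notify the mentioned user if they are in the channel.
        candidates = {mentioned_user} & channel_members
    elif mention == "channel":
        candidates = set(channel_members)
    elif mention == "here":
        candidates = channel_members & active
    else:
        raise ValueError(f"unknown mention type: {mention}")
    # Respect do-not-disturb and never notify the sender about their own post.
    return sorted(candidates - dnd - {sender})
```

For example, with members `{a, b, c, d}`, active `{a, b}`, and `c` in DND, an @here from `a` notifies only `b`, while @channel notifies `b` and `d`.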
Step 10 — Search and compliance export
Search index stores message text, file names, extracted PDF text.
Queries scoped to team_id and visible channels for requesting user.
Enterprise — legal hold prevents hard delete; eDiscovery export to bulk files.
Index lag of 1–60s is normal; show “still indexing” for very new messages if needed.
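Scoping every query to `team_id` and the caller's visible channels can be illustrated with a tiny inverted index. `TinySearchIndex` is a hypothetical stand-in for a real search cluster; it tokenizes on whitespace only.

```python
from collections import defaultdict

class TinySearchIndex:
    """Hypothetical in-memory inverted index with ACL-scoped queries."""

    def __init__(self):
        self.postings = defaultdict(set)   # (team_id, term) -> {(channel_id, ts)}

    def index(self, team_id: str, channel_id: str, ts: str, text: str) -> None:
        for term in text.lower().split():
            self.postings[(team_id, term)].add((channel_id, ts))

    def search(self, team_id: str, visible_channels: set, query: str) -> list:
        terms = query.lower().split()
        if not terms:
            return []
        # AND semantics: a document must contain every query term.
        hits = set.intersection(
            *(self.postings.get((team_id, t), set()) for t in terms)
        )
        # ACL filter: only channels the requesting user can see.
        return sorted(doc for doc in hits if doc[0] in visible_channels)
```

Keying postings by `team_id` keeps one workspace's terms out of another's results, and the channel filter is applied on every query rather than trusted to the client.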
Step 11 — Files, link unfurls, and Block Kit
files.upload → object storage + virus scan → share to channel with file_id.
Unfurl — when URL detected, async fetch OpenGraph; cache preview; respect robots.txt and SSRF guards.
Block Kit — JSON layout for rich messages; apps render buttons that POST to interaction endpoints.
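The SSRF guard mentioned for unfurls can be sketched as a check on the address a URL resolves to. The function name and the `resolved_ip` parameter are assumptions: the caller is expected to resolve DNS itself and then connect to that same address, so a second lookup cannot return something different.

```python
import ipaddress
from urllib.parse import urlsplit

def is_safe_unfurl_target(url: str, resolved_ip: str) -> bool:
    """Allow only http(s) URLs whose resolved address is publicly routable."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    ip = ipaddress.ip_address(resolved_ip)
    # is_global is False for loopback, private, and link-local ranges,
    # which covers typical internal services and cloud metadata endpoints.
    return ip.is_global
```

This covers only the address check; redirects need the same test on every hop, and robots.txt handling is separate.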
Step 12 — Permissions, Enterprise Grid, and audit
- Roles: Owner, Admin, Member, Guest (single/multi-channel).
- Scopes: OAuth bot tokens limited to chat:write, channels:read, etc.
- Grid: org of many workspaces; shared channels; centralized admin.
- Audit logs: admin actions, app installs, export events to SIEM.
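Scope enforcement at the edge amounts to a lookup before the request reaches the chat service. The mapping below is deliberately partial and simplified, and the `authorize` helper is hypothetical.

```python
# Simplified, partial mapping of Web API methods to a required scope.
REQUIRED_SCOPE = {
    "chat.postMessage": "chat:write",
    "conversations.history": "channels:history",  # public channels only
    "reactions.add": "reactions:write",
}

def authorize(method: str, granted_scopes: set) -> bool:
    """Return True only if the token holds the scope the method needs."""
    required = REQUIRED_SCOPE.get(method)
    # Methods missing from the table are denied by default.
    return required is not None and required in granted_scopes
```

A bot granted only `chat:write` can post but cannot read history, which is the isolation the Scopes bullet describes.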
Step 13 — Apps: Events API, interactivity, workflows
Events API: HTTP POST to your server; verify each request with the signing secret.
Interactivity: a button click sends a payload to your endpoint, which must acknowledge within 3 seconds; use response_url for async updates.
Workflow Builder: no-code triggers; still backed by the same event bus.
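import hashlib
import hmac

# Verify Slack signature (conceptual): inputs come from the request headers, raw body, and app config
basestring = f"v0:{timestamp}:{raw_body}".encode()
expected = "v0=" + hmac.new(signing_secret.encode(), basestring, hashlib.sha256).hexdigest()
is_valid = hmac.compare_digest(expected, header_signature)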
Step 14 — Presence, typing, and huddles (adjacent)
- Presence: active/away/auto; ephemeral in Redis; not durable history.
- Typing: fire-and-forget events; drop under load.
- Huddles/calls: separate media SFU path; signaling may start in chat, but bytes do not traverse the message store.
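Ephemeral presence is essentially a key with a time-to-live that heartbeats refresh. The class below is an in-memory stand-in for a TTL store such as Redis; the name, the TTL value, and the explicit `now` argument are assumptions made so the sketch is easy to test.

```python
class PresenceStore:
    """Hypothetical in-memory presence table with expiring entries."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.expires_at = {}   # user_id -> time at which presence lapses

    def heartbeat(self, user_id: str, now: float) -> None:
        # Each heartbeat from a connected client extends the expiry.
        self.expires_at[user_id] = now + self.ttl

    def status(self, user_id: str, now: float) -> str:
        return "active" if self.expires_at.get(user_id, 0) > now else "away"
```

Nothing is written to the message store, so losing this state on restart costs only a few seconds of stale presence.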
Step 15 — Technical layer: Web API methods
| Method | Purpose |
|---|---|
| chat.postMessage | Send message to channel/DM |
| conversations.history | Paginate messages (cursor, limit) |
| conversations.replies | Thread messages |
| search.messages | Query index (query, filters) |
| users.conversations | List channels/DMs for user |
| reactions.add | Emoji reaction on ts |
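Cursor pagination for conversations.history can be sketched as a generator. `fetch_page` is a hypothetical callable that performs the HTTP request and returns the parsed JSON body; the sketch assumes the response carries `messages` and `response_metadata.next_cursor`.

```python
def iter_history(fetch_page, channel: str, limit: int = 200):
    """Yield messages across pages until the server stops returning a cursor."""
    cursor = None
    while True:
        params = {"channel": channel, "limit": limit}
        if cursor:
            params["cursor"] = cursor
        page = fetch_page(params)
        if not page.get("ok"):
            raise RuntimeError(page.get("error", "unknown_error"))
        yield from page.get("messages", [])
        cursor = page.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            break
```

Injecting `fetch_page` keeps the loop independent of any HTTP client, so the same function serves a reconnecting client doing gap fill and a batch export job.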
Post message (illustrative HTTP):
POST https://slack.com/api/chat.postMessage
Authorization: Bearer xoxb-…
Content-Type: application/json
{
"channel": "C01234567",
"text": "Deploying v2.3",
"blocks": [ … ],
"client_msg_id": "uuid-4f2a-…"
}
→ 200
{ "ok": true, "channel": "C01234567", "ts": "1716123456.789012", "message": { … } }
Event envelope (simplified; this is the Events API shape, which Socket Mode delivers wrapped in an outer envelope):
{
"type": "event_callback",
"event": {
"type": "message",
"channel": "C01234567",
"user": "U09…",
"text": "Deploying v2.3",
"ts": "1716123456.789012"
}
}
Step 16 — Reliability, observability, and failure modes
- Duplicate posts: enforce client_msg_id; return original ts on retry.
- Gateway partition: client reconnect + conversations.history gap fill since last ts.
- @channel abuse: rate limits, admin audit, disable in mega-channels.
- Indexer lag: monitor Kafka consumer lag; search SLA separate from chat SLA.
- Webhook retries: apps must dedupe by event_id.
Metrics: post p95, WS connect success, fan-out queue depth, push delivery rate, search lag, API 429 rate.
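On the app side, deduping retried webhooks by `event_id` needs only a bounded set of recently seen ids. The class below is a sketch; its name and the eviction limit are assumptions, and a multi-process app would keep this state in a shared store.

```python
from collections import OrderedDict

class EventDeduper:
    """Hypothetical bounded memory of recently seen event ids."""

    def __init__(self, max_entries: int = 10_000):
        self.seen = OrderedDict()
        self.max_entries = max_entries

    def first_delivery(self, event_id: str) -> bool:
        """Return True the first time an id is seen, False on a retry."""
        if event_id in self.seen:
            return False
        self.seen[event_id] = True
        if len(self.seen) > self.max_entries:
            self.seen.popitem(last=False)   # evict the oldest id
        return True
```

A handler would call `first_delivery` before doing any side effect and still return a success response for duplicates, so the sender stops retrying.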
Step 17 — Goals → knobs (quick reference)
| Goal | Knob |
|---|---|
| Messages feel instant | Write-then-push, regional gateways, lean event payloads |
| Find anything later | Search cluster capacity, file text extraction, retention policy |
| Safe enterprise | SSO, SCIM, Grid isolation, audit exports, DLP scanning |
| Survive #general | Mention gates, fan-out batching, optional slow mode |
| Rich integrations | Scoped bots, signed webhooks, Socket Mode for firewalled apps |
Step 18 — Close the loop (what to practice)
On a whiteboard: three loops, one chat.postMessage, show WebSocket fan-out vs search indexer.
Out loud: difference between channel message and thread reply; how @here differs from @channel.
With the technical section: trace post → ts → gateway event → search.messages minutes later.
The one line to remember
Slack-class systems are durable chat logs per conversation plus a realtime event fan-out layer and a search pipeline playing catch-up. Get ordering and ACL right on write; push fast to sockets; let search and unfurls be asynchronous.