How Slack works at scale

Team chat looks like typing in a box. At scale it is a workspace-scoped event system: messages must land in the right channel or DM, show up in milliseconds for online teammates, stay searchable for years, and respect enterprise permissions—while bots and integrations fire in parallel.

We work through the design in order—requirements first, numbers second, architecture third, APIs last—using a Slack-class product as the mental model, not any one company’s private implementation.

What you should be able to do after reading:

Separate the three loops—messaging, search, and workspace/access—and map services to each.
List functional and non-functional requirements for members, admins, and apps.
Walk one message: post → persist → fan-out → WebSocket push → mobile notification.
Explain channels vs threads vs DMs, client_msg_id dedupe, and why search is async.
Read the technical section: Web API methods, Events API, and Socket Mode payloads.

Step 0 — How we will work through the problem

Ordered thinking beats memorizing a purple screenshot. Use this sequence when you design team messaging:

Clarify scope. Chat only, or calls/huddles? Enterprise Grid? Slack Connect shared channels? E2EE?
Write requirements. Functional = post, thread, notify. Non-functional = delivery latency, retention, compliance export.
Do napkin math. Messages per workspace per day, WebSocket connections, index growth.
Draw three loops before naming Vitess or Elasticsearch.
Tell one story—@mention in #engineering—then edit, delete, and offline mobile push.

flowchart LR
  subgraph msg [Messaging loop]
    POST[chat.postMessage] --> STORE[(Message store)]
    STORE --> FAN[Fan-out]
    FAN --> WS[WebSocket gateway]
  end
  subgraph search [Search loop]
    STORE --> IDX[Index pipeline]
    IDX --> ES[(Search cluster)]
  end
  subgraph access [Workspace loop]
    ACL[Permissions] --> POST
    ACL --> WS
    APP[Apps / bots] --> POST
  end

Step 1 — Functional requirements (members, admins, apps)

Actor	Requirement	Why scale makes it hard
Member	Channels (public/private), DMs, group DMs	Membership lists drive fan-out
Member	Threads, reactions, emoji, formatting (Block Kit)	Extra rows + render graph per message
Member	@user, @channel, @here mentions	Notification storms in large channels
Member	Edit/delete within policy window	Tombstones + index updates
Member	Files, snippets, link unfurls	Async preview fetchers; virus scan
Member	Search messages and files	Inverted index lag behind write
Admin	Roles, retention, legal hold, export	Compliance vs delete-for-everyone
Admin	SSO, SCIM provisioning, audit logs	Enterprise Grid federation
App	Bots, slash commands, interactive buttons	Outbound webhooks with retries
External	Slack Connect / shared channels	Cross-org ACL and data residency

Functional details worth stating clearly

Conversation id is the shard key. Channel C123, DM D456, thread parent ts—ordering is per conversation.

Events vs REST. Clients mutate via Web API; realtime updates arrive as typed events over WebSocket.

Out of scope today (say it aloud). Building Zoom from scratch inside chat, or full E2EE for all workspaces—state if excluded.

Step 2 — Non-functional requirements (engineering promises)

Category	Target (typical)	How we meet it	If we miss it
Latency — message visible	p95 < 500 ms online	WebSocket push after durable write	“Slack feels slow”
Latency — search	Seconds acceptable	Async indexer; near-real-time	Users cannot find decisions
Durability	No lost acknowledged posts	Write-before-ACK; replicated DB	Lost incident timelines
Order	Per-conversation total order	`ts` timestamps with tie-break	Scrambled thread reads
Availability	99.9%+ chat monthly	Regional gateways, DB failover	Work stops
Scale — large channel	#general with 10k members	Mention limits, fan-out batching, read-only modes	Notification + WS meltdown
Security	Token scopes, workspace isolation	OAuth bot scopes; Grid boundaries	Data leak across teams

Key idea: Realtime delivery and searchable history pull in opposite directions—optimize the hot path for WebSockets; accept seconds of search lag with clear UX.

Step 3 — Napkin math (messages, sockets, and index)

Roughly 750k or more organizations (order of magnitude) use Slack-scale products; many workspaces have 50–5,000 users.
Active user sends 30–80 messages/day in heavy teams → millions of messages/hour globally.
Each message ~0.5–2 KB metadata + text; files are MB objects referenced by id.
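WebSockets: 500k concurrent connections need hundreds of gateway nodes if each node holds only a few thousand sockets, or tens of nodes at tens of thousands of sockets per node; either way, use sticky routing.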
Search index size ≈ message text + extracted file content—often comparable to primary store over years.
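The estimates above can be reproduced with a few lines of arithmetic. This is a sketch only: the daily-active-user count and the sockets-per-gateway figure are assumed round numbers chosen for illustration, not figures from any real deployment.

```python
# Napkin math for Step 3. Every input below is an assumed round number.
DAILY_ACTIVE_USERS = 10_000_000      # assumption, not from the text above
MSGS_PER_USER_PER_DAY = 50           # inside the 30-80 range above
AVG_MESSAGE_BYTES = 1_000            # inside the 0.5-2 KB range above
CONCURRENT_SOCKETS = 500_000         # figure used above
SOCKETS_PER_GATEWAY = 2_500          # assumption; depends on memory and fan-out load


def estimate() -> dict:
    """Return rough message, storage, and gateway-node estimates."""
    msgs_per_day = DAILY_ACTIVE_USERS * MSGS_PER_USER_PER_DAY
    return {
        "messages_per_hour": msgs_per_day // 24,
        "avg_messages_per_second": msgs_per_day // 86_400,
        "new_text_bytes_per_day": msgs_per_day * AVG_MESSAGE_BYTES,
        # Ceiling division: a partly filled node is still a node.
        "gateway_nodes": -(-CONCURRENT_SOCKETS // SOCKETS_PER_GATEWAY),
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,}")
```

Changing `SOCKETS_PER_GATEWAY` is the quickest way to see how sensitive the node count is to per-connection cost; peak traffic will also sit well above the daily average computed here.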

Step 4 — Architecture: three loops

Edge API authenticates OAuth tokens. Chat service validates ACL, assigns ts, writes message row. Fan-out service determines recipients (channel members minus mutes), pushes to gateway sessions and notification queue. Search pipeline consumes events into Elasticsearch/OpenSearch. File service handles uploads to object storage.

flowchart TB
  subgraph clients [Clients]
    DESK[Desktop / web]
    MOB[Mobile]
  end
  subgraph edge [Edge]
    LB[Load balancer]
    API[Web API]
    GW[Realtime gateway]
  end
  subgraph core [Core]
    CHAT[Chat / messages]
    GRAPH[Membership]
    NOTIF[Notifications]
    FILES[Files]
  end
  subgraph data [Data]
    DB[(Sharded SQL / NoSQL)]
    R[("Redis presence")]
    OBJ[(Object store)]
    SRCH[(Search index)]
  end
  DESK --> LB
  MOB --> LB
  LB --> API --> CHAT
  CHAT --> DB
  CHAT --> GRAPH
  CHAT --> GW
  CHAT --> NOTIF
  API --> FILES --> OBJ
  CHAT -.->|message.created| SRCH
  GW --> R

Step 5 — Walk one message end to end

User posts “Deploying v2.3” in #releases with @here.

Client generates client_msg_id; calls chat.postMessage with channel, blocks, text.
API checks bot/user token scopes, channel membership, posting permissions.
Chat service allocates ts (a sortable string with microsecond-style precision), inserts the row, and returns ok: true.
Fan-out loads channel member ids; filters muted/absent; enqueues gateway pushes and mobile push jobs.
Gateway emits message event to online sessions subscribed to workspace + channel.
Indexer async indexes text for search; link unfurl worker fetches OG tags in background.
@here — notification service applies policy (only active members? rate limit?) before APNS/FCM.

sequenceDiagram
  participant C as Client
  participant A as Web API
  participant M as Chat service
  participant G as Gateway
  participant N as Notify
  C->>A: chat.postMessage
  A->>M: ACL + insert
  M-->>A: ts + ok
  A-->>C: 200 response
  M->>G: message event
  M->>N: push jobs
  G-->>C: WebSocket message

Step 6 — Conversations: channels, threads, and DMs

Channel — many members; history visible per join policy; public vs private discovery rules.
Thread — child messages share thread_ts parent; reply count badge on parent.
DM / MPIM — 2–9 users; membership changes rewrite ACL; no @channel storms.
Shared channels — bridge two workspaces; dual ACL evaluation.

messages(
  team_id,
  channel_id,
  ts,              -- primary key with channel_id
  user_id,
  text,
  blocks_json,
  thread_ts,       -- null if top-level
  client_msg_id,
  deleted
)
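To make the schema above concrete, here is a minimal sketch using SQLite from the Python standard library. The column types, the `T1` team id, and the helper names `open_db`, `add`, and `thread` are assumptions for illustration; the source only lists column names.

```python
import sqlite3

SCHEMA = """
CREATE TABLE messages (
    team_id       TEXT NOT NULL,
    channel_id    TEXT NOT NULL,
    ts            TEXT NOT NULL,
    user_id       TEXT NOT NULL,
    text          TEXT,
    blocks_json   TEXT,
    thread_ts     TEXT,              -- NULL for top-level messages
    client_msg_id TEXT,
    deleted       INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (channel_id, ts)
);
"""


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def add(db, channel_id, ts, user_id, text, thread_ts=None):
    db.execute(
        "INSERT INTO messages (team_id, channel_id, ts, user_id, text, thread_ts) "
        "VALUES ('T1', ?, ?, ?, ?, ?)",
        (channel_id, ts, user_id, text, thread_ts),
    )


def thread(db, channel_id, parent_ts):
    """Parent plus replies, oldest first, skipping tombstoned rows."""
    rows = db.execute(
        "SELECT ts, text FROM messages "
        "WHERE channel_id = ? AND (ts = ? OR thread_ts = ?) AND deleted = 0 "
        "ORDER BY ts",   # text sort works because every ts has the same width
        (channel_id, parent_ts, parent_ts),
    )
    return rows.fetchall()
```

Keying on (channel_id, ts) keeps a conversation's rows adjacent, so both history pages and thread reads stay range scans within one conversation.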

Step 7 — Realtime gateway: WebSockets and events

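Desktop and web clients open a WebSocket to the realtime gateway using a token-authenticated URL (historically obtained from the legacy RTM API); Socket Mode apps obtain theirs from apps.connections.open. The server pushes JSON events: message, message_changed, user_typing, reaction_added.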

Sticky routing — connection tied to gateway node; presence in Redis.
Backpressure — slow clients drop typing first, not acked messages.
Socket Mode — apps receive events over WebSocket instead of public HTTP callbacks.
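The backpressure rule above (shed typing before messages) can be sketched as a bounded per-connection queue. `SessionQueue`, `DROPPABLE`, and the choice of which event types count as lossy are hypothetical; this shows one possible policy, not a description of any real gateway.

```python
from collections import deque

DROPPABLE = {"user_typing", "presence_change"}   # assumed set of lossy event types


class SessionQueue:
    """Bounded per-connection send queue that sheds lossy events first."""

    def __init__(self, max_len: int):
        self.max_len = max_len
        self.events = deque()

    def offer(self, event: dict) -> bool:
        """Queue an event; return False if it was dropped."""
        if len(self.events) < self.max_len:
            self.events.append(event)
            return True
        if event["type"] in DROPPABLE:
            return False                  # queue is full: drop the new lossy event
        # Durable event: evict the oldest lossy event to make room, if any.
        for i, queued in enumerate(self.events):
            if queued["type"] in DROPPABLE:
                del self.events[i]
                self.events.append(event)
                return True
        # Only durable events are queued. A real gateway would likely close the
        # connection here and let the client gap-fill from history on reconnect.
        return False
```

Dropping a typing indicator is invisible to users; dropping a message is not, which is why the final branch defers to reconnect and history rather than silently discarding.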

Step 8 — Storage, ordering, and deduplication

Within a channel, ts gives conversations.history a total order, returned newest first. A client_msg_id that is unique per user prevents double-posting on retry. Edits emit message_changed with the new blocks; deletes set the deleted tombstone.

Sharding: by team_id or channel_id to keep hot channels isolated.
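A minimal sketch of the ordering and dedupe rules in this step, assuming a single writer per conversation. The `Conversation` class and its injectable `clock` parameter are hypothetical names; the tie-break shown (bump by one microsecond) is one simple option among several.

```python
import time


class Conversation:
    """Per-conversation ts allocator plus client_msg_id dedupe (in-memory sketch)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._last_micros = 0
        self._by_client_id = {}          # (user, client_msg_id) -> ts
        self.messages = {}               # ts -> message row

    def _next_ts(self) -> str:
        micros = int(self._clock() * 1_000_000)
        # Tie-break: never reuse a value or go backwards, even if the clock does.
        micros = max(micros, self._last_micros + 1)
        self._last_micros = micros
        return f"{micros // 1_000_000}.{micros % 1_000_000:06d}"

    def post(self, user: str, client_msg_id: str, text: str) -> str:
        key = (user, client_msg_id)
        if key in self._by_client_id:
            return self._by_client_id[key]       # retry: return the original ts
        ts = self._next_ts()
        self.messages[ts] = {"user": user, "text": text, "deleted": False}
        self._by_client_id[key] = ts
        return ts

    def delete(self, ts: str) -> None:
        self.messages[ts]["deleted"] = True      # tombstone, not a hard delete
```

Because ts is fixed-width, string comparison matches numeric order, which is what lets history queries sort on it directly.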

Step 9 — Fan-out, mentions, and notifications

Trigger	Who gets notified	Guardrail
@user	That user (if in channel)	Respect DND schedule
@channel	All members	Confirm prompt in large channels
@here	Active members only	Rate limit per sender
Thread reply	Thread participants + channel watchers	Configurable

Mobile push goes through platform gateways (APNS/FCM); desktop may use OS notifications while the app is backgrounded. Badge counts sync from an unread-state service that aggregates per channel and DM.
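The trigger table in this step can be expressed as a single recipient-selection function. This is a sketch under stated assumptions: the `Member` fields, the `LARGE_CHANNEL` threshold, and the rule that a direct @user mention bypasses channel mute are all illustrative choices, not documented behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    user_id: str
    active: bool = False     # currently online
    dnd: bool = False        # inside a do-not-disturb window
    muted: bool = False      # has muted this channel


LARGE_CHANNEL = 1_000        # assumed threshold for requiring confirmation


def recipients(trigger: str, sender: str, members: List[Member],
               target: Optional[str] = None, confirmed: bool = False) -> List[str]:
    """Return the user ids to notify for one mention trigger."""
    if trigger == "@channel" and len(members) >= LARGE_CHANNEL and not confirmed:
        raise PermissionError("@channel in a large channel needs confirmation")
    picked = []
    for m in members:
        if m.user_id == sender or m.dnd:
            continue                               # never notify sender or DND users
        if trigger == "@user" and m.user_id == target:
            picked.append(m.user_id)               # assumed: direct mention ignores mute
        elif trigger == "@channel" and not m.muted:
            picked.append(m.user_id)
        elif trigger == "@here" and m.active and not m.muted:
            picked.append(m.user_id)
    return picked
```

Per-sender rate limiting would sit in front of this function; it is omitted here to keep the selection logic readable.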

Step 10 — Search and compliance export

Search index stores message text, file names, extracted PDF text. Queries scoped to team_id and visible channels for requesting user. Enterprise — legal hold prevents hard delete; eDiscovery export to bulk files. Index lag of 1–60s is normal; show “still indexing” for very new messages if needed.
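The ACL scoping described above can be illustrated with a toy inverted index. `SearchIndex` is a hypothetical in-memory class; a production system would apply the same team and channel filter inside the search cluster's query rather than in application code.

```python
from collections import defaultdict


class SearchIndex:
    """Toy inverted index: token -> set of (team_id, channel_id, ts) keys."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, team_id: str, channel_id: str, ts: str, text: str) -> None:
        key = (team_id, channel_id, ts)
        for token in text.lower().split():
            self.postings[token].add(key)

    def search(self, query: str, team_id: str, visible_channels: set) -> list:
        tokens = query.lower().split()
        if not tokens:
            return []
        # AND semantics: a hit must contain every query token.
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        # ACL filter: same team, and only channels the requester can see.
        allowed = [k for k in hits if k[0] == team_id and k[1] in visible_channels]
        return sorted(allowed, key=lambda k: k[2], reverse=True)   # newest first
```

The caller must supply `visible_channels` from the membership service at query time, so a user who leaves a private channel stops seeing its messages without any reindexing.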

Step 11 — Files, link unfurls, and Block Kit

files.upload → object storage + virus scan → share to channel with file_id. Unfurl — when URL detected, async fetch OpenGraph; cache preview; respect robots.txt and SSRF guards. Block Kit — JSON layout for rich messages; apps render buttons that POST to interaction endpoints.
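The SSRF guard mentioned for unfurls can be sketched with the standard library. `is_safe_unfurl_target` and its injectable `resolve` parameter are hypothetical names; this checks only the pre-fetch step and assumes the fetcher enforces the rest.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_safe_unfurl_target(url: str, resolve=socket.getaddrinfo) -> bool:
    """Return True only for http(s) URLs whose host resolves to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    default_port = 443 if parts.scheme == "https" else 80
    try:
        infos = resolve(parts.hostname, parts.port or default_port)
    except OSError:
        return False                       # unresolvable hosts are not fetched
    for info in infos:
        address = ipaddress.ip_address(info[4][0])
        if not address.is_global:
            return False                   # loopback, private, link-local, reserved
    return bool(infos)
```

A check like this is necessary but not sufficient: the fetcher should also connect to the address it just vetted (to resist DNS rebinding) and re-run the check on every redirect hop.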

Step 12 — Permissions, Enterprise Grid, and audit

Roles — Owner, Admin, Member, Guest (single/multi-channel).
Scopes — OAuth bot tokens limited to chat:write, channels:read, etc.
Grid — org of many workspaces; shared channels; centralized admin.
Audit logs — admin actions, app installs, export events to SIEM.

Step 13 — Apps: Events API, interactivity, workflows

Events API: HTTP POST to your server, verified with the signing secret.
Interactivity: button clicks deliver a payload that your server must acknowledge within 3 seconds; use response_url for async updates.
Workflow Builder: no-code triggers, still backed by the same event bus.

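# Verify Slack signature (runnable Python sketch)
import hashlib
import hmac

def verify(signing_secret: str, timestamp: str, raw_body: str, header_signature: str) -> bool:
    basestring = f"v0:{timestamp}:{raw_body}".encode()
    digest = hmac.new(signing_secret.encode(), basestring, hashlib.sha256).hexdigest()
    return hmac.compare_digest("v0=" + digest, header_signature)  # constant-time compare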

Step 14 — Presence, typing, and huddles (adjacent)

Presence — active/away/auto; ephemeral in Redis; not durable history. Typing — fire-and-forget events; drop under load. Huddles/calls — separate media SFU path; signaling may start in chat but bytes do not traverse message store.
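Ephemeral presence is essentially a key with a time-to-live. The sketch below mimics that in memory; `PresenceStore`, its 60-second default TTL, and the injectable `clock` are assumptions for illustration, standing in for an expiring key in a store such as Redis.

```python
import time


class PresenceStore:
    """Ephemeral presence with a TTL, mimicking one expiring key per user."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires_at = {}            # user_id -> expiry time

    def heartbeat(self, user_id: str) -> None:
        """Called whenever a gateway session reports activity."""
        self._expires_at[user_id] = self.clock() + self.ttl

    def status(self, user_id: str) -> str:
        expires = self._expires_at.get(user_id)
        if expires is None or expires <= self.clock():
            self._expires_at.pop(user_id, None)   # lazily expire stale entries
            return "away"
        return "active"
```

Nothing here is written to the message store: if the presence tier is lost, users briefly show as away and recover on the next heartbeat.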

Step 15 — Technical layer: Web API methods

Method	Purpose
`chat.postMessage`	Send message to channel/DM
`conversations.history`	Paginate messages (`cursor`, `limit`)
`conversations.replies`	Thread messages
`search.messages`	Query index (`query`, filters)
`users.conversations`	List channels/DMs for user
`reactions.add`	Emoji reaction on `ts`

Post message (illustrative HTTP):

POST https://slack.com/api/chat.postMessage
Authorization: Bearer xoxb-…
Content-Type: application/json

{
  "channel": "C01234567",
  "text": "Deploying v2.3",
  "blocks": [ … ],
  "client_msg_id": "uuid-4f2a-…"
}

→ 200
{ "ok": true, "channel": "C01234567", "ts": "1716123456.789012", "message": { … } }

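Event envelope (simplified; event_callback is the Events API shape, which Socket Mode delivers inside an outer envelope carrying an envelope_id):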

{
  "type": "event_callback",
  "event": {
    "type": "message",
    "channel": "C01234567",
    "user": "U09…",
    "text": "Deploying v2.3",
    "ts": "1716123456.789012"
  }
}
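Cursor pagination for conversations.history, listed in the method table above, can be sketched as a generator. The `call` parameter is a hypothetical function that performs one Web API request and returns the decoded JSON; the loop relies on the `response_metadata.next_cursor` field that Slack's cursor-paginated methods return.

```python
from typing import Callable, Iterator


def iter_history(call: Callable[..., dict], channel: str,
                 limit: int = 200) -> Iterator[dict]:
    """Yield every message in a channel by following next_cursor until it is empty."""
    cursor = ""
    while True:
        params = {"channel": channel, "limit": limit}
        if cursor:
            params["cursor"] = cursor
        page = call("conversations.history", **params)
        if not page.get("ok"):
            raise RuntimeError(page.get("error", "unknown_error"))
        yield from page.get("messages", [])
        cursor = page.get("response_metadata", {}).get("next_cursor", "")
        if not cursor:
            return                         # an empty cursor means the last page
```

Treating the cursor as opaque matters: clients should pass it back unchanged rather than trying to compute offsets from it.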

Step 16 — Reliability, observability, and failure modes

Duplicate posts — enforce client_msg_id; return original ts on retry.
Gateway partition — client reconnect + conversations.history gap fill since last ts.
@channel abuse — rate limits, admin audit, disable in mega-channels.
Indexer lag — monitor Kafka consumer lag; search SLA separate from chat SLA.
Webhook retries — apps must dedupe by event_id.

Metrics: post p95, WS connect success, fan-out queue depth, push delivery rate, search lag, API 429 rate.
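Deduping retried deliveries by event_id, as listed in the failure modes above, needs only a bounded "recently seen" set. `EventDeduper` and its capacity are hypothetical; a multi-instance app would keep this state in a shared store instead of process memory.

```python
from collections import OrderedDict


class EventDeduper:
    """Remember recently seen event_ids so retried deliveries are handled once."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._seen = OrderedDict()       # event_id -> None, oldest first

    def first_time(self, event_id: str) -> bool:
        """Return True the first time an id is seen, False on a repeat."""
        if event_id in self._seen:
            self._seen.move_to_end(event_id)     # keep hot ids from being evicted
            return False
        self._seen[event_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)       # evict the oldest id
        return True
```

A handler would call `first_time(event_id)` before doing any side effect and still return a success response for repeats, so the sender stops retrying.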

Step 17 — Goals → knobs (quick reference)

Goal	Knob
Messages feel instant	Write-then-push, regional gateways, lean event payloads
Find anything later	Search cluster capacity, file text extraction, retention policy
Safe enterprise	SSO, SCIM, Grid isolation, audit exports, DLP scanning
Survive #general	Mention gates, fan-out batching, optional slow mode
Rich integrations	Scoped bots, signed webhooks, Socket Mode for firewalled apps

Step 18 — Close the loop (what to practice)

On a whiteboard: three loops, one chat.postMessage, show WebSocket fan-out vs search indexer.

Out loud: difference between channel message and thread reply; how @here differs from @channel.

With the technical section: trace post → ts → gateway event → search.messages once the indexer has caught up.

The one line to remember

Slack-class systems are durable chat logs per conversation plus a realtime event fan-out layer and a search pipeline playing catch-up. Get ordering and ACL right on write; push fast to sockets; let search and unfurls be asynchronous.