sharpbyte.dev

How Slack works at scale

Team chat looks like typing in a box. At scale it is a workspace-scoped event system: messages must land in the right channel or DM, show up in milliseconds for online teammates, stay searchable for years, and respect enterprise permissions—while bots and integrations fire in parallel.

We work through the design in order—requirements first, numbers second, architecture third, APIs last—using a Slack-class product as the mental model, not any one company’s private implementation.

What you should be able to do after reading:

Step 0 — How we will work through the problem

Ordered thinking beats memorizing a purple screenshot. Use this sequence when you design team messaging:

  1. Clarify scope. Chat only, or calls/huddles? Enterprise Grid? Slack Connect shared channels? E2EE?
  2. Write requirements. Functional = post, thread, notify. Non-functional = delivery latency, retention, compliance export.
  3. Do napkin math. Messages per workspace per day, WebSocket connections, index growth.
  4. Draw three loops before naming Vitess or Elasticsearch.
  5. Tell one story—@mention in #engineering—then edit, delete, and offline mobile push.
flowchart LR
  subgraph msg [Messaging loop]
    POST[chat.postMessage] --> STORE[(Message store)]
    STORE --> FAN[Fan-out]
    FAN --> WS[WebSocket gateway]
  end
  subgraph search [Search loop]
    STORE --> IDX[Index pipeline]
    IDX --> ES[(Search cluster)]
  end
  subgraph access [Workspace loop]
    ACL[Permissions] --> POST
    ACL --> WS
    APP[Apps / bots] --> POST
  end
    

Step 1 — Functional requirements (members, admins, apps)

| Actor | Requirement | Why scale makes it hard |
| --- | --- | --- |
| Member | Channels (public/private), DMs, group DMs | Membership lists drive fan-out |
| Member | Threads, reactions, emoji, formatting (Block Kit) | Extra rows + render graph per message |
| Member | @user, @channel, @here mentions | Notification storms in large channels |
| Member | Edit/delete within policy window | Tombstones + index updates |
| Member | Files, snippets, link unfurls | Async preview fetchers; virus scan |
| Member | Search messages and files | Inverted index lags behind writes |
| Admin | Roles, retention, legal hold, export | Compliance vs delete-for-everyone |
| Admin | SSO, SCIM provisioning, audit logs | Enterprise Grid federation |
| App | Bots, slash commands, interactive buttons | Outbound webhooks with retries |
| External | Slack Connect / shared channels | Cross-org ACL and data residency |

Functional details worth stating clearly

Conversation id is the shard key. Channel C123, DM D456, thread parent ts—ordering is per conversation.

Events vs REST. Clients mutate via Web API; realtime updates arrive as typed events over WebSocket.

Out of scope today (say it aloud). Building Zoom from scratch inside chat, or full E2EE for all workspaces—state if excluded.

Step 2 — Non-functional requirements (engineering promises)

| Category | Target (typical) | How we meet it | If we miss it |
| --- | --- | --- | --- |
| Latency: message visible | p95 < 500 ms online | WebSocket push after durable write | “Slack feels slow” |
| Latency: search | Seconds acceptable | Async indexer; near-real-time | Users cannot find decisions |
| Durability | No lost acknowledged posts | Write-before-ACK; replicated DB | Lost incident timelines |
| Order | Per-conversation total order | ts timestamps with tie-break | Scrambled thread reads |
| Availability | 99.9%+ chat monthly | Regional gateways, DB failover | Work stops |
| Scale: large channel | #general with 10k members | Mention limits, fan-out batching, read-only modes | Notification + WS meltdown |
| Security | Token scopes, workspace isolation | OAuth bot scopes; Grid boundaries | Data leak across teams |

Key idea: Realtime delivery and searchable history pull in opposite directions—optimize the hot path for WebSockets; accept seconds of search lag with clear UX.

Step 3 — Napkin math (messages, sockets, and index)
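
The three quantities named in Step 0 (messages per day, WebSocket connections, index growth) can be estimated with a few multiplications. The sketch below is only a template: every input is a made-up placeholder chosen for illustration, not a Slack figure, and the names `Assumptions` and `estimate` are hypothetical.

```python
# Napkin-math template. ALL inputs are hypothetical placeholders, not real Slack numbers.
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Assumptions:
    daily_active_users: int = 10_000_000     # assumed
    messages_per_user_per_day: int = 50      # assumed
    avg_message_bytes: int = 1_000           # assumed, text plus metadata
    sessions_per_user: float = 1.5           # assumed desktop + mobile overlap
    peak_to_average: float = 3.0             # assumed burst factor


def estimate(a: Assumptions) -> dict:
    messages_per_day = a.daily_active_users * a.messages_per_user_per_day
    avg_writes_per_sec = messages_per_day / SECONDS_PER_DAY
    return {
        "messages_per_day": messages_per_day,
        "avg_writes_per_sec": avg_writes_per_sec,
        "peak_writes_per_sec": avg_writes_per_sec * a.peak_to_average,
        "websocket_connections": int(a.daily_active_users * a.sessions_per_user),
        # Raw bytes entering the store and the index pipeline each day.
        "storage_bytes_per_day": messages_per_day * a.avg_message_bytes,
    }


if __name__ == "__main__":
    for name, value in estimate(Assumptions()).items():
        print(f"{name}: {value:,.0f}")
```

Swap in your own inputs; the useful part is seeing which term dominates (connections, write rate, or bytes per day) before picking storage or gateway technology.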

Step 4 — Architecture: three loops

Edge API authenticates OAuth tokens. Chat service validates ACL, assigns ts, writes message row. Fan-out service determines recipients (channel members minus mutes), pushes to gateway sessions and notification queue. Search pipeline consumes events into Elasticsearch/OpenSearch. File service handles uploads to object storage.

flowchart TB
  subgraph clients [Clients]
    DESK[Desktop / web]
    MOB[Mobile]
  end
  subgraph edge [Edge]
    LB[Load balancer]
    API[Web API]
    GW[Realtime gateway]
  end
  subgraph core [Core]
    CHAT[Chat / messages]
    GRAPH[Membership]
    NOTIF[Notifications]
    FILES[Files]
  end
  subgraph data [Data]
    DB[(Sharded SQL / NoSQL)]
    R[("Redis presence")]
    OBJ[(Object store)]
    SRCH[(Search index)]
  end
  DESK --> LB
  MOB --> LB
  LB --> API --> CHAT
  CHAT --> DB
  CHAT --> GRAPH
  CHAT --> GW
  CHAT --> NOTIF
  API --> FILES --> OBJ
  CHAT -.->|message.created| SRCH
  GW --> R
    

Step 5 — Walk one message end to end

User posts “Deploying v2.3” in #releases with @here.

  1. Client generates client_msg_id; calls chat.postMessage with channel, blocks, text.
  2. API checks bot/user token scopes, channel membership, posting permissions.
  3. Chat service allocates ts (a sortable string with roughly microsecond precision), inserts the row, returns ok: true.
  4. Fan-out loads channel member ids; filters muted/absent; enqueues gateway pushes and mobile push jobs.
  5. Gateway emits message event to online sessions subscribed to workspace + channel.
  6. Indexer async indexes text for search; link unfurl worker fetches OG tags in background.
  7. @here — notification service applies policy (only active members? rate limit?) before APNS/FCM.
sequenceDiagram
  participant C as Client
  participant A as Web API
  participant M as Chat service
  participant G as Gateway
  participant N as Notify
  C->>A: chat.postMessage
  A->>M: ACL + insert
  M-->>A: ts + ok
  A-->>C: 200 response
  M->>G: message event
  M->>N: push jobs
  G-->>C: WebSocket message
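
The walk above can be sketched as one function: check membership, write, then fan out. This is a minimal in-memory illustration, not a real implementation. `Workspace`, `post_message`, and the outbox/queue fields are hypothetical names, and skipping the sender during fan-out is a simplifying assumption.

```python
# Sketch of the post path: ACL check -> durable write -> fan-out.
# All names are hypothetical; lists stand in for the store, gateway, and push queue.
from dataclasses import dataclass, field


@dataclass
class Workspace:
    members: dict[str, set[str]]                              # channel_id -> user ids
    online: set[str] = field(default_factory=set)             # users with a live socket
    messages: list[dict] = field(default_factory=list)        # stands in for the message store
    socket_outbox: list[tuple[str, dict]] = field(default_factory=list)
    push_queue: list[tuple[str, dict]] = field(default_factory=list)


def post_message(ws: Workspace, user: str, channel: str, text: str, ts: str) -> dict:
    # Step 2: reject non-members before anything is written.
    if user not in ws.members.get(channel, set()):
        return {"ok": False, "error": "not_in_channel"}
    # Step 3: the write happens before the ACK and before any push.
    message = {"type": "message", "channel": channel, "user": user, "text": text, "ts": ts}
    ws.messages.append(message)
    # Steps 4, 5, 7: online members get a socket event, everyone else a push job.
    for member in sorted(ws.members[channel] - {user}):
        target = ws.socket_outbox if member in ws.online else ws.push_queue
        target.append((member, message))
    return {"ok": True, "channel": channel, "ts": ts}
```

The ordering is the point: nothing is pushed for a message that was not stored, and a rejected post leaves no trace in the store.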
    

Step 6 — Conversations: channels, threads, and DMs

messages(
  team_id,
  channel_id,
  ts,              -- primary key with channel_id
  user_id,
  text,
  blocks_json,
  thread_ts,       -- null if top-level
  client_msg_id,
  deleted
)
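
To make the schema concrete, here is a minimal sketch that creates the same table in an in-memory SQLite database and runs the two most common reads: channel history and thread replies. SQLite, the column types, and the helper names `channel_history` and `thread_replies` are assumptions for illustration. Ordering `ts` as text only works because every value has the same width.

```python
# Sketch: the messages table in SQLite with (channel_id, ts) as the primary key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        team_id       TEXT NOT NULL,
        channel_id    TEXT NOT NULL,
        ts            TEXT NOT NULL,
        user_id       TEXT NOT NULL,
        text          TEXT,
        blocks_json   TEXT,
        thread_ts     TEXT,                 -- NULL for top-level messages
        client_msg_id TEXT,
        deleted       INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (channel_id, ts)
    )
""")
conn.executemany(
    "INSERT INTO messages (team_id, channel_id, ts, user_id, text, thread_ts) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("T1", "C1", "1716123456.000100", "U1", "Deploying v2.3", None),
        ("T1", "C1", "1716123460.000200", "U2", "ack", "1716123456.000100"),
        ("T1", "C1", "1716123470.000300", "U3", "lunch?", None),
    ],
)


def channel_history(channel_id: str) -> list[str]:
    """Top-level messages in one channel, newest first."""
    cur = conn.execute(
        "SELECT text FROM messages WHERE channel_id = ? AND thread_ts IS NULL "
        "AND deleted = 0 ORDER BY ts DESC",
        (channel_id,),
    )
    return [row[0] for row in cur]


def thread_replies(channel_id: str, parent_ts: str) -> list[str]:
    """Replies under one parent message, oldest first."""
    cur = conn.execute(
        "SELECT text FROM messages WHERE channel_id = ? AND thread_ts = ? "
        "AND deleted = 0 ORDER BY ts ASC",
        (channel_id, parent_ts),
    )
    return [row[0] for row in cur]
```

Both queries stay inside one `channel_id`, which is what lets the conversation id double as the shard key.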

Step 7 — Realtime gateway: WebSockets and events

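Desktop and web clients open a WebSocket using a URL obtained with their token; in the public API, Socket Mode apps get that URL from apps.connections.open, and older integrations used the legacy RTM API. The server pushes JSON events such as message, message_changed, user_typing, and reaction_added.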

Step 8 — Storage, ordering, and deduplication

Within a channel, ts is unique and sortable, so conversations.history can return messages newest first. A client_msg_id that is unique per user prevents a double post on retry. Edits emit message_changed with the new blocks; deletes set the deleted tombstone flag.
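
A minimal sketch of both ideas, assuming in-memory state: a per-channel allocator that never hands out the same ts twice, and a dedup map keyed on (user, client_msg_id). `TsAllocator` and `MessageLog` are hypothetical names; a real system would persist this state and enforce uniqueness in the database.

```python
# Sketch: per-channel ts allocation with a tie-break, plus client_msg_id dedup.
import time
from typing import Optional


class TsAllocator:
    def __init__(self) -> None:
        self._last_micros: dict[str, int] = {}

    def next_ts(self, channel_id: str, now: Optional[float] = None) -> str:
        micros = int((time.time() if now is None else now) * 1_000_000)
        # Tie-break: never repeat or move backwards within one channel.
        micros = max(micros, self._last_micros.get(channel_id, 0) + 1)
        self._last_micros[channel_id] = micros
        return f"{micros // 1_000_000}.{micros % 1_000_000:06d}"


class MessageLog:
    def __init__(self) -> None:
        self._alloc = TsAllocator()
        self._seen: dict[tuple[str, str], str] = {}   # (user_id, client_msg_id) -> ts
        self.rows: list[dict] = []

    def post(self, channel_id: str, user_id: str, client_msg_id: str,
             text: str, now: Optional[float] = None) -> str:
        key = (user_id, client_msg_id)
        if key in self._seen:
            # Retry of an already stored post: return the original ts, write nothing.
            return self._seen[key]
        ts = self._alloc.next_ts(channel_id, now)
        self.rows.append({"channel": channel_id, "ts": ts, "user": user_id, "text": text})
        self._seen[key] = ts
        return ts
```

Returning the original ts on a retry makes the post idempotent from the client's point of view: it gets the same answer whether the first response was lost or not.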

Sharding: by team_id or channel_id to keep hot channels isolated.
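
One simple way to route a conversation to a shard is a stable hash of its id. The sketch below is an assumption-laden illustration: `NUM_SHARDS`, the key format, and `shard_for` are hypothetical, and plain modulo hashing reshuffles most keys when the shard count changes, so a production system would more likely use a lookup directory or consistent hashing.

```python
# Sketch: stable shard routing keyed on the conversation (team_id + channel_id).
import hashlib

NUM_SHARDS = 64  # hypothetical shard count


def shard_for(team_id: str, channel_id: str, num_shards: int = NUM_SHARDS) -> int:
    # hashlib is used instead of built-in hash(), which is salted per process
    # for strings and would route the same channel differently after a restart.
    key = f"{team_id}:{channel_id}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Every message in a channel lands on the same shard, so per-conversation ordering never needs a cross-shard transaction.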

Step 9 — Fan-out, mentions, and notifications

| Trigger | Who gets notified | Guardrail |
| --- | --- | --- |
| @user | That user (if in channel) | Respect DND schedule |
| @channel | All members | Confirm prompt in large channels |
| @here | Active members only | Rate limit per sender |
| Thread reply | Thread participants + channel watchers | Configurable |

Mobile push goes through platform gateways (APNs/FCM); desktop may use OS notifications while the app is backgrounded. Badge counts sync from the unread-state service, which aggregates unread counts per channel and DM.
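
The first three rows of the mention table can be sketched as a recipient resolver. This is an illustration under stated assumptions: `Member`, `recipients`, `needs_confirmation`, and the `LARGE_CHANNEL` threshold are hypothetical, and DND is modeled as a simple flag rather than a schedule.

```python
# Sketch of mention resolution. Names and the threshold are hypothetical.
from dataclasses import dataclass
from typing import Optional

LARGE_CHANNEL = 1000  # assumed member count that triggers the confirm prompt


@dataclass(frozen=True)
class Member:
    user_id: str
    active: bool = False
    dnd: bool = False      # do-not-disturb currently in effect


def recipients(trigger: str, members: list[Member], sender: str,
               target: Optional[str] = None) -> list[str]:
    others = [m for m in members if m.user_id != sender]
    if trigger == "@user":
        chosen = [m for m in others if m.user_id == target]   # empty if not in channel
    elif trigger == "@channel":
        chosen = others
    elif trigger == "@here":
        chosen = [m for m in others if m.active]
    else:
        raise ValueError(f"unknown trigger: {trigger}")
    # Guardrail shared by every trigger: DND suppresses the notification.
    return [m.user_id for m in chosen if not m.dnd]


def needs_confirmation(trigger: str, member_count: int) -> bool:
    return trigger == "@channel" and member_count >= LARGE_CHANNEL
```

Rate limiting per sender would sit in front of `recipients`; it is left out here to keep the sketch short.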

Step 10 — Search and compliance export

The search index stores message text, file names, and extracted PDF text. Queries are scoped to team_id and to the channels visible to the requesting user. For enterprise workspaces, legal hold prevents hard deletes, and eDiscovery exports write to bulk files. Index lag of 1–60 s is normal; show “still indexing” for very new messages if needed.
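
Query scoping is easiest to see on a toy inverted index. The sketch below is not how Elasticsearch or OpenSearch work internally; `ToyIndex` is a hypothetical stand-in that shows only the filter on team_id and visible channels being applied to every query.

```python
# Sketch: a toy inverted index whose results are filtered by team and visible channels.
from collections import defaultdict


class ToyIndex:
    def __init__(self) -> None:
        self._postings: dict[str, set[int]] = defaultdict(set)   # token -> doc ids
        self._docs: list[dict] = []

    def add(self, team_id: str, channel_id: str, ts: str, text: str) -> None:
        doc_id = len(self._docs)
        self._docs.append({"team": team_id, "channel": channel_id, "ts": ts, "text": text})
        for token in text.lower().split():
            self._postings[token].add(doc_id)

    def search(self, query: str, team_id: str, visible_channels: set[str]) -> list[dict]:
        tokens = query.lower().split()
        if not tokens:
            return []
        # AND semantics: a document must contain every query token.
        ids = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        hits = [self._docs[i] for i in sorted(ids)]
        # The ACL filter runs server-side on every query.
        return [d for d in hits if d["team"] == team_id and d["channel"] in visible_channels]
```

In a real cluster the same filter would be part of the query itself, so documents the user cannot see never leave the index.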

Step 11 — Files, link unfurls, and Block Kit

files.upload → object storage + virus scan → share to channel with file_id. Unfurl — when URL detected, async fetch OpenGraph; cache preview; respect robots.txt and SSRF guards. Block Kit — JSON layout for rich messages; apps render buttons that POST to interaction endpoints.
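
An SSRF guard for the unfurl worker can start with a pre-check like the one below. It is a sketch under assumptions: `is_safe_unfurl_url` is a hypothetical name, and the resolver is injected so the example runs offline. A real worker would resolve DNS itself and connect to the exact address it validated, to avoid a second lookup returning something different.

```python
# Sketch of an SSRF pre-check for unfurl fetches (resolver injected for offline use).
import ipaddress
from typing import Callable, Iterable
from urllib.parse import urlsplit


def is_safe_unfurl_url(url: str, resolve: Callable[[str], Iterable[str]]) -> bool:
    parts = urlsplit(url)
    # Only plain web URLs with a hostname are eligible for preview fetching.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = [ipaddress.ip_address(a) for a in resolve(parts.hostname)]
    except ValueError:
        return False
    # Reject when the host does not resolve, or when ANY address is private,
    # loopback, link-local, or otherwise not globally routable.
    return bool(addresses) and all(a.is_global for a in addresses)
```

Checking every resolved address matters: a hostname that returns one public and one internal address should still be refused.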

Step 12 — Permissions, Enterprise Grid, and audit

Step 13 — Apps: Events API, interactivity, workflows

Events API: Slack sends an HTTP POST to your server, which verifies it with the signing secret. Interactivity: a button click delivers a payload that the app must acknowledge within the 3 s window; use response_url for async updates. Workflow Builder: no-code triggers, still backed by the same event bus.

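# Verify Slack signature (conceptual)
import hashlib
import hmac

def verify(signing_secret: str, timestamp: str, raw_body: str, header_signature: str) -> bool:
    basestring = f"v0:{timestamp}:{raw_body}".encode()
    expected = "v0=" + hmac.new(signing_secret.encode(), basestring, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_signature)  # constant-time comparison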

Step 14 — Presence, typing, and huddles (adjacent)

Presence — active/away/auto; ephemeral in Redis; not durable history. Typing — fire-and-forget events; drop under load. Huddles/calls — separate media SFU path; signaling may start in chat but bytes do not traverse message store.
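
Ephemeral presence boils down to keys that expire unless the client keeps refreshing them. The sketch below emulates that with an in-memory dictionary rather than Redis; `PresenceStore`, its methods, and the 60-second TTL are hypothetical.

```python
# Sketch: presence as expiring keys, emulating a Redis-style TTL in memory.
import time
from typing import Optional


class PresenceStore:
    def __init__(self, ttl_seconds: float = 60.0) -> None:   # TTL value is an assumption
        self._ttl = ttl_seconds
        self._expires_at: dict[str, float] = {}

    def heartbeat(self, user_id: str, now: Optional[float] = None) -> None:
        """Called whenever the client's socket reports activity."""
        now = time.monotonic() if now is None else now
        self._expires_at[user_id] = now + self._ttl

    def status(self, user_id: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        # No heartbeat on record, or an expired one, both read as "away".
        return "active" if self._expires_at.get(user_id, float("-inf")) > now else "away"
```

Nothing here is written to durable storage, so losing the store only means everyone looks away until their next heartbeat.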

Step 15 — Technical layer: Web API methods

| Method | Purpose |
| --- | --- |
| chat.postMessage | Send message to channel/DM |
| conversations.history | Paginate messages (cursor, limit) |
| conversations.replies | Thread messages |
| search.messages | Query index (query, filters) |
| users.conversations | List channels/DMs for user |
| reactions.add | Emoji reaction on ts |
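
Cursor pagination for conversations.history can be sketched over an in-memory list. The response shape (has_more, response_metadata.next_cursor) follows Slack's documented convention, but the cursor encoding here, a base64 of the last ts returned, is an assumption for illustration; real cursors are opaque. `history_page` is a hypothetical name.

```python
# Sketch: cursor pagination over newest-first history.
# Assumes every ts has the same width, so string comparison matches time order.
import base64


def history_page(messages: list[dict], limit: int, cursor: str = "") -> dict:
    if limit < 1:
        raise ValueError("limit must be at least 1")
    ordered = sorted(messages, key=lambda m: m["ts"], reverse=True)
    if cursor:
        # Resume strictly after the last message of the previous page.
        before = base64.urlsafe_b64decode(cursor.encode()).decode()
        ordered = [m for m in ordered if m["ts"] < before]
    page = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = (
        base64.urlsafe_b64encode(page[-1]["ts"].encode()).decode() if has_more else ""
    )
    return {
        "ok": True,
        "messages": page,
        "has_more": has_more,
        "response_metadata": {"next_cursor": next_cursor},
    }
```

Because the cursor names a position in the ordering rather than an offset, new messages arriving between requests do not shift or duplicate older pages.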

Post message (illustrative HTTP):

POST https://slack.com/api/chat.postMessage
Authorization: Bearer xoxb-…
Content-Type: application/json

{
  "channel": "C01234567",
  "text": "Deploying v2.3",
  "blocks": [ … ],
  "client_msg_id": "uuid-4f2a-…"
}

→ 200
{ "ok": true, "channel": "C01234567", "ts": "1716123456.789012", "message": { … } }

WebSocket event envelope (simplified):

{
  "type": "event_callback",
  "event": {
    "type": "message",
    "channel": "C01234567",
    "user": "U09…",
    "text": "Deploying v2.3",
    "ts": "1716123456.789012"
  }
}
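
On the receiving side, a client typically dispatches on the inner event type. The sketch below parses the simplified envelope shown above; `HANDLERS`, `on`, `render_message`, and `dispatch` are hypothetical names, not part of any Slack SDK.

```python
# Sketch: dispatch on the inner event type of the simplified envelope.
import json
from typing import Callable, Optional

HANDLERS: dict[str, Callable[[dict], str]] = {}


def on(event_type: str):
    """Register a handler for one inner event type."""
    def register(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        HANDLERS[event_type] = fn
        return fn
    return register


@on("message")
def render_message(event: dict) -> str:
    return f'[{event["channel"]}] {event["user"]}: {event["text"]}'


def dispatch(raw: str) -> Optional[str]:
    envelope = json.loads(raw)
    event = envelope.get("event", {})
    handler = HANDLERS.get(event.get("type"))
    # Unknown event types are skipped so new server events do not break old clients.
    return handler(event) if handler else None
```

Ignoring unknown types is a deliberate choice: it lets the server add event kinds without forcing every client to upgrade first.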

Step 16 — Reliability, observability, and failure modes

Metrics: post p95, WS connect success, fan-out queue depth, push delivery rate, search lag, API 429 rate.

Step 17 — Goals → knobs (quick reference)

| Goal | Knob |
| --- | --- |
| Messages feel instant | Write-then-push, regional gateways, lean event payloads |
| Find anything later | Search cluster capacity, file text extraction, retention policy |
| Safe enterprise | SSO, SCIM, Grid isolation, audit exports, DLP scanning |
| Survive #general | Mention gates, fan-out batching, optional slow mode |
| Rich integrations | Scoped bots, signed webhooks, Socket Mode for firewalled apps |

Step 18 — Close the loop (what to practice)

On a whiteboard: three loops, one chat.postMessage, show WebSocket fan-out vs search indexer.

Out loud: difference between channel message and thread reply; how @here differs from @channel.

With the technical section: trace post → ts → gateway event → search.messages minutes later.

The one line to remember

Slack-class systems are durable chat logs per conversation plus a realtime event fan-out layer and a search pipeline playing catch-up. Get ordering and ACL right on write; push fast to sockets; let search and unfurls be asynchronous.