How Google Docs works at scale
Collaborative editing looks like a word processor in the browser. At scale it is a distributed log of tiny edits that many people apply at once—plus permissions, revision history, and export pipelines that must never corrupt the document.
We work through the design in order—requirements first, numbers second, architecture third, APIs last—using a Google Docs–class product as the mental model, not any one company’s private implementation.
What you should be able to do after reading:
- Separate the three loops—collaboration, document persistence, and access control.
- List functional and non-functional requirements for editors, viewers, and admins.
- Walk one keystroke from client op → server transform → broadcast → durable revision.
- Contrast OT vs CRDT and explain revision numbers and server authority.
- Read the technical section: WebSocket frames, operation batches, and REST metadata APIs.
Step 0 — How we will work through the problem
Ordered thinking beats memorizing boxes. Use this sequence when you design real-time collaborative documents:
- Clarify scope. Docs only or Sheets/Slides too? Offline editing? Guest commenters? Enterprise DLP in scope?
- Write requirements. Functional = editing, sharing, history. Non-functional = sync latency, durability, never lose edits.
- Do napkin math. Active docs, ops per second per doc, revision row growth—so nobody stores the whole internet in one JSON blob.
- Draw three loops before naming WebSockets or Spanner.
- Tell one story—two users type in the same paragraph—then failure cases (partition, reconnect, conflicting paste).
flowchart TB
subgraph collab [Collaboration loop]
C1[Client A ops] --> WS[Realtime gateway]
C2[Client B ops] --> WS
WS --> OT[Transform / order]
OT --> FAN[Broadcast to sessions]
end
subgraph doc [Document loop]
OT --> LOG[Append revision log]
LOG --> SNAP[Periodic snapshots]
SNAP --> BLOB[(Object store)]
end
subgraph access [Access loop]
ACL[Permissions service] --> WS
ACL --> API[REST metadata]
SHARE[Share links] --> ACL
end
Step 1 — Functional requirements (editors, viewers, admins)
| Actor | Requirement | Why scale makes it hard |
|---|---|---|
| Editor | Type, format, insert images/tables, undo/redo | Every keystroke is an op; fan-out to N collaborators |
| Editor | See others’ cursors and selections in near real time | Presence channel separate from document ops |
| Viewer | Read-only open; copy allowed per policy | Same render path without accepting ops |
| Commenter | Comments and @mentions without editing body | Comment anchoring to volatile positions |
| Suggest mode | Proposed edits require owner accept/reject | Branching suggestion layer on top of canonical doc |
| Owner | Share by user/group/link; transfer ownership | ACL evaluation on every op and API call |
| All | Version history; restore named revision | Compaction vs infinite op log |
| All | Export PDF, DOCX, etc. | Async render farm reads snapshot + tail ops |
| Admin | Org policies, retention, legal hold, audit | eDiscovery exports across millions of files |
Functional details worth stating clearly
Operations are the source of truth, not the HTML DOM. The server stores an ordered sequence of ops (or periodic snapshots + tail).
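Undo is both local and global. Each client keeps a local undo stack that inverts its own recent ops; the inverse is submitted like any other op, so the server may still reject it if the document revision has moved on.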
Out of scope today (say it aloud). Building a full layout engine from scratch, real-time video co-editing, or on-prem blockchain audit—park them.
Step 2 — Non-functional requirements (engineering promises)
| Category | Target (typical) | How we meet it | If we miss it |
|---|---|---|---|
| Latency — remote edit visible | p95 < 200 ms same region | WebSocket, regional doc shards, small op payloads | Feels like email, not “live” |
| Latency — open doc | p95 < 2 s cold start | Latest snapshot + tail replay | Users think tab crashed |
| Durability | No acknowledged op lost | Write-ahead log before ACK to client | Trust destroyed forever |
| Consistency | Total order of ops per document | Single sequencer per doc shard | Garbled text, divergent forks |
| Availability | 99.9%+ edit path monthly | Replica gateways, doc shard failover | Classroom / deal room stops |
| Scale — hot doc | 100+ viewers; 10–20 concurrent editors | Op batching, read-only fan-out path | Gateway meltdown on viral doc |
| Scale — corpus | Billions of files | Shard by doc_id, tier cold storage | One DB owns everything |
| Security | Least privilege per op | ACL check at gateway + storage | Link leak edits entire company drive |
Key idea: Collaboration wants a single authoritative order per document. Analytics and search can be eventual; the edit log cannot.
Step 3 — Napkin math (why one JSON file is not enough)
- ~2B+ monthly active Google Workspace users (ecosystem scale—not all editing at once).
- Suppose 500M docs receive at least one edit per day; average 50 ops per active doc per day → 25B ops/day ≈ 300k ops/s globally (peaks higher in business hours).
- Each op might be 50–500 bytes of JSON → roughly 1–12 TB of raw log per day (25B ops × 50–500 B) before replication and compaction, so snapshots and garbage collection are mandatory.
- A viral doc with 200 viewers and 20 editors could see 100+ ops/s on one hot shard; it needs batching and possibly a dedicated collaboration host.
- Images in docs: multi-MB objects in blob store; ops only reference `image_id`.
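The arithmetic is easy to check in a few lines. This is only a sketch that reuses the assumed inputs from the list (500M active docs, 50 ops each, 50 to 500 bytes per op); it counts raw bytes once, before replication, index overhead, or compaction, and the variable names are illustrative.

```python
# Napkin math for the op log, using the assumptions stated in the list above.
ACTIVE_DOCS_PER_DAY = 500_000_000      # docs with at least one edit per day
OPS_PER_ACTIVE_DOC = 50                # average ops per active doc per day
SECONDS_PER_DAY = 24 * 60 * 60         # 86,400
OP_BYTES_LOW, OP_BYTES_HIGH = 50, 500  # assumed JSON op size range, in bytes

ops_per_day = ACTIVE_DOCS_PER_DAY * OPS_PER_ACTIVE_DOC
ops_per_second = ops_per_day / SECONDS_PER_DAY  # global average, not peak
tb_per_day_low = ops_per_day * OP_BYTES_LOW / 1e12
tb_per_day_high = ops_per_day * OP_BYTES_HIGH / 1e12

print(f"ops/day:   {ops_per_day:,}")        # 25,000,000,000
print(f"ops/s avg: {ops_per_second:,.0f}")  # 289,352
print(f"raw log:   {tb_per_day_low:.2f} to {tb_per_day_high:.1f} TB/day")  # 1.25 to 12.5
```

Even at the low end the log grows by more than a terabyte a day, which is why the design leans on snapshots and compaction rather than a single ever-growing blob.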
Step 4 — Architecture: three loops
Browser clients talk to an edge API for metadata (title, permissions) and a realtime gateway for ops.
Each document maps to a shard with a sequencer that assigns monotonic revision numbers.
The document loop appends to a write-ahead log, compacts into snapshots, and stores large blobs separately.
flowchart TB
subgraph clients [Clients]
WEB[Browser editor]
MOB[Mobile app]
end
subgraph edge [Edge]
LB[Load balancer]
META[Metadata API]
RT[Realtime gateway]
end
subgraph core [Document shard]
SEQ[Sequencer]
OT[OT / CRDT engine]
WAL[(Revision log)]
SNAP[Snapshot builder]
end
subgraph stores [Stores]
SQL[(Doc metadata DB)]
OBJ[(Blob store)]
SRCH[(Optional index)]
end
WEB --> LB
MOB --> LB
LB --> META
LB --> RT
META --> SQL
RT --> SEQ --> OT --> WAL
OT --> SNAP --> OBJ
WAL --> OBJ
Step 5 — Walk one edit end to end
Two users edit the same paragraph. User A inserts “Hello” at index 42.
- Client A generates op `{type: insert, index: 42, text: "Hello", client_rev: 118}` and sends over WebSocket.
- Gateway authenticates session, checks ACL `role >= editor`, routes to document shard `doc_7xk`.
- Sequencer receives op, transforms against any concurrent ops since rev 118 (OT) or merges (CRDT), assigns `server_rev: 119`.
- WAL persists op 119 durably (quorum write) before ACK to clients.
- Fan-out pushes transformed op to all subscribed sessions (A, B, …).
- Client B applies op 119 locally; UI updates text and shifts B’s pending cursor indices.
- Presence (parallel channel) may show A’s cursor near index 42 without blocking the critical path.
sequenceDiagram
participant A as Client A
participant G as Gateway
participant S as Doc shard
participant B as Client B
A->>G: op insert @42 (client_rev 118)
G->>S: authorize + forward
S->>S: transform + assign rev 119
S->>S: append WAL
S-->>A: ACK rev 119
S-->>B: op 119 transformed
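The sequencer steps in this walk-through can be sketched in a few lines. This is a minimal illustration, not any product's implementation: it handles insert ops only, the names `DocShard`, `submit`, and `shift_past` are hypothetical, and the in-memory list stands in for a durable, quorum-replicated WAL.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Insert:
    index: int
    text: str


def shift_past(op: Insert, earlier: Insert) -> Insert:
    """Adjust `op` so it applies after `earlier`, which the server ordered first."""
    if earlier.index <= op.index:
        return replace(op, index=op.index + len(earlier.text))
    return op


class DocShard:
    """Single sequencer for one document (hypothetical sketch)."""

    def __init__(self, text: str, head_rev: int) -> None:
        self.text = text
        self.head_rev = head_rev
        self.wal = []  # list of (rev, Insert); stand-in for a durable log

    def submit(self, op: Insert, client_rev: int):
        if client_rev > self.head_rev:
            raise ValueError("client claims a revision the server never issued")
        # 1. Transform against every logged op the client had not seen.
        for rev, earlier in self.wal:
            if rev > client_rev:
                op = shift_past(op, earlier)
        # 2. Assign the next revision number.
        rev = self.head_rev + 1
        # 3. Persist before acknowledging (a real shard waits for a quorum write).
        self.wal.append((rev, op))
        # 4. Apply to canonical state and advance the head.
        self.text = self.text[: op.index] + op.text + self.text[op.index :]
        self.head_rev = rev
        # 5. This pair is what gets ACKed to the author and broadcast to others.
        return rev, op


shard = DocShard(text="x" * 50, head_rev=118)
rev_a, op_a = shard.submit(Insert(42, "Hello"), client_rev=118)
rev_b, op_b = shard.submit(Insert(45, "!"), client_rev=118)  # B had not seen rev 119
print(rev_a, op_a.index)  # 119 42
print(rev_b, op_b.index)  # 120 50
```

Client B's op was written against revision 118, so the shard shifts it right by the five characters of "Hello" before assigning revision 120.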
Step 6 — OT vs CRDT: keeping concurrent edits sane
Operational Transformation (OT) — clients send ops against a known revision; the server transforms concurrent ops so everyone converges. Classic Google Docs–era approach: server is authoritative; complex but battle-tested for rich text with tables.
CRDTs — data structures designed so merge is commutative; peers can sync without a central transformer in some designs. Popular in newer editors (Notion-class, Figma-class). Rich text CRDTs exist (Yjs, Automerge) but payload size and complexity differ from OT.
| Approach | Strength | Cost |
|---|---|---|
| OT + central server | Strong ordering; easier global undo policy | Server CPU per op; harder to do P2P |
| CRDT | Offline-friendly; peer sync | Larger states; tombstones; format migration |
Either way, clients must handle rebase: while offline they buffer ops; on reconnect the server sends the missing revision range (for example, revs 119–140) for replay, and the client reapplies its buffered ops on top.
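To make the OT side concrete, here is a toy transform function for plain text. It is a sketch under strong assumptions: only two op types (insert a string, delete one character), a hypothetical `site` field as the tie-breaker, and none of the formatting, range, or table ops a real rich-text editor needs. The point it illustrates is convergence: applying the two ops in either order, with the second one transformed, yields the same text.

```python
from dataclasses import dataclass, replace
from typing import Optional, Union


@dataclass(frozen=True)
class Insert:
    index: int
    text: str
    site: str  # client id, used only to break ties at the same index


@dataclass(frozen=True)
class Delete:
    index: int  # deletes exactly one character at this index


Op = Union[Insert, Delete]


def transform(op: Op, other: Op) -> Optional[Op]:
    """Rewrite `op` so it can be applied after the concurrent op `other`."""
    if isinstance(op, Insert) and isinstance(other, Insert):
        other_first = other.index < op.index or (
            other.index == op.index and other.site < op.site
        )
        return replace(op, index=op.index + len(other.text)) if other_first else op
    if isinstance(op, Insert) and isinstance(other, Delete):
        return replace(op, index=op.index - 1) if other.index < op.index else op
    if isinstance(op, Delete) and isinstance(other, Insert):
        shifted = other.index <= op.index
        return replace(op, index=op.index + len(other.text)) if shifted else op
    # Delete vs Delete
    if other.index < op.index:
        return replace(op, index=op.index - 1)
    if other.index == op.index:
        return None  # both sides removed the same character; nothing left to do
    return op


def apply(text: str, op: Optional[Op]) -> str:
    if op is None:
        return text
    if isinstance(op, Insert):
        return text[: op.index] + op.text + text[op.index :]
    return text[: op.index] + text[op.index + 1 :]


def converge(base: str, a: Op, b: Op):
    """Apply a then b', and b then a'; both results should match."""
    via_a = apply(apply(base, a), transform(b, a))
    via_b = apply(apply(base, b), transform(a, b))
    return via_a, via_b


print(converge("abcdef", Insert(2, "XY", "A"), Delete(2)))
# ('abXYdef', 'abXYdef')
print(converge("abcdef", Insert(2, "X", "A"), Insert(2, "Y", "B")))
# ('abXYcdef', 'abXYcdef')
```

With a central server only one of the two orders ever becomes canonical, but clients still rely on the same property when they rebase their own pending ops.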
Step 7 — Realtime transport: WebSockets and session stickiness
Use WebSockets (or HTTP/2 streams) for bidirectional op traffic. Initial doc open:
GET snapshot + GET revisions?from=… over HTTPS, then upgrade socket for live ops.
- Stickiness: route `doc_id` to the same gateway shard when possible to preserve ordering buffers.
- Heartbeat: detect dead tabs; flush presence after TTL.
- Backpressure: if client cannot apply ops fast enough, pause ACK and shed non-editors from fan-out.
- Binary frames: protobuf or msgpack for ops at scale; JSON acceptable for teaching diagrams.
Step 8 — Document model: revisions, snapshots, compaction
Store an append-only revision log:
- `revisions(doc_id, rev, op_payload, author_id, ts)`
- `snapshots(doc_id, rev, snapshot_blob_ref, created_at)`
Snapshot policy: every N revisions or M minutes, materialize full document state to object storage; new clients load latest snapshot + replay tail only. Compaction archives revisions older than last snapshot + legal retention window.
Large docs may split into segments (tabs, huge tables) each with its own sub-log to avoid one infinite hot row.
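The "latest snapshot + replay tail" load path can be sketched as follows. This is an illustrative, insert-only model; `load_document`, `should_snapshot`, and `archivable` are hypothetical helper names, and the every-N threshold of 100 is an arbitrary placeholder for the N in the snapshot policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    rev: int
    index: int
    text: str  # insert-only ops keep the sketch short


@dataclass(frozen=True)
class Snapshot:
    rev: int
    content: str


def load_document(snapshots, log):
    """Start from the newest snapshot, then replay only the revisions after it."""
    base = max(snapshots, key=lambda s: s.rev, default=Snapshot(0, ""))
    content, head = base.content, base.rev
    tail = sorted((r for r in log if r.rev > base.rev), key=lambda r: r.rev)
    for r in tail:
        if r.rev != head + 1:
            raise ValueError(f"gap in revision log after rev {head}")
        content = content[: r.index] + r.text + content[r.index :]
        head = r.rev
    return content, head


def should_snapshot(head_rev, last_snapshot_rev, every_n=100):
    """Snapshot policy: materialize state once N revisions have accumulated."""
    return head_rev - last_snapshot_rev >= every_n


def archivable(log, last_snapshot_rev):
    """Revisions already covered by the latest snapshot (retention rules permitting)."""
    return [r for r in log if r.rev <= last_snapshot_rev]


log = [Revision(1, 0, "a"), Revision(2, 1, "b"), Revision(3, 2, "c"), Revision(4, 0, ">")]
snapshots = [Snapshot(2, "ab")]
print(load_document(snapshots, log))  # ('>abc', 4)
print(load_document([], log))         # ('>abc', 4), same state via full replay
```

Both calls reach the same state; the snapshot only changes how many ops have to be replayed on open.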
Step 9 — Permissions, sharing, and org policies
- Roles — owner, editor, commenter, viewer; mapped from users, groups, domain-wide links.
- Link types — restricted, anyone with link, public on web; enforced at gateway before sequencer.
- Inheritance — folder ACLs in Drive-class products; changes propagate via event bus to doc ACL cache.
- Enterprise — DLP rules (block external share), retention locks, download disabled for viewers.
Cache effective ACL in Redis with short TTL; invalidate on permission.changed events—never trust client-side checks alone.
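A short-TTL cache with event-driven invalidation might look like the sketch below. It is an assumption-laden stand-in: an in-process dict plays the role of Redis, `load_role` is a hypothetical callable that reads the authoritative ACL store, and the role ordering simply encodes the `role >= editor` check as integer comparison.

```python
import time
from enum import IntEnum


class Role(IntEnum):
    NONE = 0
    VIEWER = 1
    COMMENTER = 2
    EDITOR = 3
    OWNER = 4


class AclCache:
    """Caches effective roles briefly; the ACL store stays the source of truth."""

    def __init__(self, load_role, ttl_seconds=30.0, clock=time.monotonic):
        self._load_role = load_role  # (doc_id, user_id) -> Role
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (doc_id, user_id) -> (Role, expires_at)

    def role_for(self, doc_id, user_id):
        key = (doc_id, user_id)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]
        role = self._load_role(doc_id, user_id)
        self._entries[key] = (role, now + self._ttl)
        return role

    def on_permission_changed(self, doc_id):
        """Handler for a permission.changed event: drop every entry for the doc."""
        for key in [k for k in self._entries if k[0] == doc_id]:
            del self._entries[key]


def can_submit_ops(cache, doc_id, user_id):
    # Server-side check, run at the gateway before an op reaches the sequencer.
    return cache.role_for(doc_id, user_id) >= Role.EDITOR
```

The TTL bounds how long a missed invalidation event can leave a stale role in effect, which is why the two mechanisms are used together.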
Step 10 — Comments, suggestions, and presence
Comments anchor to stable identifiers (paragraph id + offset) not raw indices that shift every keystroke. Suggestions store proposed ops in a side branch; accept merges into canonical log; reject discards branch.
Presence — ephemeral data: cursor color, selection range, “User is typing…” — Redis or in-memory on gateway; lossy is OK.
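The benefit of anchoring comments to a paragraph id plus offset, rather than a document-wide index, shows up in a small sketch. The types `Anchor` and `ParagraphInsert` and the function `rebase_anchor` are hypothetical, and only inserts are modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Anchor:
    paragraph_id: str
    offset: int  # character offset inside that paragraph


@dataclass(frozen=True)
class ParagraphInsert:
    paragraph_id: str
    offset: int
    text: str


def rebase_anchor(anchor: Anchor, op: ParagraphInsert) -> Anchor:
    """Move a comment anchor only when the edit lands before it in the same paragraph."""
    if op.paragraph_id != anchor.paragraph_id:
        return anchor  # edits elsewhere in the doc never touch this anchor
    if op.offset <= anchor.offset:
        return replace(anchor, offset=anchor.offset + len(op.text))
    return anchor


comment = Anchor("p2", 5)
print(rebase_anchor(comment, ParagraphInsert("p1", 0, "A long new intro. ")))
# Anchor(paragraph_id='p2', offset=5)
print(rebase_anchor(comment, ParagraphInsert("p2", 0, "Hi ")))
# Anchor(paragraph_id='p2', offset=8)
```

With raw document indices, the first edit would have shifted every anchor after it; with paragraph-scoped anchors, only edits in the same paragraph need any bookkeeping.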
Step 11 — Offline, reconnect, and conflict UX
- Client keeps local op queue + last known `server_rev`.
- On reconnect, send `catch_up from=server_rev+1`; server streams missing ops.
- Client rebases pending ops against incoming transforms; if impossible, show “copy your changes” modal.
- IndexedDB holds snapshot for read-only offline; sync when online returns.
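The reconnect steps above can be sketched as a small client model. Everything here is illustrative: `OfflineClient` is a hypothetical class, the dict returned by `catch_up_request` is an assumed message shape, and only insert ops are modeled, so the "impossible to rebase" branch never triggers in this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Insert:
    index: int
    text: str


def apply(text: str, op: Insert) -> str:
    return text[: op.index] + op.text + text[op.index :]


def transform(op: Insert, other: Insert, other_wins_ties: bool) -> Insert:
    """Adjust `op` so it applies after `other`."""
    if other.index < op.index or (other.index == op.index and other_wins_ties):
        return replace(op, index=op.index + len(other.text))
    return op


class OfflineClient:
    def __init__(self, text: str, server_rev: int) -> None:
        self.text = text              # local view: server state plus pending ops
        self.server_rev = server_rev  # last revision seen from the server
        self.pending = []             # local ops not yet acknowledged

    def local_insert(self, index: int, text: str) -> None:
        op = Insert(index, text)
        self.text = apply(self.text, op)
        self.pending.append(op)

    def catch_up_request(self) -> dict:
        return {"type": "catch_up", "from": self.server_rev + 1}

    def receive(self, rev: int, remote: Insert) -> None:
        if rev != self.server_rev + 1:
            raise ValueError("missing revisions; request catch-up again")
        rebased = []
        for local in self.pending:
            # Server ops were ordered first, so they win ties at the same index.
            rebased.append(transform(local, remote, other_wins_ties=True))
            remote = transform(remote, local, other_wins_ties=False)
        self.pending = rebased
        self.text = apply(self.text, remote)
        self.server_rev = rev


client = OfflineClient("hello world", server_rev=118)
client.local_insert(5, ",")              # typed while offline
print(client.catch_up_request())         # {'type': 'catch_up', 'from': 119}
client.receive(119, Insert(0, ">> "))    # someone else prefixed the line
client.receive(120, Insert(14, "!"))     # and appended to the server's text
print(client.text)                       # >> hello, world!
print(client.pending)                    # [Insert(index=8, text=',')]
```

After catch-up the pending comma has been rebased from index 5 to index 8, so submitting it against revision 120 produces the same text the client already shows.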
Step 12 — Export, import, and side pipelines
Export is async: a job reads the latest snapshot plus tail ops and renders the target format (PDF via headless render, DOCX via converter service). Import parses the uploaded file into an initial snapshot and records provenance. Virus scanning and content-policy checks run on uploads before they are merged into a collaborative doc.
flowchart LR
DOC[Canonical doc state] --> JOB[Export worker]
JOB --> PDF[PDF]
JOB --> DOCX[DOCX]
UP[Upload] --> PARSE[Import parser] --> DOC
Step 13 — Scale: hot documents and sharding
- Shard key: `doc_id` hashes to collaboration host; metadata may live in separate global DB.
- Read-heavy viral doc: split viewers onto read-only fan-out that receives throttled updates (e.g. 5 Hz) while editors stay full rate.
- Op batching: coalesce keystrokes within 20–50 ms windows to cut WAL writes (trade latency for throughput).
- Cold docs: move tail logs to cheaper storage; first open triggers warm-up from snapshot only.
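Op batching from the list above can be illustrated with a small coalescing function. This is a sketch under assumptions: the 30 ms window is one arbitrary point in the 20–50 ms range, only contiguous inserts are merged, and `Keystroke`, `Batch`, and `coalesce` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keystroke:
    at_ms: int  # client timestamp in milliseconds
    index: int
    text: str


@dataclass
class Batch:
    started_ms: int
    index: int
    text: str


def coalesce(keystrokes, window_ms=30):
    """Merge contiguous inserts that arrive within one batching window."""
    batches = []
    for key in keystrokes:
        last = batches[-1] if batches else None
        if (
            last is not None
            and key.at_ms - last.started_ms <= window_ms
            and key.index == last.index + len(last.text)  # typed right after the batch
        ):
            last.text += key.text
        else:
            batches.append(Batch(key.at_ms, key.index, key.text))
    return batches


typed = [
    Keystroke(0, 42, "H"),
    Keystroke(8, 43, "e"),
    Keystroke(15, 44, "l"),
    Keystroke(22, 45, "l"),
    Keystroke(40, 46, "o"),  # outside the 30 ms window, starts a new batch
]
batches = coalesce(typed)
print([(b.index, b.text) for b in batches])  # [(42, 'Hell'), (46, 'o')]
```

Five keystrokes become two log writes here; a wider window merges more but delays when collaborators see the first character.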
Step 14 — Technical layer: APIs and wire formats
| Operation | HTTP / WS | Success | Notes |
|---|---|---|---|
| Get metadata | `GET /v1/documents/{id}` | `200` | Title, owners, mime, revision head |
| Get snapshot | `GET /v1/documents/{id}/snapshot` | `200` | Binary or JSON model at rev |
| List revisions | `GET /v1/documents/{id}/revisions?from=100&to=150` | `200` | Paginated op log for catch-up |
| Submit ops (fallback) | `POST /v1/documents/{id}/operations` | `200` + new head rev | When WebSocket unavailable |
| Live channel | `WSS /v1/documents/{id}/channel` | Bi-directional | Ops, ACKs, presence events |
| Export | `POST /v1/documents/{id}/exports` | `202` + job id | Poll `GET …/exports/{job}` for URL |
WebSocket message (illustrative JSON):
{
"type": "op_batch",
"doc_id": "doc_7xk",
"client_rev": 118,
"ops": [
{"op": "insert", "index": 42, "text": "Hello", "client_op_id": "c-op-9"}
]
}
The message above travels client → server; the server replies with an ACK:
{
"type": "ack",
"server_rev": 119,
"transformed_ops": [ … ],
"head_rev": 119
}
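On the receiving side, messages like these are typically validated and decoded into typed structures before they reach the sequencer. The sketch below assumes the illustrative JSON shape shown above and handles insert ops only; `parse_op_batch` and `build_ack` are hypothetical helpers, not part of any published API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InsertOp:
    index: int
    text: str
    client_op_id: str


@dataclass(frozen=True)
class OpBatch:
    doc_id: str
    client_rev: int
    ops: tuple


def parse_op_batch(raw: str) -> OpBatch:
    """Decode and validate an op_batch frame; reject anything unexpected."""
    msg = json.loads(raw)
    if msg.get("type") != "op_batch":
        raise ValueError(f"unexpected message type: {msg.get('type')!r}")
    ops = []
    for item in msg["ops"]:
        if item.get("op") != "insert":
            raise ValueError(f"unsupported op in this sketch: {item.get('op')!r}")
        if item["index"] < 0:
            raise ValueError("negative index")
        ops.append(InsertOp(int(item["index"]), str(item["text"]), str(item["client_op_id"])))
    return OpBatch(str(msg["doc_id"]), int(msg["client_rev"]), tuple(ops))


def build_ack(server_rev: int, head_rev: int, transformed) -> str:
    """Encode the server's reply once the ops are durably logged."""
    return json.dumps({
        "type": "ack",
        "server_rev": server_rev,
        "transformed_ops": [{"op": "insert", **asdict(op)} for op in transformed],
        "head_rev": head_rev,
    })


raw = (
    '{"type": "op_batch", "doc_id": "doc_7xk", "client_rev": 118, '
    '"ops": [{"op": "insert", "index": 42, "text": "Hello", "client_op_id": "c-op-9"}]}'
)
batch = parse_op_batch(raw)
print(batch.doc_id, batch.client_rev, batch.ops[0].index)  # doc_7xk 118 42
ack = build_ack(server_rev=119, head_rev=119, transformed=batch.ops)
print(json.loads(ack)["head_rev"])  # 119
```

Echoing `client_op_id` back in the ACK lets the client match the reply to the entry in its pending queue.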
Logical tables
- `documents(id, owner_id, title, head_rev, snapshot_ref, …)`
- `revisions(doc_id, rev, op_payload, author_id, ts)`
- `sessions(doc_id, session_id, user_id, gateway_id, last_rev)`
- `acl(doc_id, principal, role)`
- `export_jobs(id, doc_id, format, status, output_url)`
Step 15 — Reliability, observability, and failure modes
Failure modes
- ACK without durable write — never; WAL first.
- Split brain on two sequencers — use leader election per doc shard; fencing tokens.
- Transform bug — feature flag new OT rules; replay test suite on production logs in shadow mode.
- Runaway paste — max op size; rate limit inserts per minute per user.
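Fencing tokens, mentioned for the split-brain case, can be shown with a tiny log that refuses writes from a deposed leader. This is a sketch, not a consensus implementation: it assumes some election service hands each new sequencer a strictly larger integer token, and `FencedLog` and `StaleLeaderError` are hypothetical names.

```python
class StaleLeaderError(Exception):
    """Raised when a sequencer with an outdated fencing token tries to write."""


class FencedLog:
    """Revision log that only accepts appends from the newest known leader."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.entries = []  # list of (rev, payload)

    def append(self, token: int, rev: int, payload: str) -> None:
        if token < self.highest_token:
            # An older leader woke up after a pause or partition; refuse it.
            raise StaleLeaderError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        expected = len(self.entries) + 1
        if rev != expected:
            raise ValueError(f"expected rev {expected}, got {rev}")
        self.entries.append((rev, payload))


log = FencedLog()
log.append(token=1, rev=1, payload="op from leader 1")
log.append(token=2, rev=2, payload="op from leader 2")  # new leader took over

try:
    log.append(token=1, rev=2, payload="late write from leader 1")
except StaleLeaderError as err:
    print("rejected:", err)  # rejected: token 1 < 2

print(len(log.entries))  # 2
```

The check lives in the storage layer on purpose: the old leader may not know it was replaced, so only the log can reliably refuse its writes.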
Observability
- Trace: open doc → snapshot load → catch-up → first op ACK.
- Metrics: op/s per doc, WAL lag, transform latency p95, reconnect rate, export queue depth.
- SLO example: 99.9% of ops ACK within 500 ms regional; zero lost acknowledged ops per quarter.
Step 16 — Goals → knobs (quick reference)
| Goal | Knob |
|---|---|
| Edits feel live | WebSockets, regional shards, small ops, batching tuned |
| Never lose work | WAL before ACK, snapshots, client offline queue |
| Opens stay fast | Snapshots every N revs, CDN for static assets, parallel metadata + snapshot fetch |
| Safe sharing | Server-side ACL on every path; link scope; audit log |
| Survive viral doc | Viewer throttling, dedicated hot shard, op batching |
Step 17 — Close the loop (what to practice)
On a whiteboard: three loops, two users editing one sentence, label WAL vs snapshot vs ACL.
Out loud: five functional requirements and which NFR is hardest for collaboration vs export.
With the technical section: trace one insert op from WebSocket to revision 119 ACK and broadcast.
The one line to remember
Google Docs–class systems are an ordered operation log with a realtime fan-out layer on top. Collaboration needs one revision sequence per document; everything else (search, export, analytics) reads that log, never fights it.