How Google Docs works at scale

Collaborative editing looks like a word processor in the browser. At scale it is a distributed log of tiny edits that many people apply at once—plus permissions, revision history, and export pipelines that must never corrupt the document.

We work through the design in order—requirements first, numbers second, architecture third, APIs last—using a Google Docs–class product as the mental model, not any one company’s private implementation.

What you should be able to do after reading:

Separate the three loops—collaboration, document persistence, and access control.
List functional and non-functional requirements for editors, viewers, and admins.
Walk one keystroke from client op → server transform → broadcast → durable revision.
Contrast OT vs CRDT and explain revision numbers and server authority.
Read the technical section: WebSocket frames, operation batches, and REST metadata APIs.

Step 0 — How we will work through the problem

Ordered thinking beats memorizing boxes. Use this sequence when you design real-time collaborative documents:

Clarify scope. Docs only or Sheets/Slides too? Offline editing? Guest commenters? Enterprise DLP in scope?
Write requirements. Functional = editing, sharing, history. Non-functional = sync latency, durability, never lose edits.
Do napkin math. Active docs, ops per second per doc, revision row growth—so nobody stores the whole internet in one JSON blob.
Draw three loops before naming WebSockets or Spanner.
Tell one story—two users type in the same paragraph—then failure cases (partition, reconnect, conflicting paste).

flowchart TB
  subgraph collab [Collaboration loop]
    C1[Client A ops] --> WS[Realtime gateway]
    C2[Client B ops] --> WS
    WS --> OT[Transform / order]
    OT --> FAN[Broadcast to sessions]
  end
  subgraph doc [Document loop]
    OT --> LOG[Append revision log]
    LOG --> SNAP[Periodic snapshots]
    SNAP --> BLOB[(Object store)]
  end
  subgraph access [Access loop]
    ACL[Permissions service] --> WS
    ACL --> API[REST metadata]
    SHARE[Share links] --> ACL
  end

Step 1 — Functional requirements (editors, viewers, admins)

Actor	Requirement	Why scale makes it hard
Editor	Type, format, insert images/tables, undo/redo	Every keystroke is an op; fan-out to N collaborators
Editor	See others’ cursors and selections in near real time	Presence channel separate from document ops
Viewer	Read-only open; copy allowed per policy	Same render path without accepting ops
Commenter	Comments and @mentions without editing body	Comment anchoring to volatile positions
Suggest mode	Proposed edits require owner accept/reject	Branching suggestion layer on top of canonical doc
Owner	Share by user/group/link; transfer ownership	ACL evaluation on every op and API call
All	Version history; restore named revision	Compaction vs infinite op log
All	Export PDF, DOCX, etc.	Async render farm reads snapshot + tail ops
Admin	Org policies, retention, legal hold, audit	eDiscovery exports across millions of files

Functional details worth stating clearly

Operations are the source of truth, not the HTML DOM. The server stores an ordered sequence of ops (or periodic snapshots + tail).

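Undo is local in intent but global in effect. A local undo stack inverts the user's own recent ops and submits the inverse as a new op; the server may still transform or reject it if the revision has moved.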

Out of scope today (say it aloud). Building a full layout engine from scratch, real-time video co-editing, or on-prem blockchain audit—park them.

Step 2 — Non-functional requirements (engineering promises)

Category	Target (typical)	How we meet it	If we miss it
Latency — remote edit visible	p95 < 200 ms same region	WebSocket, regional doc shards, small op payloads	Feels like email, not “live”
Latency — open doc	p95 < 2 s cold start	Latest snapshot + tail replay	Users think tab crashed
Durability	No acknowledged op lost	Write-ahead log before ACK to client	Trust destroyed forever
Consistency	Total order of ops per document	Single sequencer per doc shard	Garbled text, divergent forks
Availability	99.9%+ edit path monthly	Replica gateways, doc shard failover	Classroom / deal room stops
Scale — hot doc	100+ viewers; 10–20 concurrent editors	Op batching, read-only fan-out path	Gateway meltdown on viral doc
Scale — corpus	Billions of files	Shard by `doc_id`, tier cold storage	One DB owns everything
Security	Least privilege per op	ACL check at gateway + storage	Link leak edits entire company drive

Key idea: Collaboration wants a single authoritative order per document. Analytics and search can be eventual; the edit log cannot.

Step 3 — Napkin math (why one JSON file is not enough)

~2B+ monthly active Google Workspace users (ecosystem scale—not all editing at once).
Suppose 500M docs receive at least one edit per day; average 50 ops per active doc per day → 25B ops/day ≈ 300k ops/s globally (peaks higher in business hours).
Each op might be 50–500 bytes of JSON → roughly 1–12 TB of raw log per day before compaction (25B ops × 50–500 bytes), so snapshots and garbage collection are mandatory.
A viral doc with 200 viewers and 20 editors could see 100+ ops/s—one hot shard; needs batching and possibly dedicated collaboration host.
Images in docs: multi-MB objects in blob store; ops only reference image_id.
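
The arithmetic above is easy to re-run with your own inputs. A minimal sketch, using only the figures quoted in this step (500M active docs, 50 ops each, 50–500 bytes per op); the variable names are illustrative, not part of any real system:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

active_docs_per_day = 500_000_000  # docs with at least one edit per day (from the text)
ops_per_active_doc = 50            # average ops per active doc per day (from the text)

ops_per_day = active_docs_per_day * ops_per_active_doc
ops_per_second = ops_per_day / SECONDS_PER_DAY  # global average, not the peak


def log_bytes_per_day(bytes_per_op: int) -> int:
    """Raw log volume per day for a given average op size, before compaction."""
    return ops_per_day * bytes_per_op


TB = 10**12
low_tb = log_bytes_per_day(50) / TB    # small ops
high_tb = log_bytes_per_day(500) / TB  # large ops

print(f"{ops_per_day:,} ops/day, about {ops_per_second:,.0f} ops/s on average")
print(f"raw log: {low_tb:.2f} to {high_tb:.1f} TB/day before compaction")
```

Swap in your own product's numbers; the point is that the log grows by terabytes per day, so compaction is a requirement and not an optimization.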

Step 4 — Architecture: three loops

Browser clients talk to an edge API for metadata (title, permissions) and a realtime gateway for ops. Each document maps to a shard with a sequencer that assigns monotonic revision numbers. The document loop appends to a write-ahead log, compacts into snapshots, and stores large blobs separately.

flowchart TB
  subgraph clients [Clients]
    WEB[Browser editor]
    MOB[Mobile app]
  end
  subgraph edge [Edge]
    LB[Load balancer]
    META[Metadata API]
    RT[Realtime gateway]
  end
  subgraph core [Document shard]
    SEQ[Sequencer]
    OT[OT / CRDT engine]
    WAL[(Revision log)]
    SNAP[Snapshot builder]
  end
  subgraph stores [Stores]
    SQL[(Doc metadata DB)]
    OBJ[(Blob store)]
    SRCH[(Optional index)]
  end
  WEB --> LB
  MOB --> LB
  LB --> META
  LB --> RT
  META --> SQL
  RT --> SEQ --> OT --> WAL
  OT --> SNAP --> OBJ
  WAL --> OBJ

Step 5 — Walk one edit end to end

Two users edit the same paragraph. User A inserts “Hello” at index 42.

Client A generates op {type: insert, index: 42, text: "Hello", client_rev: 118} and sends over WebSocket.
Gateway authenticates session, checks ACL role >= editor, routes to document shard doc_7xk.
Sequencer receives op, transforms against any concurrent ops since rev 118 (OT) or merges (CRDT), assigns server_rev: 119.
WAL persists op 119 durably (quorum write) before ACK to clients.
Fan-out pushes transformed op to all subscribed sessions (A, B, …).
Client B applies op 119 locally; UI updates text and shifts B’s pending cursor indices.
Presence (parallel channel) may show A’s cursor near index 42 without blocking the critical path.
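
The steps above can be sketched as a toy in-memory shard. This is an illustration under loud assumptions: `DocShard`, `InsertOp`, and `submit` are hypothetical names, only inserts are modeled, the "WAL" is a Python list rather than a quorum write, and the transform is the simplest possible index shift.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InsertOp:
    index: int
    text: str
    client_rev: int  # last server revision the client had applied


@dataclass
class DocShard:
    head_rev: int = 0
    wal: list = field(default_factory=list)          # (rev, op) pairs; stand-in for a durable log
    subscribers: list = field(default_factory=list)  # callbacks; stand-ins for sessions

    def submit(self, op: InsertOp) -> int:
        # 1. Transform against every op committed after the client's revision.
        for rev, committed in self.wal:
            if rev > op.client_rev and committed.index <= op.index:
                op = InsertOp(op.index + len(committed.text), op.text, op.client_rev)
        # 2. Assign the next revision number.
        rev = self.head_rev + 1
        # 3. Persist before anyone is told about the op (WAL first).
        self.wal.append((rev, op))
        self.head_rev = rev
        # 4. Fan out the transformed op to every subscribed session.
        for deliver in self.subscribers:
            deliver(rev, op)
        # 5. The return value plays the role of the ACK to the sender.
        return rev


shard = DocShard(head_rev=118)
seen_by_b = []
shard.subscribers.append(lambda rev, op: seen_by_b.append((rev, op.index, op.text)))

ack_a = shard.submit(InsertOp(42, "Hello", client_rev=118))  # User A
ack_b = shard.submit(InsertOp(50, "!", client_rev=118))      # concurrent op, same base rev
print(ack_a, ack_b, seen_by_b)
```

The second op was written against revision 118, so the shard shifts it past the five characters A inserted before committing it.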

sequenceDiagram
  participant A as Client A
  participant G as Gateway
  participant S as Doc shard
  participant B as Client B
  A->>G: op insert @42 (client_rev 118)
  G->>S: authorize + forward
  S->>S: transform + assign rev 119
  S->>S: append WAL
  S-->>A: ACK rev 119
  S-->>B: op 119 transformed

Step 6 — OT vs CRDT: keeping concurrent edits sane

Operational Transformation (OT) — clients send ops against a known revision; the server transforms concurrent ops so everyone converges. Classic Google Docs–era approach: server is authoritative; complex but battle-tested for rich text with tables.

CRDTs — data structures designed so merge is commutative; peers can sync without a central transformer in some designs. Popular in newer editors (Notion-class, Figma-class). Rich text CRDTs exist (Yjs, Automerge) but payload size and complexity differ from OT.

Approach	Strength	Cost
OT + central server	Strong ordering; easier global undo policy	Server CPU per op; harder to do P2P
CRDT	Offline-friendly; peer sync	Larger states; tombstones; format migration

Either way, clients must handle rebase: while offline, they buffer ops; on reconnect, the server sends the missing revision range (for example 119–140), the client replays it, and then rebases its buffered ops on top.
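
To make the OT idea concrete, here is a minimal sketch of an insert-versus-insert transform and the convergence property it must satisfy. It covers plain-text inserts only; the `site` tie-breaker and all names are assumptions for illustration, and real rich-text OT needs many more op types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    index: int
    text: str
    site: str  # tie-breaker when two inserts target the same index


def apply(doc: str, op: Insert) -> str:
    return doc[: op.index] + op.text + doc[op.index:]


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `against` has already been applied."""
    goes_first = against.index < op.index or (
        against.index == op.index and against.site < op.site
    )
    if goes_first:
        return Insert(op.index + len(against.text), op.text, op.site)
    return op


doc = "Hello world"
a = Insert(5, ",", site="A")   # User A adds a comma
b = Insert(11, "!", site="B")  # User B adds an exclamation mark concurrently

# Both application orders must produce the same document.
a_then_b = apply(apply(doc, a), transform(b, a))
b_then_a = apply(apply(doc, b), transform(a, b))
print(a_then_b, "|", b_then_a)
```

If the two orders ever disagree, replicas diverge; that is the property a transform test suite has to guard.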

Step 7 — Realtime transport: WebSockets and session stickiness

Use WebSockets (or HTTP/2 streams) for bidirectional op traffic. On initial open, the client fetches the snapshot and the tail over HTTPS (GET snapshot + GET revisions?from=…), then opens the WebSocket for live ops.

Stickiness — route doc_id to the same gateway shard when possible to preserve ordering buffers.
Heartbeat — detect dead tabs; flush presence after TTL.
Backpressure — if client cannot apply ops fast enough, pause ACK and shed non-editors from fan-out.
Binary frames — protobuf or msgpack for ops at scale; JSON acceptable for teaching diagrams.
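
The heartbeat rule can be sketched as a small tracker that reaps sessions whose last heartbeat is older than a TTL. `HeartbeatTracker` is a hypothetical name, and the clock is passed in explicitly so the behavior is easy to test; a real gateway would use its own timer and connection state.

```python
class HeartbeatTracker:
    """Tracks the last heartbeat per session and reaps the ones that went quiet."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.last_seen = {}  # session_id -> timestamp of last heartbeat

    def beat(self, session_id: str, now: float) -> None:
        self.last_seen[session_id] = now

    def reap(self, now: float) -> list:
        """Remove and return sessions with no heartbeat within the TTL."""
        dead = [s for s, seen in self.last_seen.items() if now - seen > self.ttl]
        for session_id in dead:
            del self.last_seen[session_id]  # presence for this session is flushed here
        return dead


tracker = HeartbeatTracker(ttl_seconds=30)
tracker.beat("session-a", now=0)
tracker.beat("session-b", now=20)
dead_sessions = tracker.reap(now=40)  # session-a has been silent for 40 s
print(dead_sessions, sorted(tracker.last_seen))
```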

Step 8 — Document model: revisions, snapshots, compaction

Store an append-only revision log:

revisions(doc_id, rev, op_payload, author_id, ts)
snapshots(doc_id, rev, snapshot_blob_ref, created_at)

Snapshot policy: every N revisions or M minutes, materialize the full document state to object storage; new clients load the latest snapshot and replay only the tail. Compaction archives revisions that are already covered by a snapshot and fall outside the legal retention window.

Large docs may split into segments (tabs, huge tables) each with its own sub-log to avoid one infinite hot row.
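
Loading "latest snapshot + tail" can be sketched in a few lines. This is a toy model: documents are plain strings, ops are dicts with hypothetical `insert` and `delete` shapes, and the default thresholds in `snapshot_due` (100 revisions, 300 seconds) are placeholders for the N and M in the text.

```python
def apply_op(text: str, op: dict) -> str:
    if op["op"] == "insert":
        return text[: op["index"]] + op["text"] + text[op["index"]:]
    if op["op"] == "delete":
        return text[: op["index"]] + text[op["index"] + op["length"]:]
    raise ValueError(f"unknown op type: {op['op']}")


def load_document(snapshots, revisions):
    """snapshots: (rev, text) pairs; revisions: (rev, op) pairs. Returns (head_rev, text)."""
    snap_rev, text = max(snapshots, key=lambda s: s[0], default=(0, ""))
    head = snap_rev
    for rev, op in sorted(revisions, key=lambda r: r[0]):
        if rev > snap_rev:  # replay only the tail after the snapshot
            text = apply_op(text, op)
            head = rev
    return head, text


def snapshot_due(head_rev, last_snapshot_rev, seconds_since_snapshot,
                 every_n_revs=100, every_seconds=300):
    """True when either the revision count or the time threshold is reached."""
    return (head_rev - last_snapshot_rev >= every_n_revs
            or seconds_since_snapshot >= every_seconds)


snapshots = [(50, "He"), (100, "Hello")]
revisions = [
    (99, {"op": "insert", "index": 0, "text": "zzz"}),  # already inside snapshot 100
    (101, {"op": "insert", "index": 5, "text": " world"}),
    (102, {"op": "delete", "index": 0, "length": 1}),
]
head_rev, text = load_document(snapshots, revisions)
print(head_rev, repr(text))
```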

Step 9 — Permissions, sharing, and org policies

Roles — owner, editor, commenter, viewer; mapped from users, groups, domain-wide links.
Link types — restricted, anyone with link, public on web; enforced at gateway before sequencer.
Inheritance — folder ACLs in Drive-class products; changes propagate via event bus to doc ACL cache.
Enterprise — DLP rules (block external share), retention locks, download disabled for viewers.

Cache effective ACL in Redis with short TTL; invalidate on permission.changed events—never trust client-side checks alone.
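
A minimal sketch of the cached, server-side role check. An in-memory dict stands in for Redis, the `lookup` callable stands in for the permissions service, and `AclCache`, `can`, and the role ranking are illustrative assumptions rather than a real API.

```python
ROLE_RANK = {"viewer": 0, "commenter": 1, "editor": 2, "owner": 3}


def can(role, required: str) -> bool:
    """True if `role` is at least as privileged as `required`; None means no access."""
    return role is not None and ROLE_RANK[role] >= ROLE_RANK[required]


class AclCache:
    def __init__(self, lookup, ttl_seconds: float):
        self._lookup = lookup  # callable (doc_id, user_id) -> role or None
        self._ttl = ttl_seconds
        self._entries = {}     # (doc_id, user_id) -> (role, expires_at)

    def role(self, doc_id: str, user_id: str, now: float):
        key = (doc_id, user_id)
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]  # fresh cache entry
        role = self._lookup(doc_id, user_id)
        self._entries[key] = (role, now + self._ttl)
        return role

    def invalidate(self, doc_id: str) -> None:
        """Call this when a permission.changed event arrives for the document."""
        for key in [k for k in self._entries if k[0] == doc_id]:
            del self._entries[key]


grants = {("doc_7xk", "alice"): "editor"}
cache = AclCache(lambda d, u: grants.get((d, u)), ttl_seconds=30)

before = can(cache.role("doc_7xk", "alice", now=0), "editor")
grants[("doc_7xk", "alice")] = "viewer"  # owner downgrades alice
stale = cache.role("doc_7xk", "alice", now=10)  # still cached within the TTL
cache.invalidate("doc_7xk")
after = can(cache.role("doc_7xk", "alice", now=11), "editor")
print(before, stale, after)
```

The stale read in the middle is why the TTL should be short and why invalidation events matter: without them a downgraded user keeps editing until the entry expires.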

Step 10 — Comments, suggestions, and presence

Comments anchor to stable identifiers (paragraph id + offset) not raw indices that shift every keystroke. Suggestions store proposed ops in a side branch; accept merges into canonical log; reject discards branch.

Presence — ephemeral data: cursor color, selection range, “User is typing…” — Redis or in-memory on gateway; lossy is OK.
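
To see why a (paragraph id + offset) anchor is more stable than a raw document index, consider this small sketch. `Anchor` and `shift_for_insert` are hypothetical names, and only inserts are handled; deletes that swallow the anchored text need extra policy that is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    paragraph_id: str
    offset: int  # character offset inside that paragraph


def shift_for_insert(anchor: Anchor, paragraph_id: str, insert_offset: int, length: int) -> Anchor:
    """Adjust a comment anchor after `length` characters are inserted in a paragraph."""
    if paragraph_id != anchor.paragraph_id:
        return anchor  # edits in other paragraphs never move this anchor
    if insert_offset > anchor.offset:
        return anchor  # insert happened after the anchored position
    return Anchor(anchor.paragraph_id, anchor.offset + length)


comment = Anchor("p-12", 8)
after_other_paragraph = shift_for_insert(comment, "p-03", 0, 500)  # big paste earlier in the doc
after_same_paragraph = shift_for_insert(comment, "p-12", 2, 5)     # typing before the comment
print(after_other_paragraph, after_same_paragraph)
```

With raw indices, the 500-character paste in another paragraph would have forced every comment after it to be rewritten.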

Step 11 — Offline, reconnect, and conflict UX

Client keeps local op queue + last known server_rev.
On reconnect, send catch_up from=server_rev+1; server streams missing ops.
Client rebases pending ops against incoming transforms; if impossible, show “copy your changes” modal.
IndexedDB holds snapshot for read-only offline; sync when online returns.
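
The reconnect steps can be sketched as a toy client that queues local inserts and rebases them when missed server ops arrive. Assumptions: inserts only, the server op wins ties at the same index, and the `catch_up` message shape is hypothetical. The "copy your changes" fallback for unrebasable ops is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ins:
    index: int
    text: str


def insert(text: str, op: Ins) -> str:
    return text[: op.index] + op.text + text[op.index:]


class OfflineClient:
    def __init__(self, text: str, server_rev: int):
        self.text = text          # local view, including unacknowledged edits
        self.server_rev = server_rev
        self.pending = []         # local ops the server has not seen yet

    def local_insert(self, index: int, text: str) -> None:
        op = Ins(index, text)
        self.text = insert(self.text, op)
        self.pending.append(op)

    def catch_up_request(self) -> dict:
        return {"type": "catch_up", "from": self.server_rev + 1}

    def receive(self, rev: int, op: Ins) -> None:
        """Apply a server op committed while offline and rebase the pending queue."""
        incoming, rebased = op, []
        for p in self.pending:
            # Pending op shifts right if the server op landed at or before it.
            new_p = Ins(p.index + len(incoming.text), p.text) if incoming.index <= p.index else p
            # Server op shifts right if the pending op sits strictly before it.
            if p.index < incoming.index:
                incoming = Ins(incoming.index + len(p.text), incoming.text)
            rebased.append(new_p)
        self.pending = rebased
        self.text = insert(self.text, incoming)
        self.server_rev = rev


client = OfflineClient("abc", server_rev=118)
client.local_insert(1, "X")          # typed while offline; local text is "aXbc"
request = client.catch_up_request()
client.receive(119, Ins(0, "Z"))     # server text became "Zabc"
client.receive(120, Ins(4, "!"))     # server text became "Zabc!"
print(request, client.text, client.pending)
```

After catch-up, sending the rebased pending op to the server (which holds "Zabc!") yields the same text the client already shows.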

Step 12 — Export, import, and side pipelines

Export is async: a job reads the snapshot plus tail ops and renders the target format (PDF via headless render, DOCX via converter service). Import parses an uploaded file into an initial snapshot and records its provenance. Virus scanning and content-policy checks run on uploads before they become a collaborative doc.

flowchart LR
  DOC[Canonical doc state] --> JOB[Export worker]
  JOB --> PDF[PDF]
  JOB --> DOCX[DOCX]
  UP[Upload] --> PARSE[Import parser] --> DOC

Step 13 — Scale: hot documents and sharding

Shard key — doc_id hashes to collaboration host; metadata may live in separate global DB.
Read-heavy viral doc — split viewers onto read-only fan-out that receives throttled updates (e.g. 5 Hz) while editors stay full rate.
Op batching — coalesce keystrokes within 20–50 ms windows to cut WAL writes (trade latency for throughput).
Cold docs — move tail logs to cheaper storage; first open triggers warm-up from snapshot only.
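
Op batching can be sketched as a function that groups keystrokes into fixed time windows and merges adjacent inserts. The 30 ms default sits inside the 20–50 ms range mentioned above; `batch_ops` and the op dict shape are assumptions for illustration, and only inserts are handled.

```python
def batch_ops(timed_ops, window_ms: int = 30):
    """Group (timestamp_ms, op) pairs into batches and merge contiguous inserts.

    Each op is a dict with "index" and "text". Input must be sorted by timestamp.
    """
    batches = []
    window_start = None
    for ts, op in timed_ops:
        if window_start is None or ts - window_start >= window_ms:
            batches.append([])  # start a new window
            window_start = ts
        batch = batches[-1]
        last = batch[-1] if batch else None
        if last is not None and last["index"] + len(last["text"]) == op["index"]:
            last["text"] += op["text"]  # typing continues right after the previous insert
        else:
            batch.append(dict(op))      # copy so the caller's op is never mutated
    return batches


keystrokes = [
    (0, {"index": 42, "text": "H"}),
    (10, {"index": 43, "text": "i"}),
    (20, {"index": 44, "text": "!"}),
    (100, {"index": 7, "text": "x"}),
]
batches = batch_ops(keystrokes)
print(batches)
```

Four keystrokes become two WAL writes here. The cost is that the first character in each window waits up to one window length before it is sent.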

Step 14 — Technical layer: APIs and wire formats

Operation	HTTP / WS	Success	Notes
Get metadata	`GET /v1/documents/{id}`	`200`	Title, owners, mime, revision head
Get snapshot	`GET /v1/documents/{id}/snapshot`	`200`	Binary or JSON model at `rev`
List revisions	`GET /v1/documents/{id}/revisions?from=100&to=150`	`200`	Paginated op log for catch-up
Submit ops (fallback)	`POST /v1/documents/{id}/operations`	`200` + new head rev	When WebSocket unavailable
Live channel	`WSS /v1/documents/{id}/channel`	Bi-directional	Ops, ACKs, presence events
Export	`POST /v1/documents/{id}/exports`	`202` + job id	Poll `GET …/exports/{job}` for URL

WebSocket message (illustrative JSON):

{
  "type": "op_batch",
  "doc_id": "doc_7xk",
  "client_rev": 118,
  "ops": [
    {"op": "insert", "index": 42, "text": "Hello", "client_op_id": "c-op-9"}
  ]
}

Server → client reply (illustrative JSON):
{
  "type": "ack",
  "server_rev": 119,
  "transformed_ops": [ … ],
  "head_rev": 119
}
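
A gateway would validate each frame before routing it. The sketch below parses the `op_batch` shape shown above and builds a matching `ack`; the function names and the validation rules (inserts only, non-negative index) are assumptions, not a documented protocol.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InsertOp:
    index: int
    text: str
    client_op_id: str


@dataclass(frozen=True)
class OpBatch:
    doc_id: str
    client_rev: int
    ops: tuple


def parse_op_batch(raw: str) -> OpBatch:
    """Parse and validate one op_batch frame; raises ValueError on bad input."""
    msg = json.loads(raw)
    if msg.get("type") != "op_batch":
        raise ValueError("not an op_batch message")
    ops = []
    for item in msg["ops"]:
        if item["op"] != "insert":
            raise ValueError(f"unsupported op in this sketch: {item['op']}")
        if item["index"] < 0:
            raise ValueError("index must be non-negative")
        ops.append(InsertOp(item["index"], item["text"], item["client_op_id"]))
    return OpBatch(msg["doc_id"], int(msg["client_rev"]), tuple(ops))


def build_ack(server_rev: int, transformed_ops: list) -> dict:
    return {
        "type": "ack",
        "server_rev": server_rev,
        "transformed_ops": transformed_ops,
        "head_rev": server_rev,
    }


frame = json.dumps({
    "type": "op_batch",
    "doc_id": "doc_7xk",
    "client_rev": 118,
    "ops": [{"op": "insert", "index": 42, "text": "Hello", "client_op_id": "c-op-9"}],
})
batch = parse_op_batch(frame)
ack = build_ack(119, [{"op": "insert", "index": 42, "text": "Hello"}])
print(batch, ack)
```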

Logical tables

documents(id, owner_id, title, head_rev, snapshot_ref, …)
revisions(doc_id, rev, op_json, user_id, ts)
sessions(doc_id, session_id, user_id, gateway_id, last_rev)
acl(doc_id, principal, role)
export_jobs(id, doc_id, format, status, output_url)
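
As a rough illustration, three of these tables can be tried out in SQLite. The column types, the composite primary key, and the `append_revision` helper are assumptions layered on the logical schema above; a production store would be a sharded database, not SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id TEXT PRIMARY KEY,
    owner_id TEXT NOT NULL,
    title TEXT NOT NULL,
    head_rev INTEGER NOT NULL DEFAULT 0,
    snapshot_ref TEXT
);
CREATE TABLE revisions (
    doc_id TEXT NOT NULL,
    rev INTEGER NOT NULL,
    op_json TEXT NOT NULL,
    user_id TEXT NOT NULL,
    ts TEXT NOT NULL,
    PRIMARY KEY (doc_id, rev)  -- one op per revision number per document
);
CREATE TABLE acl (
    doc_id TEXT NOT NULL,
    principal TEXT NOT NULL,
    role TEXT NOT NULL CHECK (role IN ('owner', 'editor', 'commenter', 'viewer')),
    PRIMARY KEY (doc_id, principal)
);
""")


def append_revision(conn, doc_id: str, op_json: str, user_id: str, ts: str) -> int:
    """Append one op and advance head_rev in a single transaction."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT head_rev FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        if row is None:
            raise KeyError(doc_id)
        rev = row[0] + 1
        conn.execute(
            "INSERT INTO revisions (doc_id, rev, op_json, user_id, ts) VALUES (?, ?, ?, ?, ?)",
            (doc_id, rev, op_json, user_id, ts),
        )
        conn.execute("UPDATE documents SET head_rev = ? WHERE id = ?", (rev, doc_id))
    return rev


with conn:
    conn.execute(
        "INSERT INTO documents (id, owner_id, title) VALUES (?, ?, ?)",
        ("doc_7xk", "alice", "Design notes"),
    )
first_rev = append_revision(conn, "doc_7xk", '{"op": "insert"}', "alice", "2024-01-01T00:00:00Z")
second_rev = append_revision(conn, "doc_7xk", '{"op": "insert"}', "bob", "2024-01-01T00:00:01Z")
print(first_rev, second_rev)
```

The `(doc_id, rev)` primary key is what turns "total order per document" into something the storage layer enforces: a second writer that tries to reuse a revision number fails instead of forking the log.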

Step 15 — Reliability, observability, and failure modes

Failure modes

ACK without durable write — never; WAL first.
Split brain on two sequencers — use leader election per doc shard; fencing tokens.
Transform bug — feature flag new OT rules; replay test suite on production logs in shadow mode.
Runaway paste — max op size; rate limit inserts per minute per user.
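
Fencing tokens are the least obvious item in this list, so here is a minimal sketch. The idea: each newly elected sequencer gets a strictly larger token, and the log refuses writes that carry an older one. `FencedLog` and `StaleLeaderError` are hypothetical names; the election itself is out of scope.

```python
class StaleLeaderError(Exception):
    """Raised when a deposed sequencer tries to write with an old fencing token."""


class FencedLog:
    def __init__(self):
        self.highest_token = 0
        self.entries = []  # (rev, op) pairs

    def append(self, token: int, rev: int, op: str) -> None:
        if token < self.highest_token:
            raise StaleLeaderError(f"token {token} is older than {self.highest_token}")
        self.highest_token = token
        self.entries.append((rev, op))


log = FencedLog()
log.append(token=1, rev=119, op="insert Hello")  # original leader
log.append(token=2, rev=120, op="insert !")      # new leader after failover

try:
    log.append(token=1, rev=120, op="late write from old leader")
    old_leader_blocked = False
except StaleLeaderError:
    old_leader_blocked = True
print(old_leader_blocked, log.entries)
```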

Observability

Trace: open doc → snapshot load → catch-up → first op ACK.
Metrics: op/s per doc, WAL lag, transform latency p95, reconnect rate, export queue depth.
SLO example: 99.9% of ops ACK within 500 ms regional; zero lost acknowledged ops per quarter.
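
Checking the example SLO against measured ACK latencies is a two-function job. The sample data below is made up for illustration, and the percentile uses the nearest-rank method; real monitoring systems usually compute this from histograms.

```python
def fraction_within(latencies_ms, budget_ms: float) -> float:
    """Share of samples that finished within the latency budget."""
    return sum(1 for x in latencies_ms if x <= budget_ms) / len(latencies_ms)


def percentile(latencies_ms, p: int):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(latencies_ms)
    rank = (p * len(ordered) + 99) // 100  # ceiling of p * n / 100
    return ordered[rank - 1]


# Made-up sample: 990 fast ACKs and 10 slow ones.
samples = [40] * 990 + [700] * 10

within_budget = fraction_within(samples, budget_ms=500)
p95 = percentile(samples, 95)
slo_met = within_budget >= 0.999
print(within_budget, p95, slo_met)
```

Note how a healthy p95 can coexist with a missed SLO: the tail beyond p99 is what the 99.9% target measures.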

Step 16 — Goals → knobs (quick reference)

Goal	Knob
Edits feel live	WebSockets, regional shards, small ops, batching tuned
Never lose work	WAL before ACK, snapshots, client offline queue
Opens stay fast	Snapshots every N revs, CDN for static assets, parallel metadata + snapshot fetch
Safe sharing	Server-side ACL on every path; link scope; audit log
Survive viral doc	Viewer throttling, dedicated hot shard, op batching

Step 17 — Close the loop (what to practice)

On a whiteboard: three loops, two users editing one sentence, label WAL vs snapshot vs ACL.

Out loud: five functional requirements and which NFR is hardest for collaboration vs export.

With the technical section: trace one insert op from WebSocket to revision 119 ACK and broadcast.

The one line to remember

Google Docs–class systems are an ordered operation log with a realtime fan-out layer on top. Collaboration needs one revision sequence per document; everything else (search, export, analytics) reads that log and never fights it.