How the ChatGPT app works at scale
A consumer chat product looks like “type a question, get an answer.” Behind that sentence are three engineering planes: an experience plane (apps that stream partial text), an orchestration plane (auth, history, tools, policy), and an inference plane (GPU fleets that turn tokens into language under strict latency and cost budgets).
We build the design in order—requirements first, numbers second, architecture third, APIs last—using a ChatGPT-class product as the mental model, not any one vendor’s private implementation.
What you should be able to do after reading:
- Separate the three planes and say which store or service belongs on each.
- List functional and non-functional requirements with realistic TTFT and cost targets.
- Walk one chat turn: load history → assemble prompt → prefill → decode stream → persist → bill.
- Explain KV cache, continuous batching, and why tool calls are not “just another HTTP hop.”
- Read the technical section: Chat Completions, SSE chunks, conversation ids, and gateway routing.
Step 0 — How we will work through the problem
Ordered thinking beats memorizing a box diagram. Use this sequence when you design a large-language-model chat product:
- Clarify scope. Text only or multimodal? Consumer or enterprise with tenant isolation? Tools and RAG in scope?
- Write requirements. Functional = what users see. Non-functional = latency to first token, availability, privacy, unit economics.
- Do napkin math. Daily active users, messages per day, tokens per turn, GPU memory per request—so nobody assumes one GPU serves the planet.
- Draw three planes before naming vLLM, Pinecone, or Redis.
- Tell one story—a normal turn, then a tool turn, then a retry after a provider timeout.
flowchart LR
subgraph exp [Experience plane]
UI[Web / mobile / desktop]
end
subgraph orch [Orchestration plane]
BFF[BFF + auth]
OR[History tools policy]
end
subgraph inf [Inference plane]
GW[Model gateway]
GPU[GPU workers]
end
UI --> BFF --> OR --> GW --> GPU
OR --> PG[(Conversation store)]
OR --> VDB[(Vector index)]
Step 1 — Functional requirements (what users need)
Functional requirements describe behavior the product must ship. Missing one is a product bug, not a tuning exercise.
| Area | Requirement | Why scale makes it hard |
|---|---|---|
| Chat | Multi-turn conversation with streaming replies | Every turn reloads context; long threads blow the context window |
| History | List, rename, delete conversations; resume on any device | Durable store + hot cache; concurrent tabs |
| Models | Pick capability tier (fast vs capable) | Different GPU pools, routing, and price per token |
| Files | Upload PDFs, images, spreadsheets for Q&A | Parsing, chunking, and vision encoders are separate heavy paths |
| Tools | Browsing, code execution, plugins, custom GPT actions | Multi-step loops; each tool adds latency and failure modes |
| Memory (product) | Optional “remember this about me” across chats | Needs consent, deletion, and retrieval guardrails |
| Account | Sign-in, subscription tier, usage limits | Rate limits and billing must be consistent globally |
| Voice (optional) | Speech in / speech out | ASR + TTS pipelines; not the same SLO as text |
Functional details worth stating clearly
Idempotent turns. Clients send a request_id or idempotency key so retries after network blips do not double-charge or double-post assistant messages.
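A minimal sketch of the idempotent-turn rule, assuming an in-memory dictionary stands in for the durable dedupe table. `TurnStore` and `handle_turn` are hypothetical names used only for illustration, and the "model call" is a placeholder string.

```python
from dataclasses import dataclass, field


@dataclass
class TurnStore:
    """In-memory stand-in for a durable table keyed by request_id (assumption)."""
    results: dict = field(default_factory=dict)
    charged_tokens: int = 0

    def handle_turn(self, request_id: str, user_text: str) -> str:
        # A retry with the same request_id returns the stored reply
        # instead of generating, persisting, and billing a second time.
        if request_id in self.results:
            return self.results[request_id]
        reply = f"echo: {user_text}"  # placeholder for the real model call
        self.charged_tokens += len(user_text.split())  # toy usage metering
        self.results[request_id] = reply
        return reply


store = TurnStore()
first = store.handle_turn("req-1", "hello there")
retry = store.handle_turn("req-1", "hello there")  # network retry, same key
```

The retry returns the same reply and the usage counter moves only once, which is the property the client-supplied key is meant to buy.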
Tool loops are bounded. Orchestration caps steps (for example max 10 tool rounds) so a confused model cannot fork infinite jobs.
Out of scope today (say it aloud). Training the foundation model from scratch, federated on-device training, or real-time video generation—park them so the design stays focused.
Step 2 — Non-functional requirements (engineering promises)
| Category | Target (typical) | How we meet it | If we miss it |
|---|---|---|---|
| Latency — TTFT | p95 < 1–3 s to first token (model dependent) | Warm GPU pools, short queues, prompt compression | Feels “broken” even if total answer is fine |
| Latency — streaming | Steady token cadence (no 5 s gaps) | SSE flush, backpressure, avoid blocking on tools mid-stream | Users abandon mid-answer |
| Availability | 99.9%+ for chat API monthly | Multi-region gateways, model fallbacks | Global outage headlines |
| Durability | Conversation log rarely lost | Write assistant draft to DB before closing stream | Trust collapse (“it forgot our chat”) |
| Privacy | TLS in transit; retention policy; enterprise isolation | Encryption, tenant-scoped indexes, deletion jobs | Regulatory and PR risk |
| Cost | Revenue per user > inference + storage | Routing to smaller models, caching, batching | Unsustainable burn |
| Safety | Block policy violations before and after generation | Input classifiers + output filters + abuse rate limits | Harmful content at scale |
Key idea: Time-to-first-token is the UX metric users feel; tokens-per-second is the throughput metric finance feels. Optimize both, not one alone.
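To make the two metrics concrete, here is a small sketch that derives TTFT, decode throughput, and the largest inter-token gap from client-side timestamps. The function name and the sample timings are assumptions for illustration, not measurements.

```python
def stream_metrics(request_sent_at: float, token_times: list[float]) -> dict:
    """Compute TTFT and decode throughput from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent_at
    # Throughput is measured over the decode phase only, after the first token.
    decode_span = token_times[-1] - token_times[0]
    tokens_per_s = (len(token_times) - 1) / decode_span if decode_span > 0 else 0.0
    # Largest gap between consecutive tokens: the "no 5 s gaps" check.
    max_gap = max((b - a for a, b in zip(token_times, token_times[1:])), default=0.0)
    return {"ttft_s": ttft, "tokens_per_s": tokens_per_s, "max_gap_s": max_gap}


# Made-up timings: first token at 0.8 s, then one token every 50 ms.
m = stream_metrics(0.0, [0.8, 0.85, 0.9, 0.95, 1.0])
```

Keeping TTFT separate from decode throughput matters because they fail for different reasons: the first is dominated by queueing and prefill, the second by decode capacity.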
Step 3 — Napkin math (why one GPU is not enough)
Round numbers. Multiply in the open—you are showing magnitude, not audited financials.
- ~100M+ weekly active users on a flagship consumer chat product (order of magnitude).
- ~1B user messages per day → ~12k new chat turns per second on average (higher at peak).
- Assume 2k input + 500 output tokens per turn for a capable model → ~2.5k tokens billed per turn.
- Daily token volume ≈ 2.5 trillion tokens/day at that duty cycle—mostly inference, not storage.
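- One H100-class GPU might sustain on the order of hundreds of decode tokens/s for a 70B-class model with batching (highly workload dependent). The ~500B output tokens per day alone average ~5.8M decode tokens/s, so at ~500 tokens/s per GPU the fleet is on the order of ten thousand GPUs before peak load, prefill, redundancy, and multi-model sprawl.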
Storage is cheaper than inference but not free: 500 bytes–2 KB metadata per message row × billions of rows → sharded SQL or NoSQL, tiered to cold storage for old threads.
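The napkin math above can be multiplied out in a few lines. The per-GPU decode rate (500 tokens/s) and the 1 KB row size are assumptions picked from inside the ranges quoted in the text; change them and the fleet size moves proportionally.

```python
SECONDS_PER_DAY = 86_400

messages_per_day = 1_000_000_000      # ~1B user messages/day (from the text)
input_tokens, output_tokens = 2_000, 500
decode_tps_per_gpu = 500              # assumption: "hundreds" of decode tokens/s
bytes_per_row = 1_000                 # assumption: inside the 500 B to 2 KB range

turns_per_s = messages_per_day / SECONDS_PER_DAY
tokens_per_day = messages_per_day * (input_tokens + output_tokens)
decode_tokens_per_s = messages_per_day * output_tokens / SECONDS_PER_DAY
# Decode-only GPU count at average load; ignores prefill, peaks, and redundancy.
gpus_for_decode = decode_tokens_per_s / decode_tps_per_gpu
# Two rows per turn (user + assistant) of message metadata.
storage_bytes_per_day = messages_per_day * 2 * bytes_per_row

print(f"{turns_per_s:,.0f} turns/s")
print(f"{tokens_per_day / 1e12:.1f} trillion tokens/day")
print(f"{gpus_for_decode:,.0f} GPUs for average decode load")
print(f"{storage_bytes_per_day / 1e12:.1f} TB/day of message rows")
```

Under these assumptions the average decode load alone lands above ten thousand GPUs, while message rows add roughly 2 TB per day: inference dominates, storage does not vanish.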
Step 4 — Architecture: three planes
Draw clients on the left, stateless API replicas in the middle, GPU inference on the right. Under orchestration: Postgres (or similar) for transcripts, Redis for session hot state, object storage for uploads, vector DB for RAG. The inference plane is reached only through a gateway that owns keys, routing, retries, and metering.
flowchart TB
subgraph clients [Clients]
WEB[Web]
IOS[iOS / Android]
API[Third-party API]
end
subgraph edge [Orchestration]
LB[Load balancer]
CHAT[Chat BFF]
ORCH[Orchestrator]
MOD[Safety filters]
end
subgraph data [Data]
PG[(Conversation DB)]
R[("Redis hot context")]
S3[(File uploads)]
VEC[(Embeddings index)]
end
subgraph infer [Inference]
GW[LLM gateway]
PREFILL[Prefill workers]
DECODE[Decode pool]
end
WEB --> LB
IOS --> LB
API --> LB
LB --> CHAT --> ORCH
ORCH --> MOD
ORCH --> PG
ORCH --> R
ORCH --> S3
ORCH --> VEC
ORCH --> GW
GW --> PREFILL
GW --> DECODE
Step 5 — Walk one chat turn end to end
Follow a single user message through the system. Names are illustrative.
- Client sends POST /v1/chat/completions with stream: true, conversation_id, and the new user message.
- BFF validates JWT, checks subscription quota, loads rate-limit token bucket.
- Orchestrator fetches last N turns from Redis; on miss, rebuilds from Postgres. Optionally retrieves RAG chunks from the vector index.
- Prompt builder merges system policy, tool definitions, summaries of older turns, and the fresh user text into a token budget under the model limit.
- Safety — input runs classifiers (policy, PII, jailbreak). Hard block returns a refusal without burning a large GPU job.
- Gateway picks model route (for example gpt-4o vs gpt-4o-mini), enqueues on the inference scheduler.
- GPU worker runs prefill (process entire prompt in parallel) then decode (autoregressive tokens). KV cache stores attention keys/values so each new token does not recompute them for the full prompt.
- Stream emits SSE events: delta.content chunks until finish_reason: stop or tool_calls.
- Persist appends user + assistant rows to Postgres; updates Redis; emits usage record (input/output tokens) for billing.
- Safety — output may run async scans on the completed answer; revoke or flag if a late violation is detected.
stateDiagram-v2
[*] --> Auth
Auth --> LoadHistory: OK
LoadHistory --> BuildPrompt
BuildPrompt --> InputSafety
InputSafety --> Blocked: violation
InputSafety --> Inference: pass
Inference --> Streaming
Streaming --> ToolLoop: tool_calls
ToolLoop --> BuildPrompt
Streaming --> Persist: stop
Persist --> [*]
Blocked --> [*]
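The state diagram above can be encoded as a transition table, which makes illegal jumps (for example, streaming before input safety) fail loudly in tests. The event names are assumptions added for illustration; only the state names come from the diagram.

```python
# Transition table mirroring the state diagram; event names are illustrative.
TRANSITIONS = {
    ("Auth", "ok"): "LoadHistory",
    ("LoadHistory", "loaded"): "BuildPrompt",
    ("BuildPrompt", "built"): "InputSafety",
    ("InputSafety", "violation"): "Blocked",
    ("InputSafety", "pass"): "Inference",
    ("Inference", "first_token"): "Streaming",
    ("Streaming", "tool_calls"): "ToolLoop",
    ("ToolLoop", "tool_results"): "BuildPrompt",
    ("Streaming", "stop"): "Persist",
}


def run_turn(events: list[str], start: str = "Auth") -> list[str]:
    """Replay events through the table and return the visited states."""
    state, path = start, [start]
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition from {state} on {event!r}")
        state = TRANSITIONS[key]
        path.append(state)
    return path


normal = run_turn(["ok", "loaded", "built", "pass", "first_token", "stop"])
blocked = run_turn(["ok", "loaded", "built", "violation"])
```

A tool turn simply passes through BuildPrompt twice, which is why each tool round costs another prefill.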
Step 6 — Context, memory, and summarization
Models have a fixed context window (for example 128k tokens). You cannot paste a ten-year transcript every turn. Production systems use a layered memory strategy:
- Hot window: last K turns verbatim in Redis for fast assembly.
- Summary block: older turns collapsed by a smaller model into a few paragraphs stored as a special message row.
- RAG snippets: retrieved chunks inserted with citations, not the entire corpus.
- Product memory: user-approved facts in a separate table with explicit delete and export.
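One way the layers above might be packed into a token budget is sketched below. The priority order (system and new message first, then newest turns, then summary and RAG snippets) is an assumption, as is the whitespace token counter; a real system would use the model's tokenizer and may rank the layers differently.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: real systems use the model's tokenizer (assumption).
    return len(text.split())


def build_prompt(system: str, summary: str, rag_chunks: list[str],
                 recent_turns: list[str], user_text: str, budget: int) -> list[str]:
    """Assemble prompt parts in priority order without exceeding `budget` tokens."""
    # System policy and the new user message are mandatory.
    used = count_tokens(system) + count_tokens(user_text)
    if used > budget:
        raise ValueError("budget too small for mandatory parts")

    # Hot window: keep the newest turns first, walking backwards in time.
    kept_turns: list[str] = []
    for turn in reversed(recent_turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept_turns.insert(0, turn)
        used += cost

    # Summary and RAG snippets fill whatever budget remains.
    extras: list[str] = []
    for part in [summary, *rag_chunks]:
        cost = count_tokens(part)
        if part and used + cost <= budget:
            extras.append(part)
            used += cost

    return [system, *extras, *kept_turns, user_text]


prompt = build_prompt(
    system="be helpful",
    summary="earlier: user likes tea",
    rag_chunks=["doc chunk one"],
    recent_turns=["u: hi", "a: hello", "u: how are you"],
    user_text="tell me more",
    budget=12,
)
```

With a budget of 12 toy tokens the oldest turn is dropped and nothing optional fits, which shows the failure mode to watch for: a tight budget silently discards context unless the summary is given higher priority.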
Concurrency: two tabs appending to the same conversation_id need monotonic message_seq (compare-and-swap or DB transaction) so ordering stays coherent.
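A sketch of the monotonic sequence rule using SQLite from the Python standard library as a stand-in for the conversation store. The table layout is a simplified assumption; the point is that the next sequence number is allocated and written inside one transaction, with a uniqueness constraint as the backstop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        conversation_id TEXT NOT NULL,
        seq INTEGER NOT NULL,
        role TEXT NOT NULL,
        content TEXT NOT NULL,
        UNIQUE (conversation_id, seq)
    )
""")


def append_message(conn: sqlite3.Connection, conversation_id: str,
                   role: str, content: str) -> int:
    """Allocate the next seq and insert the row inside one transaction."""
    with conn:  # commits on success, rolls back on exception
        row = conn.execute(
            "SELECT COALESCE(MAX(seq), 0) + 1 FROM messages WHERE conversation_id = ?",
            (conversation_id,),
        ).fetchone()
        seq = row[0]
        # The UNIQUE constraint rejects a duplicate seq if two writers race;
        # a real service would catch sqlite3.IntegrityError and retry.
        conn.execute(
            "INSERT INTO messages VALUES (?, ?, ?, ?)",
            (conversation_id, seq, role, content),
        )
    return seq


s1 = append_message(conn, "c1", "user", "hi")
s2 = append_message(conn, "c1", "assistant", "hello")
```

On a sharded production database the same idea usually appears as a compare-and-swap on a per-conversation counter; the constraint-plus-retry shape stays the same.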
Step 7 — Inference serving: prefill, decode, KV cache
Transformer inference splits into two phases with different hardware appetites:
- Prefill — process the whole prompt at once; compute-bound; builds the KV cache.
- Decode — generate one token at a time; memory-bandwidth-bound; reuses KV cache.
KV cache stores per-layer key/value tensors for prior tokens so decode steps avoid recomputing attention over the full prefix. Memory grows with sequence length × batch size—long contexts and large batches are why HBM size matters.
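The memory growth described above can be estimated with a one-line formula: two tensors (K and V) per layer, per KV head, per token. The model shape below (80 layers, 8 KV heads, head dimension 128, 2-byte values) is an assumption roughly matching a 70B-class model with grouped-query attention, not a statement about any specific deployed model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_value: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Assumed shape, roughly a 70B-class model with grouped-query attention.
per_token = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=1, batch=1)
one_request = kv_cache_bytes(80, 8, 128, seq_len=8_192, batch=1)
full_batch = kv_cache_bytes(80, 8, 128, seq_len=8_192, batch=32)

print(per_token / 1024, "KiB per token")
print(one_request / 2**30, "GiB for one 8k-token request")
print(full_batch / 2**30, "GiB for a batch of 32")
```

Under these assumptions a batch of 32 requests at 8k tokens each needs about 80 GiB for the cache alone, before model weights, which is why context length and batch size trade off directly against HBM capacity.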
Continuous batching (vLLM, TensorRT-LLM, similar) dynamically adds and removes requests in a batch instead of waiting for every stream to finish—raising GPU utilization without sacrificing streaming UX.
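A toy simulation, not how vLLM or TensorRT-LLM are implemented, that counts decode steps under static and continuous batching. It assumes every request needs a known, positive number of output tokens and that each step emits one token per running request.

```python
from collections import deque


def static_batching_steps(lengths: list[int], slots: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths: list[int], slots: int) -> int:
    """Freed slots are refilled from the queue after every decode step."""
    waiting = deque(lengths)
    running: list[int] = []  # remaining tokens per in-flight request
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots before the next step.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decode step emits one token for every running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [100, 5, 5, 5, 100, 5, 5, 5]  # output tokens per request
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
```

With this mix of short and long answers the static scheduler needs 200 steps and the continuous one 105, because short requests no longer hold slots hostage until the longest stream in their batch ends.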
Sanity check: If p95 TTFT spikes but decode tokens/s is flat, suspect queueing or prefill saturation—not “the model got dumber.”
Step 8 — Model gateway and routing
Every internal service should talk to a gateway, not directly to fifteen vendor endpoints. The gateway owns:
- Routing — map product SKU to model id; fall back from primary to secondary on timeout.
- Retries — idempotent for safe reads; careful on partial streams (do not double-send visible tokens to the client).
- Quotas — per user, per org, per API key; shed load with 429 before GPUs melt.
- Observability — one trace id from BFF through gateway to GPU pod.
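Speculative routing (often called a model cascade): try a small model first for classification or a draft answer, and escalate only when its confidence is low. Easy prompts get cheaper; the risk is a quality drop when the small model is confidently wrong.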
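The routing and fallback behaviour in the gateway list above might look like the sketch below. `ROUTES`, `ModelTimeout`, and the `call_model` callable are hypothetical; the model ids are the illustrative ones used elsewhere in this walkthrough.

```python
class ModelTimeout(Exception):
    """Raised by a backend that did not answer in time (hypothetical)."""


ROUTES = {  # product SKU -> ordered model ids to try; illustrative
    "fast": ["gpt-4o-mini"],
    "capable": ["gpt-4o", "gpt-4o-mini"],
}


def route_completion(sku: str, prompt: str, call_model) -> tuple[str, str]:
    """Try each model for the SKU in order, falling back on timeout.

    Fallback is only safe before any token has been streamed to the client.
    """
    last_error = None
    for model_id in ROUTES[sku]:
        try:
            return model_id, call_model(model_id, prompt)
        except ModelTimeout as exc:
            last_error = exc  # try the next model in the list
    raise RuntimeError(f"all routes failed for sku={sku!r}") from last_error


def flaky_backend(model_id: str, prompt: str) -> str:
    # Simulated backend: the primary model always times out.
    if model_id == "gpt-4o":
        raise ModelTimeout(model_id)
    return f"[{model_id}] reply to: {prompt}"


used_model, text = route_completion("capable", "hi", flaky_backend)
```

Returning the model id alongside the text lets the gateway meter the request against the model that actually served it rather than the one that was requested.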
Step 9 — RAG, tools, and agent loops
RAG (retrieval-augmented generation) grounds answers in your documents: ingest → chunk → embed → index → retrieve top-k at query time → inject into prompt. The data platform (ingestion workers, embedding jobs) is off the hot path but must stay fresh; stale indexes silently lie.
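The query-time half of that pipeline (retrieve top-k, inject into prompt) can be sketched with brute-force cosine similarity. The three-dimensional vectors, the `tenant_id` field, and the chunk ids are toy assumptions; a production system would call an embedding model and an approximate nearest-neighbour index instead.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(index: list[dict], query_vec: list[float],
             tenant_id: str, k: int = 2) -> list[dict]:
    """Brute-force top-k by cosine similarity, restricted to one tenant."""
    candidates = [c for c in index if c["tenant_id"] == tenant_id]
    ranked = sorted(candidates, key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    return ranked[:k]


# Toy 3-dimensional "embeddings"; a real system would call an embedding model.
index = [
    {"id": "a1", "tenant_id": "acme", "vec": [1.0, 0.0, 0.0], "text": "refund policy"},
    {"id": "a2", "tenant_id": "acme", "vec": [0.0, 1.0, 0.0], "text": "shipping times"},
    {"id": "b1", "tenant_id": "other", "vec": [1.0, 0.0, 0.0], "text": "private memo"},
]
hits = retrieve(index, query_vec=[0.9, 0.1, 0.0], tenant_id="acme", k=1)
# Chunk ids travel with the text so the UI can render citations.
prompt_context = "\n".join(f"[{h['id']}] {h['text']}" for h in hits)
```

Filtering by tenant before ranking, rather than after, keeps another tenant's chunk from ever entering the candidate set even when it would score highest.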
Tools expose structured actions (search web, run Python, call CRM). The model emits tool_calls; orchestration executes them and feeds tool role messages back—another prefill/decode cycle.
| Pattern | When to use | Cost driver |
|---|---|---|
| Single-shot chat | FAQ, drafting | One prefill + decode |
| RAG | Enterprise docs, support | Embedding search + longer prompt |
| Agent loop | Multi-step research | N × (tool latency + inference) |
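The agent-loop row, combined with the cap on tool rounds from the functional requirements, can be sketched as a bounded loop. `model_step` is a hypothetical callable standing in for a full inference call, and the message shapes are simplified (a real API also records the assistant's tool-call message).

```python
MAX_TOOL_ROUNDS = 10  # cap taken from the "max 10 tool rounds" example


def run_agent(model_step, tools: dict, messages: list[dict],
              max_rounds: int = MAX_TOOL_ROUNDS) -> list[dict]:
    """Alternate model calls and tool executions until the model stops or the cap hits.

    `model_step(messages)` is a hypothetical callable returning either
    {"content": str} or {"tool_call": {"name": str, "args": dict}}.
    """
    for _ in range(max_rounds):
        reply = model_step(messages)  # one prefill + decode cycle
        if "tool_call" not in reply:
            messages.append({"role": "assistant", "content": reply["content"]})
            return messages
        call = reply["tool_call"]
        result = tools[call["name"]](**call["args"])
        messages.append({"role": "tool", "name": call["name"], "content": str(result)})
    # Budget exhausted: stop instead of looping forever.
    messages.append({"role": "assistant", "content": "Stopped: tool budget exhausted."})
    return messages


def fake_model(messages: list[dict]) -> dict:
    # Toy policy: call the calculator once, then answer with the tool result.
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if not tool_msgs:
        return {"tool_call": {"name": "add", "args": {"a": 2, "b": 3}}}
    return {"content": f"The sum is {tool_msgs[-1]['content']}."}


history = run_agent(fake_model, {"add": lambda a, b: a + b},
                    [{"role": "user", "content": "What is 2 + 3?"}])
```

Each pass through the loop is a full inference call plus tool latency, so the cap bounds both cost and worst-case response time.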
Step 10 — Streaming on the wire (SSE)
Chat UIs use Server-Sent Events over HTTP: the client holds a long-lived response; the server pushes lines like data: {...}.
Each chunk carries a JSON fragment with choices[0].delta.content. The final chunk sets finish_reason (for example stop or tool_calls), and the stream then ends with a literal data: [DONE] sentinel.
Backpressure: if the client renders slowly, TCP buffers fill; the server should pause pulling from the GPU stream to avoid unbounded memory.
Heartbeat comments (: ping) keep intermediaries from closing idle connections.
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
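On the client side, consuming a stream like the one above comes down to filtering lines and concatenating deltas. The sketch below works on an in-memory list of lines rather than a live HTTP response, and `collect_stream` is a hypothetical helper name.

```python
import json
from typing import Iterable, Optional


def collect_stream(lines: Iterable[str]) -> tuple[str, Optional[str]]:
    """Concatenate delta.content from SSE lines; return (text, finish_reason)."""
    text, finish_reason = [], None
    for raw in lines:
        line = raw.strip()
        # Skip blank separators and comment lines such as ": ping" heartbeats.
        if not line or line.startswith(":"):
            continue
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        text.append(choice["delta"].get("content", ""))
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(text), finish_reason


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    ": ping",
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
text, reason = collect_stream(sample)
```

A UI would render each delta as it arrives instead of joining at the end; the parsing rules are the same.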
Step 11 — Safety and moderation pipeline
Safety is a pipeline, not a single model card disclaimer:
- Input filters — jailbreak patterns, CSAM hashes, malware in uploads.
- Policy in system prompt — refusals for disallowed advice categories.
- Output filters — classifiers on completed or partial text; truncate stream on violation.
- Abuse controls — IP/device rate limits, CAPTCHA ladders, account bans.
- Human review queue — sampled conversations for regression testing after model updates.
Log decision codes (blocked_input, blocked_output, allowed) with trace ids—not raw secrets—for audit without storing full prompts where policy forbids it.
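One possible shape for such an audit record is sketched below. The field names are assumptions, and a plain SHA-256 digest is used only for brevity: short prompts can be guessed from an unkeyed hash, so a keyed hash may be the safer choice where that matters.

```python
import hashlib
import json

DECISION_CODES = {"blocked_input", "blocked_output", "allowed"}


def safety_log_entry(trace_id: str, decision: str, prompt: str) -> str:
    """Build an audit record that identifies the prompt without storing it."""
    if decision not in DECISION_CODES:
        raise ValueError(f"unknown decision code: {decision}")
    record = {
        "trace_id": trace_id,
        "decision": decision,
        # A digest lets auditors match repeated prompts without the raw text.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
    }
    return json.dumps(record, sort_keys=True)


line = safety_log_entry("trace-123", "blocked_input", "some sensitive text")
```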
Step 12 — Conversation storage and consistency
| Data | Store | Access pattern |
|---|---|---|
| Transcript rows | Sharded SQL / NoSQL | Append by conversation_id; paginate history |
| Hot context | Redis | GET last K messages; TTL on idle chats |
| Uploads | Object storage | Large blobs; virus scan async |
| Embeddings | Vector DB | ANN search by user/tenant id filter |
| Usage / billing | Columnar warehouse | Append-only events from gateway |
Invariants: (conversation_id, message_seq) unique; request_id dedupes retries; soft-delete conversations with tombstones so backups and search indexes converge.
Step 13 — Multimodal side paths
Images, audio, and generated pictures are side doors—do not run them through the same hot path as a one-line text reply unless you accept very different SLOs.
- Vision — resize/chunk image → vision encoder → tokens appended to prompt; higher prefill cost.
- Whisper-class ASR — separate GPU job; returns text then normal chat proceeds.
- Image generation — diffusion cluster; async job id + polling or push notification; not token streaming.
flowchart LR
TEXT[Text chat] --> ORCH[Orchestrator]
IMG[Image upload] --> ENC[Vision encoder] --> ORCH
MIC[Voice] --> ASR[ASR service] --> ORCH
ORCH --> GW[Text LLM gateway]
Step 14 — Scale: queues, rate limits, and fleet growth
- Admission control — when GPU queue depth exceeds threshold, return 503 with Retry-After instead of multi-minute TTFT.
- Priority tiers — paid users vs free; separate queues or weighted fair queuing.
- Autoscale — horizontal autoscaling (for example a Kubernetes HPA) on queue lag and GPU utilization; cold starts measured in minutes for large models.
- Regional cells — EU data stays in EU inference + storage; control plane routes by tenant residency.
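Admission control and priority tiers from the list above can be combined in a small decision function. The thresholds (free traffic limited to half the queue, 0.5 s per queued job) are made-up assumptions; the shape of the decision is what matters.

```python
import math


def admit(queue_depth: int, tier: str, max_depth: int = 100,
          seconds_per_job: float = 0.5) -> dict:
    """Decide whether to enqueue a request or shed it with 503 + Retry-After.

    Thresholds are illustrative: paid traffic may use the whole queue,
    free traffic only the first half.
    """
    limit = max_depth if tier == "paid" else max_depth // 2
    if queue_depth < limit:
        return {"admitted": True}
    # Suggest a wait proportional to the backlog above the tier's limit.
    backlog = queue_depth - limit + 1
    retry_after = max(1, math.ceil(backlog * seconds_per_job))
    return {"admitted": False, "status": 503,
            "headers": {"Retry-After": str(retry_after)}}


free_ok = admit(queue_depth=10, tier="free")
free_shed = admit(queue_depth=60, tier="free")
paid_ok = admit(queue_depth=60, tier="paid")
```

Shedding early with an explicit Retry-After keeps TTFT bounded for the requests that are admitted, instead of letting every request wait in a queue that is already too long.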
Step 15 — Technical layer: APIs and payloads
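Many public products expose an OpenAI-compatible Chat Completions surface. Internal traffic is often gRPC between BFF, orchestrator, and gateway. Fields such as conversation_id in the examples here are illustrative extensions, not part of the OpenAI Chat Completions API.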
| Operation | HTTP | Success | Notes |
|---|---|---|---|
| Create completion | POST /v1/chat/completions | 200 (stream or JSON body) | Authorization: Bearer; body includes messages[], model, stream |
| List models | GET /v1/models | 200 | Capability discovery for clients |
| Upload file (RAG) | POST /v1/files | 200 + file_id | Async processing before attachable to assistant |
| Rate limited | any | 429 | Headers x-ratelimit-remaining, retry-after |
Non-streaming request (illustrative):
POST /v1/chat/completions HTTP/1.1
Host: api.example.com
Authorization: Bearer sk-…
Content-Type: application/json
X-Request-Id: 7f3c2a1b-…

{
"model": "gpt-4o",
"conversation_id": "conv_01HQ…",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain KV cache in two paragraphs."}
],
"stream": false,
"max_tokens": 512,
"temperature": 0.7
}
Tool call fragment in a stream chunk:
data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"id":"call_abc","type":"function","function":{"name":"search_web","arguments":"{\"q\":\""}}]}}]}
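Because the arguments string arrives in fragments, the client or orchestrator has to stitch tool calls back together before executing them. The sketch below merges deltas by their index; the second chunk is an invented continuation of the fragment above, added only to complete the example.

```python
import json


def assemble_tool_calls(chunks: list[dict]) -> list[dict]:
    """Merge streamed tool_call deltas by index into complete calls."""
    calls: dict[int, dict] = {}
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        for part in delta.get("tool_calls", []):
            call = calls.setdefault(part["index"],
                                    {"id": None, "name": None, "arguments": ""})
            call["id"] = part.get("id") or call["id"]
            fn = part.get("function", {})
            call["name"] = fn.get("name") or call["name"]
            # Argument JSON arrives as string fragments; concatenate, parse at the end.
            call["arguments"] += fn.get("arguments", "")
    return [
        {"id": c["id"], "name": c["name"], "arguments": json.loads(c["arguments"])}
        for _, c in sorted(calls.items())
    ]


chunks = [
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "id": "call_abc", "type": "function",
         "function": {"name": "search_web", "arguments": "{\"q\":\""}}]}}]},
    # Hypothetical follow-up chunk carrying the rest of the arguments string.
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": "kv cache\"}"}}]}}]},
]
calls = assemble_tool_calls(chunks)
```

Parsing only after the stream signals finish_reason: tool_calls avoids acting on half-written JSON.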
Core tables (logical schema)
users(id, plan, created_at)
conversations(id, user_id, title, created_at, deleted_at)
messages(id, conversation_id, seq, role, content_json, token_count, created_at)
usage_events(id, user_id, model, input_tokens, output_tokens, request_id, ts)
files(id, user_id, object_key, status, sha256)
Step 16 — Reliability, observability, and cost
Reliability
- Idempotent writes on request_id before charging tokens.
- Graceful degradation — route to smaller model; shorten context; disable tools under load.
- Partial stream failure — mark assistant message status=incomplete so the UI can offer retry or continue.
Observability
- Trace: BFF → orchestrator → gateway → GPU pod with conversation_id, model, prompt_tokens.
- Metrics: TTFT histogram, inter-token latency, queue depth, GPU KV cache usage, 429 rate.
- Logs: structured JSON; redact message bodies in production where required.
Cost controls
- Meter every completion at gateway; attribute to tenant for chargeback.
- Cache embeddings for unchanged documents; cache repeated system prompts where safe.
- Batch offline eval jobs on spot GPUs separate from interactive pools.
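Metering and tenant attribution from the list above can be sketched as a fold over usage events. The tier names, the prices per million tokens, and the `tenant` field are hypothetical values chosen for easy arithmetic, not real vendor pricing.

```python
from collections import defaultdict

# Hypothetical USD prices per 1M tokens: (input, output). Not real vendor pricing.
PRICE_PER_M = {"capable": (5.0, 15.0), "fast": (0.5, 1.5)}


def cost_by_tenant(usage_events: list[dict]) -> dict[str, float]:
    """Sum inference cost per tenant from append-only usage events."""
    totals: dict[str, float] = defaultdict(float)
    seen_requests = set()
    for ev in usage_events:
        # request_id dedupe: a retried turn must not be billed twice.
        if ev["request_id"] in seen_requests:
            continue
        seen_requests.add(ev["request_id"])
        in_price, out_price = PRICE_PER_M[ev["model"]]
        cost = (ev["input_tokens"] * in_price + ev["output_tokens"] * out_price) / 1_000_000
        totals[ev["tenant"]] += cost
    return dict(totals)


events = [
    {"request_id": "r1", "tenant": "acme", "model": "capable",
     "input_tokens": 2_000, "output_tokens": 500},
    {"request_id": "r1", "tenant": "acme", "model": "capable",  # duplicate delivery
     "input_tokens": 2_000, "output_tokens": 500},
    {"request_id": "r2", "tenant": "beta", "model": "fast",
     "input_tokens": 2_000, "output_tokens": 500},
]
totals = cost_by_tenant(events)
```

At these made-up prices the same 2k-in, 500-out turn costs ten times more on the larger tier, which is the arithmetic behind routing easy prompts to smaller models.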
Step 17 — Goals → knobs (quick reference)
| Goal | Knob |
|---|---|
| Answers feel instant | Smaller default model, regional GPUs, prompt compression, cap tools |
| High quality on hard tasks | Route to larger model; allow longer decode; multi-step agent with budget |
| Stays profitable | Token metering, batching, distillation, aggressive caching of RAG |
| Grounded in company data | Fresh RAG index, citations in UI, permission filters on retrieval |
| Safe at scale | Layered moderation, rate limits, human eval on model version bumps |
Step 18 — Close the loop (what to practice)
On a whiteboard: three planes, one chat turn, label Postgres vs Redis vs GPU gateway on each step.
Out loud: five functional requirements and which non-functional target applies to orchestration vs inference.
With the technical section: trace POST /v1/chat/completions with stream: true through SSE to persistence.
The one line to remember
ChatGPT-class products are three systems behind one text box: orchestration (who you are, what you said, what tools apply), inference (GPUs turning tokens into language), and experience (streaming UI that hides the machinery). Draw the plane boundaries first and the rest of the design stays teachable at global scale.