How the ChatGPT app works at scale

A consumer chat product looks like “type a question, get an answer.” Behind that sentence are three engineering planes: an experience plane (apps that stream partial text), an orchestration plane (auth, history, tools, policy), and an inference plane (GPU fleets that turn tokens into language under strict latency and cost budgets).

We build the design in order—requirements first, numbers second, architecture third, APIs last—using a ChatGPT-class product as the mental model, not any one vendor’s private implementation.

What you should be able to do after reading:

Separate the three planes and say which store or service belongs on each.
List functional and non-functional requirements with realistic TTFT and cost targets.
Walk one chat turn: load history → assemble prompt → prefill → decode stream → persist → bill.
Explain KV cache, continuous batching, and why tool calls are not “just another HTTP hop.”
Read the technical section: Chat Completions, SSE chunks, conversation ids, and gateway routing.

Step 0 — How we will work through the problem

Ordered thinking beats memorizing a box diagram. Use this sequence when you design a large-language-model chat product:

Clarify scope. Text only or multimodal? Consumer or enterprise with tenant isolation? Tools and RAG in scope?
Write requirements. Functional = what users see. Non-functional = latency to first token, availability, privacy, unit economics.
Do napkin math. Daily active users, messages per day, tokens per turn, GPU memory per request—so nobody assumes one GPU serves the planet.
Draw three planes before naming vLLM, Pinecone, or Redis.
Tell one story—a normal turn, then a tool turn, then a retry after a provider timeout.

flowchart LR
  subgraph exp [Experience plane]
    UI[Web / mobile / desktop]
  end
  subgraph orch [Orchestration plane]
    BFF[BFF + auth]
    OR[History tools policy]
  end
  subgraph inf [Inference plane]
    GW[Model gateway]
    GPU[GPU workers]
  end
  UI --> BFF --> OR --> GW --> GPU
  OR --> PG[(Conversation store)]
  OR --> VDB[(Vector index)]

Step 1 — Functional requirements (what users need)

Functional requirements describe behavior the product must ship. Missing one is a product bug, not a tuning exercise.

Area	Requirement	Why scale makes it hard
Chat	Multi-turn conversation with streaming replies	Every turn reloads context; long threads blow the context window
History	List, rename, delete conversations; resume on any device	Durable store + hot cache; concurrent tabs
Models	Pick capability tier (fast vs capable)	Different GPU pools, routing, and price per token
Files	Upload PDFs, images, spreadsheets for Q&A	Parsing, chunking, and vision encoders are separate heavy paths
Tools	Browsing, code execution, plugins, custom GPT actions	Multi-step loops; each tool adds latency and failure modes
Memory (product)	Optional “remember this about me” across chats	Needs consent, deletion, and retrieval guardrails
Account	Sign-in, subscription tier, usage limits	Rate limits and billing must be consistent globally
Voice (optional)	Speech in / speech out	ASR + TTS pipelines; not the same SLO as text

Functional details worth stating clearly

Idempotent turns. Clients send a request_id or idempotency key so retries after network blips do not double-charge or double-post assistant messages.
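
The idempotent-turn rule above can be sketched in a few lines. This is a minimal illustration, not a production design: `TurnStore` and `handle_turn` are hypothetical names, and an in-memory dict stands in for whatever durable store would hold results keyed by request_id.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TurnStore:
    """In-memory stand-in for a durable table keyed by request_id (assumption)."""
    results: Dict[str, str] = field(default_factory=dict)
    charges: int = 0  # how many turns were actually billed


def handle_turn(store: TurnStore, request_id: str,
                generate: Callable[[], str]) -> str:
    """Run the turn once per request_id; replay the stored reply on retries."""
    if request_id in store.results:
        # Retry after a network blip: no new charge, no second assistant message.
        return store.results[request_id]
    reply = generate()
    store.results[request_id] = reply
    store.charges += 1
    return reply


store = TurnStore()
first = handle_turn(store, "req-123", lambda: "Hello!")
retry = handle_turn(store, "req-123", lambda: "A different answer")
print(first, retry, store.charges)
```

With concurrent retries, the check-then-write would need to be atomic (for example a unique constraint on request_id); the sketch ignores that.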

Tool loops are bounded. Orchestration caps steps (for example max 10 tool rounds) so a confused model cannot spawn an unbounded chain of jobs.

Out of scope today (say it aloud). Training the foundation model from scratch, federated on-device training, or real-time video generation—park them so the design stays focused.

Step 2 — Non-functional requirements (engineering promises)

Category	Target (typical)	How we meet it	If we miss it
Latency — TTFT	p95 < 1–3 s to first token (model dependent)	Warm GPU pools, short queues, prompt compression	Feels “broken” even if total answer is fine
Latency — streaming	Steady token cadence (no 5 s gaps)	SSE flush, backpressure, avoid blocking on tools mid-stream	Users abandon mid-answer
Availability	99.9%+ for chat API monthly	Multi-region gateways, model fallbacks	Global outage headlines
Durability	Conversation log rarely lost	Write assistant draft to DB before closing stream	Trust collapse (“it forgot our chat”)
Privacy	TLS in transit; retention policy; enterprise isolation	Encryption, tenant-scoped indexes, deletion jobs	Regulatory and PR risk
Cost	Revenue per user > inference + storage	Routing to smaller models, caching, batching	Unsustainable burn
Safety	Block policy violations before and after generation	Input classifiers + output filters + abuse rate limits	Harmful content at scale

Key idea: Time-to-first-token is the UX metric users feel; tokens-per-second is the throughput metric finance feels. Optimize both, not one alone.

Step 3 — Napkin math (why one GPU is not enough)

Round numbers. Multiply in the open—you are showing magnitude, not audited financials.

~100M+ weekly active users on a flagship consumer chat product (order of magnitude).
~1B user messages per day → ~12k new chat turns per second on average (higher at peak).
Assume 2k input + 500 output tokens per turn for a capable model → ~2.5k tokens billed per turn.
Daily token volume ≈ 2.5 trillion tokens/day (1B turns × 2.5k tokens). This is an inference load, not a storage one.
One H100-class GPU might sustain on the order of hundreds of decode tokens/s for a 70B-class model with batching (highly workload dependent). At 500 output tokens per turn, 1B turns/day is roughly 5.8M decode tokens/s on average, so the honest fleet picture is on the order of ten thousand GPUs for decode alone, before peak load, redundancy, and multi-model sprawl.

Storage is cheaper than inference but not free: 500 bytes–2 KB metadata per message row × billions of rows → sharded SQL or NoSQL, tiered to cold storage for old threads.
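
The arithmetic above is easy to reproduce. The sketch below uses the round numbers from this step; the per-GPU decode rate (500 tokens/s) and the 1 KB row size are assumptions picked from inside the stated ranges, so treat the outputs as magnitudes only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 1_000_000_000
input_tokens = 2_000
output_tokens = 500
decode_tokens_per_gpu_per_s = 500   # assumption: "hundreds" per GPU with batching
metadata_bytes_per_row = 1_000      # assumption: middle of the 500 B to 2 KB range

turns_per_second = messages_per_day / SECONDS_PER_DAY
tokens_per_day = messages_per_day * (input_tokens + output_tokens)
decode_tokens_per_second = messages_per_day * output_tokens / SECONDS_PER_DAY
decode_gpus = decode_tokens_per_second / decode_tokens_per_gpu_per_s
metadata_tb_per_day = messages_per_day * metadata_bytes_per_row / 1e12

print(f"turns/s (average):         {turns_per_second:,.0f}")
print(f"tokens/day:                {tokens_per_day:,}")
print(f"decode tokens/s (average): {decode_tokens_per_second:,.0f}")
print(f"GPUs for decode alone:     {decode_gpus:,.0f}")
print(f"row metadata per day (TB): {metadata_tb_per_day:.1f}")
```

Peak traffic, prefill work, redundancy, and extra models all push the GPU count higher than this average-load figure.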

Step 4 — Architecture: three planes

Draw clients on the left, stateless API replicas in the middle, GPU inference on the right. Under orchestration: Postgres (or similar) for transcripts, Redis for session hot state, object storage for uploads, vector DB for RAG. The inference plane is reached only through a gateway that owns keys, routing, retries, and metering.

flowchart TB
  subgraph clients [Clients]
    WEB[Web]
    IOS[iOS / Android]
    API[Third-party API]
  end
  subgraph edge [Orchestration]
    LB[Load balancer]
    CHAT[Chat BFF]
    ORCH[Orchestrator]
    MOD[Safety filters]
  end
  subgraph data [Data]
    PG[(Conversation DB)]
    R[("Redis hot context")]
    S3[(File uploads)]
    VEC[(Embeddings index)]
  end
  subgraph infer [Inference]
    GW[LLM gateway]
    PREFILL[Prefill workers]
    DECODE[Decode pool]
  end
  WEB --> LB
  IOS --> LB
  API --> LB
  LB --> CHAT --> ORCH
  ORCH --> MOD
  ORCH --> PG
  ORCH --> R
  ORCH --> S3
  ORCH --> VEC
  ORCH --> GW
  GW --> PREFILL
  GW --> DECODE

Step 5 — Walk one chat turn end to end

Follow a single user message through the system. Names are illustrative.

Client sends POST /v1/chat/completions with stream: true, conversation_id, and the new user message.
BFF validates JWT, checks subscription quota, loads rate-limit token bucket.
Orchestrator fetches last N turns from Redis; on miss, rebuilds from Postgres. Optionally retrieves RAG chunks from the vector index.
Prompt builder merges system policy, tool definitions, summaries of older turns, and the fresh user text into a token budget under the model limit.
Safety — input runs classifiers (policy, PII, jailbreak). Hard block returns a refusal without burning a large GPU job.
Gateway picks model route (for example gpt-4o vs gpt-4o-mini), enqueues on the inference scheduler.
GPU worker runs prefill (process entire prompt in parallel) then decode (autoregressive tokens). The KV cache stores attention keys/values for earlier tokens, so each new token attends over cached state instead of recomputing the whole prompt.
Stream emits SSE events: delta.content chunks until finish_reason: stop or tool_calls.
Persist appends user + assistant rows to Postgres; updates Redis; emits usage record (input/output tokens) for billing.
Safety — output may run async scans on the completed answer; revoke or flag if a late violation is detected.
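
The rate-limit token bucket mentioned in the BFF step can be sketched as follows. This is a minimal single-process version under stated assumptions: `TokenBucket` is a hypothetical class, the clock is passed in explicitly so the behavior is deterministic, and a real deployment would keep this state in a shared store.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `refill_per_s` tokens per second."""

    def __init__(self, capacity: float, refill_per_s: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = capacity  # start full
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request fits, else False."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=2, refill_per_s=1.0)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
print(decisions)
```

The third request at t=0 is rejected because the burst allowance is spent; one second later a single token has refilled.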

stateDiagram-v2
  [*] --> Auth
  Auth --> LoadHistory: OK
  LoadHistory --> BuildPrompt
  BuildPrompt --> InputSafety
  InputSafety --> Blocked: violation
  InputSafety --> Inference: pass
  Inference --> Streaming
  Streaming --> ToolLoop: tool_calls
  ToolLoop --> BuildPrompt
  Streaming --> Persist: stop
  Persist --> [*]
  Blocked --> [*]

Step 6 — Context, memory, and summarization

Models have a fixed context window (for example 128k tokens). You cannot paste a ten-year transcript every turn. Production systems use a layered memory strategy:

Hot window: last K turns verbatim in Redis for fast assembly.
Summary block: older turns collapsed by a smaller model into a few paragraphs stored as a special message row.
RAG snippets: retrieved chunks inserted with citations, not the entire corpus.
Product memory: user-approved facts in a separate table with explicit delete and export.
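
One way to combine these layers under a token budget is sketched below. It is an illustration only: `build_prompt` is a hypothetical helper, token counting is approximated by whitespace-separated words, and the priority order (system and user text first, then summary and RAG snippets, then the newest verbatim turns) is an assumption rather than a prescribed policy.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude proxy: one token per whitespace-separated word (assumption)."""
    return len(text.split())


def build_prompt(system: str, summary: str, rag_snippets: List[str],
                 hot_turns: List[str], user_text: str, budget: int) -> List[str]:
    """Assemble a prompt under `budget` tokens, dropping the oldest hot turns first."""
    used = count_tokens(system) + count_tokens(user_text)
    if used > budget:
        raise ValueError("system prompt and user message alone exceed the budget")

    optional: List[str] = []
    for block in [summary, *rag_snippets]:
        cost = count_tokens(block)
        if block and used + cost <= budget:
            optional.append(block)
            used += cost

    kept_turns: List[str] = []
    for turn in reversed(hot_turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break  # older turns are dropped; the summary stands in for them
        kept_turns.append(turn)
        used += cost
    kept_turns.reverse()  # restore chronological order

    return [system, *optional, *kept_turns, user_text]


prompt = build_prompt(
    system="You are helpful.",
    summary="Earlier the user asked about GPUs.",
    rag_snippets=["Doc: H100 has 80 GB HBM."],
    hot_turns=["user: hi there",
               "assistant: hello, how can I help?",
               "user: tell me about KV cache"],
    user_text="And how big is it?",
    budget=30,
)
print(prompt)
```

With a budget of 30 word-tokens, only the most recent verbatim turn survives; the two older turns are dropped.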

Concurrency: two tabs appending to the same conversation_id need monotonic message_seq (compare-and-swap or DB transaction) so ordering stays coherent.

Step 7 — Inference serving: prefill, decode, KV cache

Transformer inference splits into two phases with different hardware appetites:

Prefill — process the whole prompt at once; compute-bound; builds the KV cache.
Decode — generate one token at a time; memory-bandwidth-bound; reuses KV cache.

KV cache stores per-layer key/value tensors for prior tokens so each decode step computes keys and values only for the newest token and attends over the cached prefix. Memory grows with sequence length × batch size (and with layer count and KV head width), which is why long contexts and large batches make HBM size matter.
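
The memory growth described above follows from a simple product. The sketch below assumes an illustrative shape for a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128) and 16-bit cache values; these figures are assumptions for the sake of the arithmetic, not a statement about any particular deployment.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Bytes held by the KV cache: two tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Illustrative 70B-class shape (assumption).
shape = dict(layers=80, kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(**shape, seq_len=1, batch_size=1)
one_request = kv_cache_bytes(**shape, seq_len=4096, batch_size=1)
batch_of_32 = kv_cache_bytes(**shape, seq_len=4096, batch_size=32)

print(f"per token:       {per_token / 2**10:.0f} KiB")
print(f"one 4k request:  {one_request / 2**30:.2f} GiB")
print(f"32 x 4k batch:   {batch_of_32 / 2**30:.0f} GiB")
```

Under these assumptions a batch of 32 requests at 4k tokens each needs about 40 GiB of cache on top of the model weights.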

Continuous batching (vLLM, TensorRT-LLM, similar) dynamically adds and removes requests in a batch instead of waiting for every stream to finish—raising GPU utilization without sacrificing streaming UX.
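
A toy simulation shows why refilling slots helps. It is not how vLLM or TensorRT-LLM are implemented: it counts decode steps only, ignores prefill and memory limits, and the two function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Freed slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    active: List[int] = []  # remaining output tokens per in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1  # one decode step emits one token for every active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


output_lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(output_lengths, batch_size=4)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this example the static schedule needs 200 decode steps because short requests wait on the long one in their batch, while the continuous schedule finishes in 110.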

Sanity check: If p95 TTFT spikes but decode tokens/s is flat, suspect queueing or prefill saturation—not “the model got dumber.”

Step 8 — Model gateway and routing

Every internal service should talk to a gateway, not directly to fifteen vendor endpoints. The gateway owns:

Routing — map product SKU to model id; fall back from primary to secondary on timeout.
Retries — idempotent for safe reads; careful on partial streams (do not double-send visible tokens to the client).
Quotas — per user, per org, per API key; shed load with 429 before GPUs melt.
Observability — one trace id from BFF through gateway to GPU pod.
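
The routing and fallback responsibilities can be sketched like this. Everything named here is hypothetical (the `ROUTES` table, the model ids, `ModelTimeout`, and the `call_model` callable); the point is only the shape of "try primary, fall back on timeout".

```python
from typing import Callable, Dict, List


class ModelTimeout(Exception):
    """Raised by a backend that did not answer in time (hypothetical)."""


# Product SKU -> ordered list of model ids to try (illustrative names).
ROUTES: Dict[str, List[str]] = {
    "fast": ["small-model"],
    "capable": ["large-model", "small-model"],
}


def route_completion(sku: str, prompt: str,
                     call_model: Callable[[str, str], str]) -> Dict[str, str]:
    """Try each model for the SKU in order; move on only when one times out."""
    last_error = None
    for model_id in ROUTES[sku]:
        try:
            return {"model": model_id, "text": call_model(model_id, prompt)}
        except ModelTimeout as exc:
            last_error = exc
    raise RuntimeError(f"all routes for {sku!r} timed out") from last_error


def fake_backend(model_id: str, prompt: str) -> str:
    """Stand-in backend where the primary model always times out."""
    if model_id == "large-model":
        raise ModelTimeout(model_id)
    return f"[{model_id}] reply to: {prompt}"


result = route_completion("capable", "hi", fake_backend)
print(result)
```

This kind of fallback is only safe before any tokens have reached the client; once a stream is partially delivered, switching models would need the partial-stream handling described above.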

Speculative routing: try a small model first for classification or a draft; escalate to a larger model only when confidence is low. On easy prompts this accepts slightly lower quality in exchange for much lower cost.
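
A small-model-first cascade might look like the sketch below. How a confidence score is obtained is left open here: the `small` callable simply returns one, and both fake models and the 0.8 threshold are assumptions for illustration.

```python
from typing import Callable, Tuple


def cascade(prompt: str,
            small: Callable[[str], Tuple[str, float]],
            large: Callable[[str], str],
            threshold: float = 0.8) -> Tuple[str, str]:
    """Return (tier, answer); escalate when the small model is not confident."""
    draft, confidence = small(prompt)
    if confidence >= threshold:
        return "small", draft
    return "large", large(prompt)


def fake_small(prompt: str) -> Tuple[str, float]:
    """Stand-in: pretend prompts of five words or fewer are easy."""
    confidence = 0.95 if len(prompt.split()) <= 5 else 0.4
    return f"small:{prompt}", confidence


def fake_large(prompt: str) -> str:
    return f"large:{prompt}"


easy = cascade("What is 2 + 2?", fake_small, fake_large)
hard = cascade("Compare three consensus protocols under network partitions",
               fake_small, fake_large)
print(easy[0], hard[0])
```

Note that an escalated prompt pays for both models, so the threshold decides whether the cascade saves money overall.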

Step 9 — RAG, tools, and agent loops

RAG (retrieval-augmented generation) grounds answers in your documents: ingest → chunk → embed → index → retrieve top-k at query time → inject into prompt. The data platform (ingestion workers, embedding jobs) is off the hot path but must stay fresh; stale indexes silently lie.
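
The retrieve-and-inject half of that pipeline can be sketched without any external service. The embedding here is a toy bag-of-words count, which is an assumption made to keep the example self-contained; a real system would use a learned encoder and an approximate nearest-neighbor index.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (not a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    scored = sorted(((cosine(query_vec, embed(c)), c) for c in chunks), reverse=True)
    return scored[:k]


chunks = [
    "refunds are processed within five days",
    "the office is closed on public holidays",
    "contact support to request refunds",
]
top = retrieve("how are refunds processed", chunks, k=2)
context = "\n".join(f"[{i + 1}] {chunk}" for i, (_, chunk) in enumerate(top))
prompt = f"Answer using only these sources:\n{context}\n\nQuestion: how are refunds processed"
print(prompt)
```

Numbering the injected chunks gives the model something to cite, which is what lets the UI show sources.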

Tools expose structured actions (search web, run Python, call CRM). The model emits tool_calls; orchestration executes them and feeds tool role messages back—another prefill/decode cycle.
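
A bounded version of that loop is sketched below. The message and reply shapes are simplified assumptions (a reply is a dict with either a "tool" key or a "content" key), and `run_agent`, `scripted_model`, and the tool table are hypothetical names.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_agent(messages: List[Message],
              model: Callable[[List[Message]], Dict[str, str]],
              tools: Dict[str, Callable[[str], str]],
              max_rounds: int = 10) -> str:
    """Call the model, execute requested tools, and stop after max_rounds."""
    for _ in range(max_rounds):
        reply = model(messages)          # one prefill/decode cycle
        if "tool" not in reply:
            return reply["content"]      # final answer, no tool requested
        result = tools[reply["tool"]](reply["arguments"])
        messages.append({"role": "assistant", "content": f"call {reply['tool']}"})
        messages.append({"role": "tool", "content": result})
    return "Stopped: tool budget exhausted."


def scripted_model(messages: List[Message]) -> Dict[str, str]:
    """Stand-in model: search once, then answer from the tool result."""
    tool_results = [m["content"] for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "search_web", "arguments": "kv cache"}
    return {"content": f"Based on search: {tool_results[-1]}"}


tools = {"search_web": lambda query: f"3 results for {query!r}"}
answer = run_agent([{"role": "user", "content": "What is a KV cache?"}],
                   scripted_model, tools)

# A model that never stops asking for tools is cut off at the cap.
stuck = run_agent([{"role": "user", "content": "loop"}],
                  lambda msgs: {"tool": "search_web", "arguments": "again"},
                  tools, max_rounds=3)
print(answer)
print(stuck)
```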

Pattern	When to use	Cost driver
Single-shot chat	FAQ, drafting	One prefill + decode
RAG	Enterprise docs, support	Embedding search + longer prompt
Agent loop	Multi-step research	N × (tool latency + inference)

Step 10 — Streaming on the wire (SSE)

Chat UIs use Server-Sent Events over HTTP: the client holds a long-lived response; the server pushes lines like data: {...}. Each chunk carries a JSON fragment with choices[0].delta.content. The final content chunk carries a non-null finish_reason, and the stream then ends with a data: [DONE] sentinel.

Backpressure: if the client renders slowly, TCP buffers fill; the server should pause pulling from the GPU stream to avoid unbounded memory. Heartbeat comments (: ping) keep intermediaries from closing idle connections.

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
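
On the client side, a stream like the one above can be consumed with a small line parser. This is a sketch that assumes one JSON payload per data: line and the chunk shape shown in the example; `read_stream` is a hypothetical helper, and a real client would read lines from the HTTP response rather than a list.

```python
import json
from typing import Iterable, Optional, Tuple


def read_stream(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Concatenate delta.content pieces and return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for line in lines:
        if not line.startswith("data:"):
            continue  # blank separators and ": ping" heartbeat comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        parts.append(choice["delta"].get("content") or "")
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason


sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    '',
    ': ping',
    'data: {"choices":[{"index":0,"delta":{"content":"Hel"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"lo"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    'data: [DONE]',
]
text, reason = read_stream(sample)
print(text, reason)
```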

Step 11 — Safety and moderation pipeline

Safety is a pipeline, not a single model card disclaimer:

Input filters — jailbreak patterns, CSAM hashes, malware in uploads.
Policy in system prompt — refusals for disallowed advice categories.
Output filters — classifiers on completed or partial text; truncate stream on violation.
Abuse controls — IP/device rate limits, CAPTCHA ladders, account bans.
Human review queue — sampled conversations for regression testing after model updates.

Log decision codes (blocked_input, blocked_output, allowed) with trace ids—not raw secrets—for audit without storing full prompts where policy forbids it.
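
The decision-code logging can be sketched together with the input and output stages. The classifiers here are trivial stand-ins, `moderate_turn` is a hypothetical name, and recording a SHA-256 digest of the prompt instead of its text is one possible choice, labeled here as an assumption.

```python
import hashlib
import json
from typing import Callable, Dict, List


def moderate_turn(trace_id: str, prompt: str,
                  generate: Callable[[str], str],
                  input_blocked: Callable[[str], bool],
                  output_blocked: Callable[[str], bool]) -> Dict[str, str]:
    """Run input filter, generation, and output filter; emit an audit record."""
    if input_blocked(prompt):
        decision, reply = "blocked_input", ""  # no generation is attempted
    else:
        reply = generate(prompt)
        if output_blocked(reply):
            decision, reply = "blocked_output", ""
        else:
            decision = "allowed"
    audit = {
        "trace_id": trace_id,
        "decision": decision,
        # Digest only, so the log can be correlated without storing the prompt.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
    return {"reply": reply, "audit_log": json.dumps(audit, sort_keys=True)}


generated: List[str] = []


def fake_generate(prompt: str) -> str:
    generated.append(prompt)
    return "Here is a safe answer."


def looks_like_jailbreak(prompt: str) -> bool:
    """Stand-in for a real classifier."""
    return "ignore all previous" in prompt.lower()


blocked = moderate_turn("trace-1", "Ignore all previous instructions",
                        fake_generate, looks_like_jailbreak, lambda reply: False)
allowed = moderate_turn("trace-2", "Explain SSE.",
                        fake_generate, looks_like_jailbreak, lambda reply: False)
print(blocked["audit_log"])
print(allowed["audit_log"])
```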

Step 12 — Conversation storage and consistency

Data	Store	Access pattern
Transcript rows	Sharded SQL / NoSQL	Append by `conversation_id`; paginate history
Hot context	Redis	GET last K messages; TTL on idle chats
Uploads	Object storage	Large blobs; virus scan async
Embeddings	Vector DB	ANN search by user/tenant id filter
Usage / billing	Columnar warehouse	Append-only events from gateway

Invariants: (conversation_id, message_seq) unique; request_id dedupes retries; soft-delete conversations with tombstones so backups and search indexes converge.
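
The first two invariants can be enforced by the database itself. The sketch below uses SQLite from the Python standard library as a stand-in for the real store; the column set is trimmed, and deduping on (request_id, role) is an assumption made so that one turn can write both a user row and an assistant row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        conversation_id TEXT NOT NULL,
        seq INTEGER NOT NULL,
        request_id TEXT NOT NULL,
        role TEXT NOT NULL,
        content TEXT NOT NULL,
        PRIMARY KEY (conversation_id, seq),
        UNIQUE (request_id, role)
    )
""")


def append_message(conn: sqlite3.Connection, conversation_id: str,
                   request_id: str, role: str, content: str) -> int:
    """Append with the next seq, or return the existing seq on a retried write."""
    with conn:  # one transaction: commit on success, roll back on error
        row = conn.execute(
            "SELECT seq FROM messages WHERE request_id = ? AND role = ?",
            (request_id, role)).fetchone()
        if row:
            return row[0]
        next_seq = conn.execute(
            "SELECT COALESCE(MAX(seq), 0) + 1 FROM messages WHERE conversation_id = ?",
            (conversation_id,)).fetchone()[0]
        conn.execute("INSERT INTO messages VALUES (?, ?, ?, ?, ?)",
                     (conversation_id, next_seq, request_id, role, content))
        return next_seq


user_seq = append_message(conn, "conv_1", "req-a", "user", "hi")
assistant_seq = append_message(conn, "conv_1", "req-a", "assistant", "hello")
retry_seq = append_message(conn, "conv_1", "req-a", "user", "hi")
row_count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(user_seq, assistant_seq, retry_seq, row_count)
```

If two writers race to the same seq, the primary key rejects the second insert and that writer can retry with a fresh MAX(seq); the sketch leaves that retry out.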

Step 13 — Multimodal side paths

Images, audio, and generated pictures are side doors—do not run them through the same hot path as a one-line text reply unless you accept very different SLOs.

Vision — resize/chunk image → vision encoder → tokens appended to prompt; higher prefill cost.
Whisper-class ASR — separate GPU job; returns text then normal chat proceeds.
Image generation — diffusion cluster; async job id + polling or push notification; not token streaming.

flowchart LR
  TEXT[Text chat] --> ORCH[Orchestrator]
  IMG[Image upload] --> ENC[Vision encoder] --> ORCH
  MIC[Voice] --> ASR[ASR service] --> ORCH
  ORCH --> GW[Text LLM gateway]

Step 14 — Scale: queues, rate limits, and fleet growth

Admission control — when GPU queue depth exceeds threshold, return 503 with Retry-After instead of multi-minute TTFT.
Priority tiers — paid users vs free; separate queues or weighted fair queuing.
Autoscale — HPA on queue lag and GPU utilization; cold starts measured in minutes for large models.
Regional cells — EU data stays in EU inference + storage; control plane routes by tenant residency.
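
Admission control and priority tiers can be combined in one small queue. This sketch uses strict priority (paid before free), which is the simplest option and can starve the free tier; weighted fair queuing would avoid that. `AdmissionQueue` and its return shape are hypothetical.

```python
import heapq
import itertools
from typing import Dict, List, Optional, Tuple


class AdmissionQueue:
    """Bounded priority queue that sheds load once max_depth is reached."""

    PRIORITY = {"paid": 0, "free": 1}  # lower number is served first

    def __init__(self, max_depth: int, retry_after_s: int = 5) -> None:
        self.max_depth = max_depth
        self.retry_after_s = retry_after_s
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = itertools.count()  # tie-breaker keeps FIFO order per tier

    def submit(self, request_id: str, tier: str) -> Optional[Dict[str, object]]:
        """Return None when queued, or a 503-style rejection when full."""
        if len(self._heap) >= self.max_depth:
            return {"status": 503,
                    "headers": {"Retry-After": str(self.retry_after_s)}}
        heapq.heappush(self._heap,
                       (self.PRIORITY[tier], next(self._arrival), request_id))
        return None

    def next_request(self) -> Optional[str]:
        """Pop the highest-priority, oldest request, or None if empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


queue = AdmissionQueue(max_depth=2)
first = queue.submit("free-1", "free")
second = queue.submit("paid-1", "paid")
rejected = queue.submit("free-2", "free")
served = [queue.next_request(), queue.next_request()]
print(rejected, served)
```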

Step 15 — Technical layer: APIs and payloads

Many public products expose an OpenAI-compatible Chat Completions surface. Internal traffic is often gRPC between BFF, orchestrator, and gateway.

Operation	HTTP	Success	Notes
Create completion	`POST /v1/chat/completions`	`200` (stream or JSON body)	`Authorization: Bearer`; body includes `messages[]`, `model`, `stream`
List models	`GET /v1/models`	`200`	Capability discovery for clients
Upload file (RAG)	`POST /v1/files`	`200` + `file_id`	Processed asynchronously before the file can be attached to the assistant
Rate limited	any	`429`	Headers `x-ratelimit-remaining`, `retry-after`

Non-streaming request (illustrative):

POST /v1/chat/completions HTTP/1.1
Host: api.example.com
Authorization: Bearer sk-…
Content-Type: application/json
X-Request-Id: 7f3c2a1b-…

{
  "model": "gpt-4o",
  "conversation_id": "conv_01HQ…",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain KV cache in two paragraphs."}
  ],
  "stream": false,
  "max_tokens": 512,
  "temperature": 0.7
}

Tool call fragment in a stream chunk:

data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"id":"call_abc","type":"function","function":{"name":"search_web","arguments":"{\"q\":\""}}]}}]}
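
Because the arguments string arrives in pieces, the orchestrator has to stitch fragments together before it can parse them. The sketch below assumes that later fragments repeat the same tool call index and carry further pieces of function.arguments; `assemble_tool_calls` is a hypothetical helper and the second chunk is invented to complete the example.

```python
import json
from typing import Dict, Iterable, List


def assemble_tool_calls(chunks: Iterable[dict]) -> List[dict]:
    """Merge streamed tool_call fragments by index and parse the JSON arguments."""
    calls: Dict[int, dict] = {}
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        for part in delta.get("tool_calls", []):
            call = calls.setdefault(part["index"],
                                    {"id": None, "name": None, "arguments": ""})
            call["id"] = part.get("id") or call["id"]
            function = part.get("function", {})
            call["name"] = function.get("name") or call["name"]
            call["arguments"] += function.get("arguments", "")
    return [
        {"id": c["id"], "name": c["name"], "arguments": json.loads(c["arguments"])}
        for _, c in sorted(calls.items())
    ]


chunks = [
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "id": "call_abc", "type": "function",
         "function": {"name": "search_web", "arguments": '{"q":"'}}]}}]},
    # Hypothetical follow-up fragment completing the arguments string.
    {"choices": [{"index": 0, "delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": 'kv cache"}'}}]}}]},
]
tool_calls = assemble_tool_calls(chunks)
print(tool_calls)
```

Parsing only after the stream signals finish_reason: tool_calls avoids trying to decode half of a JSON object.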

Core tables (logical schema)

users(id, plan, created_at)
conversations(id, user_id, title, created_at, deleted_at)
messages(id, conversation_id, seq, role, content_json, token_count, created_at)
usage_events(id, user_id, model, input_tokens, output_tokens, request_id, ts)
files(id, user_id, object_key, status, sha256)

Step 16 — Reliability, observability, and cost

Reliability

Idempotent writes on request_id before charging tokens.
Graceful degradation — route to smaller model; shorten context; disable tools under load.
Partial stream failure: mark the assistant message status=incomplete so the UI can offer retry or continue.

Observability

Trace: BFF → orchestrator → gateway → GPU pod with conversation_id, model, prompt_tokens.
Metrics: TTFT histogram, inter-token latency, queue depth, GPU KV cache usage, 429 rate.
Logs: structured JSON; redact message bodies in production where required.

Cost controls

Meter every completion at gateway; attribute to tenant for chargeback.
Cache embeddings for unchanged documents; cache repeated system prompts where safe.
Batch offline eval jobs on spot GPUs separate from interactive pools.

Step 17 — Goals → knobs (quick reference)

Goal	Knob
Answers feel instant	Smaller default model, regional GPUs, prompt compression, cap tools
High quality on hard tasks	Route to larger model; allow longer decode; multi-step agent with budget
Stays profitable	Token metering, batching, distillation, aggressive caching of RAG
Grounded in company data	Fresh RAG index, citations in UI, permission filters on retrieval
Safe at scale	Layered moderation, rate limits, human eval on model version bumps

Step 18 — Close the loop (what to practice)

On a whiteboard: three planes, one chat turn, label Postgres vs Redis vs GPU gateway on each step.

Out loud: five functional requirements and which non-functional target applies to orchestration vs inference.

With the technical section: trace POST /v1/chat/completions with stream: true through SSE to persistence.

The one line to remember

ChatGPT-class products are three systems behind one text box: orchestration (who you are, what you said, what tools apply), inference (GPUs turning tokens into language), and experience (streaming UI that hides the machinery). Draw the plane boundaries first and the rest of the design stays teachable at global scale.