sharpbyte.dev

How the ChatGPT app works at scale

A consumer chat product looks like “type a question, get an answer.” Behind that sentence are three engineering planes: an experience plane (apps that stream partial text), an orchestration plane (auth, history, tools, policy), and an inference plane (GPU fleets that turn tokens into language under strict latency and cost budgets).

We build the design in order—requirements first, numbers second, architecture third, APIs last—using a ChatGPT-class product as the mental model, not any one vendor’s private implementation.

What you should be able to do after reading:

Step 0 — How we will work through the problem

Ordered thinking beats memorizing a box diagram. Use this sequence when you design a large-language-model chat product:

  1. Clarify scope. Text only or multimodal? Consumer or enterprise with tenant isolation? Tools and RAG in scope?
  2. Write requirements. Functional = what users see. Non-functional = latency to first token, availability, privacy, unit economics.
  3. Do napkin math. Daily active users, messages per day, tokens per turn, GPU memory per request—so nobody assumes one GPU serves the planet.
  4. Draw three planes before naming vLLM, Pinecone, or Redis.
  5. Tell one story—a normal turn, then a tool turn, then a retry after a provider timeout.

flowchart LR
  subgraph exp [Experience plane]
    UI[Web / mobile / desktop]
  end
  subgraph orch [Orchestration plane]
    BFF[BFF + auth]
    OR[History tools policy]
  end
  subgraph inf [Inference plane]
    GW[Model gateway]
    GPU[GPU workers]
  end
  UI --> BFF --> OR --> GW --> GPU
  OR --> PG[(Conversation store)]
  OR --> VDB[(Vector index)]
    

Step 1 — Functional requirements (what users need)

Functional requirements describe behavior the product must ship. Missing one is a product bug, not a tuning exercise.

| Area | Requirement | Why scale makes it hard |
| --- | --- | --- |
| Chat | Multi-turn conversation with streaming replies | Every turn reloads context; long threads blow the context window |
| History | List, rename, delete conversations; resume on any device | Durable store + hot cache; concurrent tabs |
| Models | Pick capability tier (fast vs capable) | Different GPU pools, routing, and price per token |
| Files | Upload PDFs, images, spreadsheets for Q&A | Parsing, chunking, and vision encoders are separate heavy paths |
| Tools | Browsing, code execution, plugins, custom GPT actions | Multi-step loops; each tool adds latency and failure modes |
| Memory (product) | Optional “remember this about me” across chats | Needs consent, deletion, and retrieval guardrails |
| Account | Sign-in, subscription tier, usage limits | Rate limits and billing must be consistent globally |
| Voice (optional) | Speech in / speech out | ASR + TTS pipelines; not the same SLO as text |

Functional details worth stating clearly

Idempotent turns. Clients send a request_id or idempotency key so retries after network blips do not double-charge or double-post assistant messages.
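
A minimal sketch of the idempotency idea, assuming an in-memory dictionary stands in for a durable store. The `TurnStore` class, its `handle_turn` method, and the echo reply are hypothetical names for illustration, not any vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class TurnStore:
    """In-memory stand-in for a durable store keyed by request_id (hypothetical)."""
    results: dict[str, str] = field(default_factory=dict)
    charges: int = 0

    def handle_turn(self, request_id: str, user_text: str) -> str:
        # A retry with a known request_id returns the stored reply
        # instead of running (and billing) inference a second time.
        if request_id in self.results:
            return self.results[request_id]
        reply = f"echo: {user_text}"  # placeholder for the real model call
        self.charges += 1
        self.results[request_id] = reply
        return reply
```

In a real deployment the lookup and the write would have to be atomic (for example a unique constraint on request_id), otherwise two concurrent retries could both miss the cache.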

Tool loops are bounded. Orchestration caps steps (for example max 10 tool rounds) so a confused model cannot fork infinite jobs.
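
The cap can be sketched as a plain loop with a round budget. The reply shape used here (a dict with `tool`, `arguments`, and `content` keys) and the `run_turn` function are assumptions made for the example, not a real SDK contract.

```python
from typing import Callable

MAX_TOOL_ROUNDS = 10  # the example cap from the text; tune per product


def run_turn(model_step: Callable[[list[dict]], dict],
             tools: dict[str, Callable[[str], str]],
             messages: list[dict],
             max_rounds: int = MAX_TOOL_ROUNDS) -> dict:
    """Call the model until it stops asking for tools or the budget runs out."""
    for _ in range(max_rounds):
        reply = model_step(messages)
        if reply.get("tool") is None:
            return {"status": "done", "content": reply["content"]}
        # Execute the requested tool and feed its result back to the model.
        result = tools[reply["tool"]](reply["arguments"])
        messages.append({"role": "tool", "name": reply["tool"], "content": result})
    # Budget exhausted: stop instead of looping forever.
    return {"status": "tool_budget_exhausted", "content": None}
```

Returning an explicit status lets the caller decide whether to apologize to the user, retry with a larger model, or log the conversation for review.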

Out of scope today (say it aloud). Training the foundation model from scratch, federated on-device training, or real-time video generation—park them so the design stays focused.

Step 2 — Non-functional requirements (engineering promises)

| Category | Target (typical) | How we meet it | If we miss it |
| --- | --- | --- | --- |
| Latency (TTFT) | p95 < 1–3 s to first token (model dependent) | Warm GPU pools, short queues, prompt compression | Feels “broken” even if total answer is fine |
| Latency (streaming) | Steady token cadence (no 5 s gaps) | SSE flush, backpressure, avoid blocking on tools mid-stream | Users abandon mid-answer |
| Availability | 99.9%+ for chat API monthly | Multi-region gateways, model fallbacks | Global outage headlines |
| Durability | Conversation log rarely lost | Write assistant draft to DB before closing stream | Trust collapse (“it forgot our chat”) |
| Privacy | TLS in transit; retention policy; enterprise isolation | Encryption, tenant-scoped indexes, deletion jobs | Regulatory and PR risk |
| Cost | Revenue per user > inference + storage | Routing to smaller models, caching, batching | Unsustainable burn |
| Safety | Block policy violations before and after generation | Input classifiers + output filters + abuse rate limits | Harmful content at scale |

Key idea: Time-to-first-token is the UX metric users feel; tokens-per-second is the throughput metric finance feels. Optimize both, not one alone.

Step 3 — Napkin math (why one GPU is not enough)

Round numbers. Multiply in the open—you are showing magnitude, not audited financials.

Storage is cheaper than inference but not free: 500 bytes–2 KB metadata per message row × billions of rows → sharded SQL or NoSQL, tiered to cold storage for old threads.
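
One way to multiply in the open is a few lines of arithmetic. Every input below is a made-up round number chosen for illustration, not a measured figure from any product. The sketch also treats prompt and completion tokens as equally expensive, which is a deliberate simplification.

```python
# All inputs are hypothetical round numbers for illustration only.
DAU = 10_000_000
MESSAGES_PER_USER_PER_DAY = 10
TOKENS_PER_TURN = 1_500          # prompt + completion, assumed
PEAK_TO_AVERAGE = 3              # assumed traffic peak factor
TOKENS_PER_SEC_PER_GPU = 2_500   # assumed aggregate throughput under batching
BYTES_PER_MESSAGE_ROW = 1_000    # inside the 500 B to 2 KB range above
SECONDS_PER_DAY = 86_400

messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY
tokens_per_day = messages_per_day * TOKENS_PER_TURN
avg_tokens_per_sec = tokens_per_day / SECONDS_PER_DAY
peak_tokens_per_sec = avg_tokens_per_sec * PEAK_TO_AVERAGE
gpus_at_peak = peak_tokens_per_sec / TOKENS_PER_SEC_PER_GPU
storage_gb_per_day = messages_per_day * BYTES_PER_MESSAGE_ROW / 1e9

print(f"messages/day:    {messages_per_day:,}")
print(f"peak tokens/s:   {peak_tokens_per_sec:,.0f}")
print(f"GPUs at peak:    {gpus_at_peak:,.0f}")
print(f"storage GB/day:  {storage_gb_per_day:,.0f}")
```

With these inputs the peak lands at roughly two thousand GPUs and about 100 GB of message rows per day. The exact values matter less than the order of magnitude.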

Step 4 — Architecture: three planes

Draw clients on the left, stateless API replicas in the middle, GPU inference on the right. Under orchestration: Postgres (or similar) for transcripts, Redis for session hot state, object storage for uploads, vector DB for RAG. The inference plane is reached only through a gateway that owns keys, routing, retries, and metering.

flowchart TB
  subgraph clients [Clients]
    WEB[Web]
    IOS[iOS / Android]
    API[Third-party API]
  end
  subgraph edge [Orchestration]
    LB[Load balancer]
    CHAT[Chat BFF]
    ORCH[Orchestrator]
    MOD[Safety filters]
  end
  subgraph data [Data]
    PG[(Conversation DB)]
    R[("Redis hot context")]
    S3[(File uploads)]
    VEC[(Embeddings index)]
  end
  subgraph infer [Inference]
    GW[LLM gateway]
    PREFILL[Prefill workers]
    DECODE[Decode pool]
  end
  WEB --> LB
  IOS --> LB
  API --> LB
  LB --> CHAT --> ORCH
  ORCH --> MOD
  ORCH --> PG
  ORCH --> R
  ORCH --> S3
  ORCH --> VEC
  ORCH --> GW
  GW --> PREFILL
  GW --> DECODE
    

Step 5 — Walk one chat turn end to end

Follow a single user message through the system. Names are illustrative.

  1. Client sends POST /v1/chat/completions with stream: true, conversation_id, and the new user message.
  2. BFF validates JWT, checks subscription quota, loads rate-limit token bucket.
  3. Orchestrator fetches last N turns from Redis; on miss, rebuilds from Postgres. Optionally retrieves RAG chunks from the vector index.
  4. Prompt builder merges system policy, tool definitions, summaries of older turns, and the fresh user text into a token budget under the model limit.
  5. Safety — input runs classifiers (policy, PII, jailbreak). Hard block returns a refusal without burning a large GPU job.
  6. Gateway picks model route (for example gpt-4o vs gpt-4o-mini), enqueues on the inference scheduler.
  7. GPU worker runs prefill (process entire prompt in parallel) then decode (autoregressive tokens). The KV cache stores attention keys/values for tokens already processed, so each new token reuses them instead of recomputing the full prompt.
  8. Stream emits SSE events: delta.content chunks until finish_reason: stop or tool_calls.
  9. Persist appends user + assistant rows to Postgres; updates Redis; emits usage record (input/output tokens) for billing.
  10. Safety — output may run async scans on the completed answer; revoke or flag if a late violation is detected.

stateDiagram-v2
  [*] --> Auth
  Auth --> LoadHistory: OK
  LoadHistory --> BuildPrompt
  BuildPrompt --> InputSafety
  InputSafety --> Blocked: violation
  InputSafety --> Inference: pass
  Inference --> Streaming
  Streaming --> ToolLoop: tool_calls
  ToolLoop --> BuildPrompt
  Streaming --> Persist: stop
  Persist --> [*]
  Blocked --> [*]
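
Step 4 of the walk-through, fitting the prompt into a token budget, can be sketched as follows. The whitespace word count is a crude stand-in for a real tokenizer, and `build_prompt` with its parameters is a hypothetical helper, not the product's actual prompt builder.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def build_prompt(system: str, summary: str, turns: list[str],
                 user_text: str, budget: int) -> list[str]:
    """Always keep system text, summary, and the new message;
    add recent turns newest-first for as long as they fit."""
    used = sum(count_tokens(part) for part in (system, summary, user_text))
    if used > budget:
        raise ValueError("fixed parts alone exceed the token budget")
    kept: list[str] = []
    for turn in reversed(turns):        # newest first
        cost = count_tokens(turn)
        if used + cost > budget:
            break                       # older turns are covered by the summary
        kept.append(turn)
        used += cost
    kept.reverse()                      # restore chronological order
    return [system, summary, *kept, user_text]
```

Dropping the oldest turns first works only because the summary carries their gist. Without a summary, the model would lose earlier context outright.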
    

Step 6 — Context, memory, and summarization

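Models have a fixed context window (for example 128k tokens). You cannot paste a ten-year transcript every turn. Production systems use a layered memory strategy: the most recent turns verbatim, summaries of older turns, and retrieved chunks or saved user memories injected only when they are relevant.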

Concurrency: two tabs appending to the same conversation_id need monotonic message_seq (compare-and-swap or DB transaction) so ordering stays coherent.
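
A minimal sketch of the ordering rule, using SQLite from the Python standard library as a stand-in for the production database. The table layout and the `append_message` helper are assumptions for illustration. The primary key on (conversation_id, seq) does the real work: a writer that loses the race gets an integrity error and retries with a fresh sequence number.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        " conversation_id TEXT NOT NULL,"
        " seq INTEGER NOT NULL,"
        " role TEXT NOT NULL,"
        " content TEXT NOT NULL,"
        " PRIMARY KEY (conversation_id, seq))"
    )
    return conn


def append_message(conn: sqlite3.Connection, conversation_id: str,
                   role: str, content: str, max_attempts: int = 5) -> int:
    """Append a message with the next sequence number; retry if another writer won."""
    for _ in range(max_attempts):
        row = conn.execute(
            "SELECT COALESCE(MAX(seq), 0) FROM messages WHERE conversation_id = ?",
            (conversation_id,),
        ).fetchone()
        next_seq = row[0] + 1
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO messages VALUES (?, ?, ?, ?)",
                    (conversation_id, next_seq, role, content),
                )
            return next_seq
        except sqlite3.IntegrityError:
            continue  # seq already taken by a concurrent writer; re-read and retry
    raise RuntimeError("could not allocate a sequence number")
```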

Step 7 — Inference serving: prefill, decode, KV cache

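Transformer inference splits into two phases with different hardware appetites: prefill processes the whole prompt in parallel and is typically compute-bound, while decode generates one token at a time and is typically limited by memory bandwidth.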

KV cache stores per-layer key/value tensors for prior tokens so decode steps avoid recomputing attention over the full prefix. Memory grows with sequence length × batch size—long contexts and large batches are why HBM size matters.
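
The memory growth can be estimated with the standard sizing formula: two tensors (keys and values) per layer, each holding kv_heads × head_dim values per token. The model dimensions in the usage lines are hypothetical and do not describe any specific production model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size. bytes_per_value=2 assumes fp16 or bf16 storage."""
    keys_and_values = 2
    return (keys_and_values * layers * kv_heads * head_dim
            * bytes_per_value * seq_len * batch)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, fp16.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch=1)
full_batch = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=16)
print(f"per token: {per_token / 1024:.0f} KiB")
print(f"16 requests x 8192 tokens: {full_batch / 2**30:.0f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, and sixteen concurrent 8k-token requests need 16 GiB before counting model weights.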

Continuous batching (vLLM, TensorRT-LLM, similar) dynamically adds and removes requests in a batch instead of waiting for every stream to finish—raising GPU utilization without sacrificing streaming UX.
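
A toy simulation shows why refilling freed slots helps. It counts decode steps only and ignores prefill and KV memory limits, so it illustrates the scheduling idea rather than how vLLM or TensorRT-LLM actually schedule. Both function names are hypothetical.

```python
from collections import deque


def continuous_batching(lengths: list[int], slots: int) -> int:
    """Decode steps needed when a freed slot is refilled on the next step."""
    waiting = deque(lengths)
    running: list[int] = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())        # admit into free slots
        # Every running request emits one token; finished ones leave the batch.
        running = [left - 1 for left in running if left > 1]
        steps += 1
    return steps


def static_batching(lengths: list[int], slots: int) -> int:
    """Decode steps needed when each batch waits for its longest request."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


lengths = [6, 1, 1, 1, 1, 1]  # output tokens per request, each at least 1
print(continuous_batching(lengths, slots=2), static_batching(lengths, slots=2))
```

For this workload the continuous scheduler finishes in 6 steps while the static one needs 8, because short requests no longer wait behind the long one.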

Sanity check: If p95 TTFT spikes but decode tokens/s is flat, suspect queueing or prefill saturation—not “the model got dumber.”

Step 8 — Model gateway and routing

Every internal service should talk to a gateway, not directly to fifteen vendor endpoints. The gateway owns keys, routing, retries, and metering.

Speculative routing: try a small model first for classification or a draft, and escalate to a larger model only when its confidence is low. Easy prompts get cheaper answers at the price of a small quality risk.
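
The escalation rule fits in a few lines. The `Draft` type, the `route` function, and the 0.8 threshold are illustrative assumptions. How a small model reports confidence is left open here and is a hard problem in practice.

```python
from typing import Callable, NamedTuple


class Draft(NamedTuple):
    text: str
    confidence: float  # assumed to lie in [0, 1]


def route(prompt: str,
          small: Callable[[str], Draft],
          large: Callable[[str], str],
          threshold: float = 0.8) -> tuple[str, str]:
    """Return (model_used, answer); call the large model only when needed."""
    draft = small(prompt)
    if draft.confidence >= threshold:
        return ("small", draft.text)
    return ("large", large(prompt))
```

The threshold is the cost dial: raising it sends more traffic to the expensive model, lowering it accepts more small-model answers.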

Step 9 — RAG, tools, and agent loops

RAG (retrieval-augmented generation) grounds answers in your documents: ingest → chunk → embed → index → retrieve top-k at query time → inject into prompt. The data platform (ingestion workers, embedding jobs) is off the hot path but must stay fresh; stale indexes silently lie.
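
The retrieve-and-inject half of that pipeline can be sketched with plain lists. The three-dimensional vectors are toy values standing in for real embeddings, and a production system would use an approximate nearest-neighbor index instead of scoring every chunk. `top_k` and `inject_context` are hypothetical helper names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float],
          index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunk texts most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def inject_context(question: str, chunks: list[str]) -> str:
    """Place retrieved chunks ahead of the question in the prompt."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Use only this context:\n{context}\n\nQuestion: {question}"


index = [
    ("refund policy", [1.0, 0.0, 0.0]),
    ("shipping times", [0.0, 1.0, 0.0]),
    ("refund form", [0.9, 0.1, 0.0]),
]
print(inject_context("How do refunds work?", top_k([1.0, 0.0, 0.0], index)))
```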

Tools expose structured actions (search web, run Python, call CRM). The model emits tool_calls; orchestration executes them and feeds the results back as tool-role messages, which triggers another prefill/decode cycle.

| Pattern | When to use | Cost driver |
| --- | --- | --- |
| Single-shot chat | FAQ, drafting | One prefill + decode |
| RAG | Enterprise docs, support | Embedding search + longer prompt |
| Agent loop | Multi-step research | N × (tool latency + inference) |

Step 10 — Streaming on the wire (SSE)

Chat UIs use Server-Sent Events over HTTP: the client holds a long-lived response; the server pushes lines like data: {...}. Each chunk carries a JSON fragment with choices[0].delta.content. The last chunk sets finish_reason, and a final data: [DONE] line marks the end of the stream.

Backpressure: if the client renders slowly, TCP buffers fill; the server should pause pulling from the GPU stream to avoid unbounded memory. Heartbeat comments (: ping) keep intermediaries from closing idle connections.

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
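
A client-side reader for a stream shaped like the one above might look like this. It assumes one JSON object per data: line and is a sketch of the parsing logic only, with no networking or reconnect handling. `collect_stream` is a hypothetical helper name.

```python
import json
from typing import Iterable, Optional


def collect_stream(lines: Iterable[str]) -> tuple[str, Optional[str]]:
    """Join delta.content fragments and return (text, finish_reason)."""
    parts: list[str] = []
    finish_reason: Optional[str] = None
    for line in lines:
        if not line.startswith("data:"):
            continue  # skips blank separators and ": ping" heartbeat comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        parts.append(choice["delta"].get("content") or "")
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason
```

A UI would render each fragment as it arrives instead of joining at the end. The accumulation here just makes the result easy to inspect.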

Step 11 — Safety and moderation pipeline

Safety is a pipeline, not a single model card disclaimer:

  1. Input filters — jailbreak patterns, CSAM hashes, malware in uploads.
  2. Policy in system prompt — refusals for disallowed advice categories.
  3. Output filters — classifiers on completed or partial text; truncate stream on violation.
  4. Abuse controls — IP/device rate limits, CAPTCHA ladders, account bans.
  5. Human review queue — sampled conversations for regression testing after model updates.

Log decision codes (blocked_input, blocked_output, allowed) with trace ids—not raw secrets—for audit without storing full prompts where policy forbids it.

Step 12 — Conversation storage and consistency

| Data | Store | Access pattern |
| --- | --- | --- |
| Transcript rows | Sharded SQL / NoSQL | Append by conversation_id; paginate history |
| Hot context | Redis | GET last K messages; TTL on idle chats |
| Uploads | Object storage | Large blobs; virus scan async |
| Embeddings | Vector DB | ANN search by user/tenant id filter |
| Usage / billing | Columnar warehouse | Append-only events from gateway |

Invariants: (conversation_id, message_seq) unique; request_id dedupes retries; soft-delete conversations with tombstones so backups and search indexes converge.

Step 13 — Multimodal side paths

Images, audio, and generated pictures are side doors—do not run them through the same hot path as a one-line text reply unless you accept very different SLOs.

flowchart LR
  TEXT[Text chat] --> ORCH[Orchestrator]
  IMG[Image upload] --> ENC[Vision encoder] --> ORCH
  MIC[Voice] --> ASR[ASR service] --> ORCH
  ORCH --> GW[Text LLM gateway]
    

Step 14 — Scale: queues, rate limits, and fleet growth

Step 15 — Technical layer: APIs and payloads

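Many public products expose an OpenAI-compatible Chat Completions surface. Internal traffic is often gRPC between BFF, orchestrator, and gateway. The conversation_id field in the request below is an illustrative extension, not part of that public surface.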

| Operation | HTTP | Success | Notes |
| --- | --- | --- | --- |
| Create completion | POST /v1/chat/completions | 200 (stream or JSON body) | Authorization: Bearer; body includes messages[], model, stream |
| List models | GET /v1/models | 200 | Capability discovery for clients |
| Upload file (RAG) | POST /v1/files | 200 + file_id | Async processing before attachable to assistant |
| Rate limited | any | 429 | Headers x-ratelimit-remaining, retry-after |
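
The 429 row can be backed by a token bucket. This is a single-process sketch that takes the clock as an argument. A global limiter would keep the bucket in shared storage such as Redis, which is not shown. The `TokenBucket` class and its field names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float
    refill_per_sec: float
    tokens: float
    updated_at: float

    def try_acquire(self, now: float) -> tuple[int, dict[str, str]]:
        """Return (HTTP status, rate-limit headers) for one request at time `now`."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 200, {"x-ratelimit-remaining": str(int(self.tokens))}
        # Seconds until one full token is available, rounded up.
        wait = math.ceil((1 - self.tokens) / self.refill_per_sec)
        return 429, {"x-ratelimit-remaining": "0", "retry-after": str(wait)}
```

Passing `now` explicitly keeps the logic deterministic in tests. A caller in production would pass `time.monotonic()`.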

Non-streaming request (illustrative):

POST /v1/chat/completions HTTP/1.1
Host: api.example.com
Authorization: Bearer sk-…
Content-Type: application/json
X-Request-Id: 7f3c2a1b-…

{
  "model": "gpt-4o",
  "conversation_id": "conv_01HQ…",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain KV cache in two paragraphs."}
  ],
  "stream": false,
  "max_tokens": 512,
  "temperature": 0.7
}

Tool call fragment in a stream chunk:

data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"id":"call_abc","type":"function","function":{"name":"search_web","arguments":"{\"q\":\""}}]}}]}
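
Tool call arguments arrive as string fragments that only form valid JSON once concatenated. The sketch below assumes later chunks repeat the call's index and carry only more function.arguments text, which matches the fragment above but should be checked against the API actually in use. `assemble_tool_calls` is a hypothetical helper name.

```python
import json


def assemble_tool_calls(deltas: list[dict]) -> list[dict]:
    """Merge streamed tool_calls fragments by index and parse the arguments."""
    calls: dict[int, dict] = {}
    for delta in deltas:
        for frag in delta.get("tool_calls", []):
            call = calls.setdefault(
                frag["index"], {"id": None, "name": None, "arguments": ""}
            )
            if frag.get("id"):
                call["id"] = frag["id"]
            function = frag.get("function", {})
            if function.get("name"):
                call["name"] = function["name"]
            call["arguments"] += function.get("arguments", "")
    # Parse only after the stream ends; partial fragments are not valid JSON.
    return [
        {**calls[i], "arguments": json.loads(calls[i]["arguments"])}
        for i in sorted(calls)
    ]
```

Parsing is deferred to the end because a fragment such as {"q":" is not valid JSON on its own.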

Core tables (logical schema)

users(id, plan, created_at)
conversations(id, user_id, title, created_at, deleted_at)
messages(id, conversation_id, seq, role, content_json, token_count, created_at)
usage_events(id, user_id, model, input_tokens, output_tokens, request_id, ts)
files(id, user_id, object_key, status, sha256)

Step 16 — Reliability, observability, and cost

Reliability

Observability

Cost controls

Step 17 — Goals → knobs (quick reference)

| Goal | Knob |
| --- | --- |
| Answers feel instant | Smaller default model, regional GPUs, prompt compression, cap tools |
| High quality on hard tasks | Route to larger model; allow longer decode; multi-step agent with budget |
| Stays profitable | Token metering, batching, distillation, aggressive caching of RAG |
| Grounded in company data | Fresh RAG index, citations in UI, permission filters on retrieval |
| Safe at scale | Layered moderation, rate limits, human eval on model version bumps |

Step 18 — Close the loop (what to practice)

On a whiteboard: three planes, one chat turn, label Postgres vs Redis vs GPU gateway on each step.

Out loud: five functional requirements and which non-functional target applies to orchestration vs inference.

With the technical section: trace POST /v1/chat/completions with stream: true through SSE to persistence.

The one line to remember

ChatGPT-class products are three systems behind one text box: orchestration (who you are, what you said, what tools apply), inference (GPUs turning tokens into language), and experience (streaming UI that hides the machinery). Draw the plane boundaries first and the rest of the design stays teachable at global scale.