Real-time RAG with sub-200ms retrieval at 50M chunks

Users want Google-like speed while you search 50 million chunks. The constraint is usually retrieval p95 < 200ms—not the full LLM answer (that streams afterward). This guide walks every decision: vector store, index geometry, caches, async layout, and where you sacrifice recall for latency.

Scenario

Design a real-time RAG system with sub-200ms response time.

Your retrieval corpus has 50 million chunks. Users expect Google-like response speed. Walk through every architectural decision—vector DB choice, indexing strategy, caching layers, async patterns, and where you’d trade recall for latency.

What you should be able to do after reading:

Split the SLA into a latency budget and defend what fits in 200ms.
Pick index types and parameters (HNSW, IVF, PQ) for 50M vectors without hand-waving.
Layer caches so hot queries never touch cold disk.
Explain two-stage retrieval and when you drop hybrid search or reranking.

Step 0 — Define what “sub-200ms” means

Clarify with stakeholders before sizing hardware:

Metric	Typical target	Includes
Retrieval SLA	p95 < 200ms	Embed query + ANN search + fetch chunk metadata (no LLM)
Time to first token	400ms–1.5s	Retrieval + prompt build + LLM cold start
Full answer	2–8s streamed	Not part of the 200ms bar

Phrase that lands well: “Google-like” means results feel instant—you return ranked chunk ids in 200ms and stream the narrative while the user is already reading snippets.

Step 1 — Clarifying questions

Question	Drives
QPS and burst factor?	Shard count, replica fanout
Filters (tenant, ACL, time)?	Pre-filter vs post-filter latency cliff
Embedding dim and model?	768 vs 1536 changes RAM by 2×
Must hybrid lexical run in 200ms?	Often no—vector-only fast path + async lexical merge
Acceptable recall@10?	Sets HNSW `ef_search` and whether you use PQ
Multi-region?	Read replicas per region vs single global index

Step 2 — The sixty-second answer

Shard 50M chunks into ~25–50 vector partitions (1–2M vectors per shard), each with an in-memory HNSW index (or GPU ANN), and keep scalar- or PQ-compressed vectors for the cold tail. Query path: embed the query or hit the embedding cache (5–15ms) → parallel coarse search on all shards (40–80ms) → merge top 100 (5ms) → optional exact re-rank on 20 ids (30–50ms) → return chunk ids. Lexical search and cross-encoder rerank run async or behind a “deep search” button, not on the hot path.

Caching: normalized query → embedding; popular queries → result set; hot shards pinned in RAM. Recall trades: lower ef_search, IVF probing, skip hybrid on default, two-stage retrieve, smaller k.

Step 3 — Napkin math (50M chunks)

50M × 768-dim × 4 bytes ≈ 150 GB raw vectors (1536-dim → ~300 GB).
HNSW graph overhead ~1.2–1.5× → plan 200–450 GB working set if fully in RAM—or PQ + on-disk with RAM for hot centroids only.
At 1k QPS, a naive exact scan would need 1,000 × 50M = 5 × 10^10 distance computations per second. Use an ANN index and shard so each query fans out in parallel to partitions of ~2M vectors instead of scanning 50M serially.
200ms budget example: embed 15ms + ANN 90ms + merge 10ms + metadata 25ms + margin 60ms.
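
The sizing arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function names are illustrative, and the 1.2–1.5× overhead is the planning factor quoted in the text rather than a measured value.

```python
# Back-of-envelope sizing for the 50M-chunk corpus.
N_CHUNKS = 50_000_000
BYTES_PER_FLOAT32 = 4
GB = 10**9  # decimal gigabytes, matching the rounded figures in the text


def raw_vector_gb(n_vectors: int, dim: int) -> float:
    """Size of uncompressed float32 vectors, in GB."""
    return n_vectors * dim * BYTES_PER_FLOAT32 / GB


def working_set_gb(n_vectors: int, dim: int, overhead: float) -> float:
    """Raw vector size scaled by an assumed index overhead factor."""
    return raw_vector_gb(n_vectors, dim) * overhead


def brute_force_distances_per_sec(qps: int, n_vectors: int) -> int:
    """Distance computations per second if every query scanned every vector."""
    return qps * n_vectors


if __name__ == "__main__":
    for dim in (768, 1536):
        raw = raw_vector_gb(N_CHUNKS, dim)
        low = working_set_gb(N_CHUNKS, dim, 1.2)
        high = working_set_gb(N_CHUNKS, dim, 1.5)
        print(f"dim={dim}: raw {raw:.1f} GB, with overhead {low:.0f}-{high:.0f} GB")
    naive = brute_force_distances_per_sec(1_000, N_CHUNKS)
    print(f"naive scan at 1k QPS: {naive:.1e} distances/sec")
```

Swapping in your own dimension or overhead factor is usually enough to decide whether a fully in-RAM index is realistic.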

Step 4 — Latency budget (defend every millisecond)

Stage	Budget	Notes
API + auth	5–10ms	Edge JWT validation; no DB hit
Query normalize + cache lookup	2–5ms	Lowercase, strip, hash
Query embedding	5–20ms	GPU batch microservice; cache hit → 0ms
ANN search (all shards parallel)	60–120ms	p95 driver; tune ef_search
Cross-shard merge	5–15ms	Heap merge top-k
Fetch chunk metadata	15–40ms	Redis/doc store by id batch get
Optional exact re-score	0–50ms	Only top 20–40 with full vectors
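
One way to keep the table honest is to sum it. The sketch below copies the ranges from the table (the dictionary keys are made-up names) and shows that the lower bounds add up to 92ms while the upper bounds add up to 260ms, so not every stage can sit at the top of its range in the same request.

```python
SLA_MS = 200

# (low, high) budgets in milliseconds, copied from the table above.
STAGE_BUDGET_MS = {
    "api_auth": (5, 10),
    "normalize_cache_lookup": (2, 5),
    "query_embedding": (5, 20),
    "ann_search": (60, 120),
    "cross_shard_merge": (5, 15),
    "fetch_metadata": (15, 40),
    "exact_rescore": (0, 50),
}


def budget_range(stages: dict) -> tuple:
    """Sum of the lower bounds and sum of the upper bounds across stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def over_budget(measured_ms: dict, stages: dict) -> list:
    """Names of stages whose measured latency exceeds their upper budget."""
    return sorted(name for name, ms in measured_ms.items() if ms > stages[name][1])


if __name__ == "__main__":
    low, high = budget_range(STAGE_BUDGET_MS)
    print(f"best case {low}ms, worst case {high}ms, SLA {SLA_MS}ms")
    print(over_budget({"ann_search": 140, "query_embedding": 12}, STAGE_BUDGET_MS))
```

The optional re-score is the natural stage to drop first when a request is already running late.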

Step 5 — End-to-end architecture

```mermaid
flowchart LR
  Q[User query] --> GW[API gateway]
  GW --> QC[Query cache]
  QC --> EMB[Embed service]
  EMB --> RT[Retrieval orchestrator]
  subgraph shards [Sharded ANN - parallel]
    S1[Shard 1 HNSW]
    S2[Shard 2 HNSW]
    SN[Shard N HNSW]
  end
  RT --> S1
  RT --> S2
  RT --> SN
  S1 --> MERGE[Top-k merge]
  S2 --> MERGE
  SN --> MERGE
  MERGE --> META[Chunk metadata Redis]
  META --> FAST[Sub-200ms response]
  MERGE -. optional .-> DEEP[Async deep path]
  DEEP --> LEX[Lexical index]
  DEEP --> RER[Cross-encoder]
  FAST --> UI[UI snippets]
  RER --> LLM[LLM stream - async]
```

Step 6 — Vector DB choice (at 50M scale)

No single “best” product—pick against ops model and filter needs:

Option	Strengths at 50M	Watch-outs
Milvus / Zilliz	Sharding, IVF+HNSW, disk index, filters	Tune collection params; ops learning curve
Qdrant	Fast filtered HNSW, good DX, quantization	Plan cluster RAM for hot collections
Weaviate	Hybrid modules, multi-tenancy	Hybrid on hot path can blow 200ms—split paths
Pinecone / managed	Low ops, pod sizing	Cost at 50M×1536; vendor lock-in
Self-hosted HNSW (Faiss/ScaNN)	Maximum control, lowest $ at scale	You own sharding, replication, upgrades
pgvector	Simple if corpus is small	Poor default for 50M at <200ms—use as metadata sidecar only

Recommendation to state in a design session: managed or Milvus/Qdrant cluster with explicit sharding by tenant or content hash, vectors on fast storage, metadata in Redis/Postgres.

Step 7 — Indexing strategy

Sharding

Shard key: hash(chunk_id) mod N for an even spread, or tenant_id so that tenant-filtered queries hit 1–3 shards instead of all 50.
Target 1–3M vectors per shard for stable HNSW latency.
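
Both shard-key options can be sketched with a stable hash. The function names and the `shards_per_tenant` parameter are assumptions for illustration. `hashlib` is used because Python's built-in `hash()` for strings is salted per process and would route the same id differently after a restart.

```python
import hashlib


def stable_bucket(key: str, num_shards: int) -> int:
    """Map a string key to a shard index that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_for_chunk(chunk_id: str, num_shards: int) -> int:
    """Even spread: every chunk goes to hash(chunk_id) mod N."""
    return stable_bucket(chunk_id, num_shards)


def shards_for_tenant(tenant_id: str, num_shards: int, shards_per_tenant: int = 2) -> list:
    """Tenant locality: a tenant's chunks live on a small, fixed set of shards."""
    first = stable_bucket(tenant_id, num_shards)
    return [(first + i) % num_shards for i in range(shards_per_tenant)]


if __name__ == "__main__":
    print(shard_for_chunk("doc-123#chunk-7", 25))
    print(shards_for_tenant("tenant-acme", 25, shards_per_tenant=3))
```

With chunk hashing every query fans out to all shards. With tenant keying the orchestrator only calls the shards returned by `shards_for_tenant`, at the cost of uneven shard sizes when tenants differ a lot in volume.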

Index types

Index	When	Latency	Recall
HNSW	Default hot path	Lowest at fixed RAM	High if ef_search tuned
IVF + PQ	Cold/archive tiers	Very fast, less RAM	Lower—use for stage-1 coarse
DiskANN / on-disk HNSW	Corpus too big for RAM	Moderate	Good with NVMe
Brute force on 20 ids	After ANN shortlist	Cheap exact rerank	Recovers PQ loss

Parameters you name explicitly

HNSW: M=16–32, ef_construction=200, ef_search=64–128 (tune down for speed)
IVF: nlist ≈ sqrt(N_shard), nprobe=8–16 (lower nprobe = faster)
PQ: m=48 subquantizers for 768-dim (stage-1 only)
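
The IVF and PQ numbers above follow from simple arithmetic. The sketch below assumes 8 bits per PQ code (256 centroids per subquantizer), which the text does not state; adjust `bits_per_code` if your engine differs.

```python
import math


def ivf_nlist(n_shard_vectors: int) -> int:
    """Rule of thumb from the text: nlist is roughly sqrt(N_shard)."""
    return round(math.sqrt(n_shard_vectors))


def pq_code_bytes(m: int, bits_per_code: int = 8) -> int:
    """Bytes stored per vector after product quantization."""
    return m * bits_per_code // 8


def pq_compression_ratio(dim: int, m: int, bits_per_code: int = 8) -> float:
    """How much smaller a PQ code is than the float32 vector it replaces."""
    if dim % m != 0:
        raise ValueError("dim must be divisible by the number of subquantizers")
    return dim * 4 / pq_code_bytes(m, bits_per_code)


if __name__ == "__main__":
    print("nlist for a 2M-vector shard:", ivf_nlist(2_000_000))
    print("PQ bytes per vector (m=48):", pq_code_bytes(48))
    print("compression vs float32 at 768-dim:", pq_compression_ratio(768, 48))
```

At m=48 each subquantizer covers 16 of the 768 dimensions, which is why the compressed tier is cheap to hold in RAM but needs the exact re-score to recover ordering.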

Embeddings offline

Precompute all 50M embeddings in batch; never embed the corpus at query time. Version embeddings; blue/green collection per model version.

Step 8 — Caching layers (biggest lever for “Google-like”)

Layer	Key	TTL	Hit effect
L1 — Query result	hash(normalized query + filters + corpus_version)	5–60 min	Full path < 5ms
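L2 — Query embedding	hash(normalized query + embedding model version)	hours	Skip embed GPU (15ms saved)
L3 — Hot shard graphs	pin in RAM / GPU	n/a	Avoid disk ANN tail latency
L4 — Chunk metadata	chunk_id	long	Batch MGET 20 keys < 3ms in Redis
L5 — Suggest / prefix	typed prefix	short	Feels instant like Google Suggest

Invalidate L1 on a corpus_version bump, not on wall clock alone. L2 embeddings depend only on the query text and the embedding model, so they need invalidating only when the model version changes.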
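
A possible shape for the L1 result-cache key is shown below. The normalization rules (lowercase, trim, collapse whitespace) and the `l1:` prefix are assumptions for illustration; the point is that the corpus version is part of the key, so a version bump makes old entries unreachable without an explicit purge.

```python
import hashlib
import json
import re


def normalize_query(query: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", query.strip().lower())


def result_cache_key(query: str, filters: dict, corpus_version: str) -> str:
    """Deterministic key over normalized query, filters, and corpus version."""
    payload = json.dumps(
        {"q": normalize_query(query), "f": filters, "v": corpus_version},
        sort_keys=True,            # filter order must not change the key
        separators=(",", ":"),
    )
    return "l1:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = result_cache_key("  Reset   Password ", {"tenant": "acme"}, "v42")
    b = result_cache_key("reset password", {"tenant": "acme"}, "v42")
    c = result_cache_key("reset password", {"tenant": "acme"}, "v43")
    print(a == b, a == c)
```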

Step 9 — Async patterns (what not to block on)

Parallel shard RPCs with 150ms deadline; partial results OK if SLA strict (degrade shard count).
Return fast path first: top 10 vector hits + titles/snippets to UI in <200ms.
Fork deep path: lexical BM25, cross-encoder rerank, LLM—client merges when ready (skeleton UI).
Embed service: continuous batching on GPU; separate pool from ANN nodes.
Prefetch: on keystroke pause, speculative embed + retrieve before Enter.
Write path async: new docs → queue → index; never block read path on ingest.
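
The first pattern, parallel shard calls with a deadline and partial results, might look like the following with `asyncio`. The `search_shard` coroutine is a stand-in that sleeps instead of calling a real ANN node, and hits are hypothetical `(score, chunk_id)` tuples.

```python
import asyncio
import heapq


async def search_shard(shard_id: int, delay_s: float, hits: list) -> list:
    """Stand-in for a shard RPC: waits, then returns (score, chunk_id) hits."""
    await asyncio.sleep(delay_s)
    return hits


async def fan_out(shard_calls: list, deadline_s: float, k: int) -> tuple:
    """Run all shard calls in parallel; merge whatever finished by the deadline."""
    tasks = [asyncio.create_task(call) for call in shard_calls]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    for task in pending:
        task.cancel()  # do not let slow shards hold the request open
    await asyncio.gather(*pending, return_exceptions=True)

    hits = []
    for task in done:
        if task.exception() is None:  # a failed shard degrades recall, not the query
            hits.extend(task.result())
    top_k = heapq.nlargest(k, hits, key=lambda hit: hit[0])
    return top_k, len(pending)


async def demo() -> tuple:
    calls = [
        search_shard(0, 0.01, [(0.90, "a"), (0.20, "b")]),
        search_shard(1, 0.02, [(0.80, "c")]),
        search_shard(2, 1.00, [(0.99, "z")]),  # misses the 150ms deadline
    ]
    return await fan_out(calls, deadline_s=0.15, k=2)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Returning the count of missed shards alongside the hits lets the caller log degraded responses and decide whether to flag the result as partial.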

Step 10 — Trading recall for latency (be explicit)

Trade	Latency win	Recall cost	When acceptable
Lower `ef_search` (64 vs 256)	Large	Miss long-tail relevant chunks	Consumer search; head queries cached
IVF with low `nprobe`	Large	Coarse clusters wrong	Stage-1 only + exact rerank top 20
PQ compressed vectors	RAM / disk bandwidth	Distance error	With brute-force rerank on shortlist
Skip hybrid lexical on default	50–150ms	Miss exact SKU / error codes	Offer “Exact match” mode
Skip cross-encoder rerank	100–300ms	Wrong order in top 10	Snippet UI tolerates good-enough
Smaller k (10 vs 50)	Merge + metadata fetch	LLM context thinner	Fetch more only for LLM path async
Aggressive ACL pre-filter	Smaller search space	Complex filters slow if badly indexed	Partition by tenant at shard key
Approximate metadata	Skip store round-trip	Stale titles	Only for suggest, not compliance answers

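Two-stage pattern (say this clearly): Stage 1 = fast ANN that over-fetches a shortlist (recall@10 around 0.85 on its own). Stage 2 = exact dot product or a tiny reranker on about 30 shortlisted ids, which lifts recall@10 back toward 0.95 as long as the true neighbors made the shortlist. The total stays under 200ms only if stage 2 is bounded.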
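
Stage 2 is small enough to sketch in pure Python. The names are illustrative, and `full_vectors` stands for whatever store holds the uncompressed vectors for the shortlisted ids.

```python
def dot(a: list, b: list) -> float:
    """Plain dot product; assumes both vectors have the same length."""
    return sum(x * y for x, y in zip(a, b))


def exact_rerank(query_vec: list, shortlist_ids: list, full_vectors: dict, k: int) -> list:
    """Re-score a bounded shortlist with exact dot products and keep the top k."""
    scored = [(dot(query_vec, full_vectors[cid]), cid) for cid in shortlist_ids]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:k]]


if __name__ == "__main__":
    vectors = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.95, 0.0]}
    # Stage 1 returned the ids in approximate order; stage 2 fixes the ordering.
    print(exact_rerank([1.0, 0.0], ["b", "a", "c"], vectors, k=2))
```

The cost is one vector fetch and one dot product per shortlisted id, which is why capping the shortlist keeps the stage predictable.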

Step 11 — Filters and multi-tenancy without killing p95

Bad: retrieve top 500 globally, then filter to 10—wastes ANN and leaks titles in logs.
Good: shard by tenant_id; HNSW per tenant for enterprise; or bitmap filter inside engine if native.
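Post-filter only when the filter keeps most of the corpus, so a modest over-fetch still leaves k results. When the filter keeps only a small slice (say <5% of the corpus), pre-filter inside the engine or partition the index.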
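
The "retrieve top 500, then filter to 10" pattern on the first line can be quantified. Assuming matching chunks are spread evenly through the ranking (a simplification), the expected number of candidates needed for k survivors is k divided by the fraction of the corpus that passes the filter. The function name is illustrative.

```python
def post_filter_fetch_size(k: int, matching_chunks: int, total_chunks: int) -> int:
    """Expected ANN candidates needed so that about k survive a post-filter."""
    if not 0 < matching_chunks <= total_chunks:
        raise ValueError("matching_chunks must be in (0, total_chunks]")
    # Integer ceiling of k * total / matching, avoiding float rounding.
    return -(-k * total_chunks // matching_chunks)


if __name__ == "__main__":
    total = 50_000_000
    for matching in (25_000_000, 2_500_000, 1_000_000, 100_000):
        need = post_filter_fetch_size(10, matching, total)
        print(f"filter keeps {matching:>10,} chunks -> fetch about {need} candidates")
```

This is an average, so a real system needs extra headroom on top of it.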

Step 12 — Failure points

Failure	Symptom	Mitigation
Hot shard	One tenant spikes p95	Isolate noisy tenant; rate limit; dedicated replica
Cache stampede	Thundering herd on popular query	Single-flight; stale-while-revalidate
Stale cache after ingest	Wrong results feel “fast”	corpus_version in cache key; partial invalidation
Shard timeout	Partial recall drop	Merge available shards; alert; don’t block whole query
Embedding service backlog	200ms blown	Separate pool; fallback to smaller distilled embed model
Over-tuned for speed	Product complaints on miss	Offline recall@k dashboard; “deep search” path
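
The single-flight mitigation for cache stampedes can be sketched as follows. The `SingleFlight` class is a hypothetical helper, not a library API: concurrent callers asking for the same key share one in-flight computation instead of each hitting the ANN tier.

```python
import asyncio


class SingleFlight:
    """Collapse concurrent requests for the same key into one computation."""

    def __init__(self) -> None:
        self._inflight = {}

    async def do(self, key: str, fn):
        task = self._inflight.get(key)
        if task is None:
            # First caller starts the work; later callers reuse the same task.
            task = asyncio.ensure_future(fn())
            self._inflight[key] = task
            task.add_done_callback(lambda _t: self._inflight.pop(key, None))
        # shield() keeps one cancelled caller from cancelling the shared task.
        return await asyncio.shield(task)


async def demo() -> tuple:
    calls = []

    async def expensive_retrieval() -> str:
        calls.append(1)
        await asyncio.sleep(0.01)  # stands in for embed + ANN + metadata fetch
        return "result"

    flight = SingleFlight()
    results = await asyncio.gather(
        *(flight.do("popular query", expensive_retrieval) for _ in range(5))
    )
    return results, len(calls)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

This only deduplicates within one process. Across replicas the same idea needs a shared lock or a stale-while-revalidate policy in the cache itself.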

Step 13 — Observability

Per-stage histogram: embed, per-shard ANN, merge, metadata.
Recall@k sampled offline nightly with frozen query set.
Cache hit ratio L1/L2; cost per 1k queries.
Track slow query log (>200ms) with filter cardinality and ef_search used.
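
Two of these metrics are easy to compute offline. The sketch below uses the nearest-rank definition of a percentile and defines recall@k as the overlap between the ANN top k and the exact top k from a brute-force run; both definitions are choices made here, and other conventions exist.

```python
def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def recall_at_k(approx_ids: list, exact_ids: list, k: int) -> float:
    """Share of the exact top-k that the approximate top-k also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


if __name__ == "__main__":
    ann_latencies_ms = [62, 70, 71, 75, 80, 84, 90, 95, 110, 180]
    print("ANN p95:", percentile(ann_latencies_ms, 95), "ms")
    print("recall@5:", recall_at_k(["a", "b", "x", "d", "e"], ["a", "b", "c", "d", "e"], 5))
```

Running `recall_at_k` over the frozen query set after every index or parameter change is what catches a system that has been tuned too far toward speed.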

Step 14 — How to walk through this in a design session

2 min — define 200ms = retrieval only; show budget table.
5 min — 50M math → sharding mandatory.
8 min — diagram: parallel shards, caches, async deep path.
10 min — index choice + ef_search / two-stage recall recovery.
5 min — caching layers L1–L5.
5 min — recall vs latency table + when hybrid is off hot path.
Close — “Fast feels instant; perfect is async.”

Step 15 — Goals → knobs

Goal	Knob
Lower p95	↓ ef_search, ↓ nprobe, more caching, fewer shards per query
Higher recall	↑ ef_search, two-stage exact rerank, hybrid deep path
Lower cost	PQ + disk tier; smaller embed dim; fewer RAM replicas
Higher QPS	Read replicas, embed batching, result cache

The one line to remember

Sub-200ms RAG at 50M chunks is a sharded ANN + cache problem, not an LLM problem: parallel coarse retrieval, aggressive caching, bounded rerank—and push everything that needs perfection to an async deep path users still perceive as instant.