Real-time RAG with sub-200ms retrieval at 50M chunks
Users want Google-like speed while you search 50 million chunks. The constraint is usually retrieval p95 < 200ms—not the full LLM answer (that streams afterward). This guide walks every decision: vector store, index geometry, caches, async layout, and where you sacrifice recall for latency.
Scenario
Design a real-time RAG system with sub-200ms response time.
Your retrieval corpus has 50 million chunks. Users expect Google-like response speed. Walk through every architectural decision—vector DB choice, indexing strategy, caching layers, async patterns, and where you’d trade recall for latency.
What you should be able to do after reading:
- Split the SLA into a latency budget and defend what fits in 200ms.
- Pick index types and parameters (HNSW, IVF, PQ) for 50M vectors without hand-waving.
- Layer caches so hot queries never touch cold disk.
- Explain two-stage retrieval and when you drop hybrid search or reranking.
Step 0 — Define what “sub-200ms” means
Clarify with stakeholders before sizing hardware:
| Metric | Typical target | Includes |
|---|---|---|
| Retrieval SLA | p95 < 200ms | Embed query + ANN search + fetch chunk metadata (no LLM) |
| Time to first token | 400ms–1.5s | Retrieval + prompt build + LLM cold start |
| Full answer | 2–8s streamed | Not part of the 200ms bar |
Phrase that lands well: “Google-like” means results feel instant—you return ranked chunk ids in 200ms and stream the narrative while the user is already reading snippets.
Step 1 — Clarifying questions
| Question | Drives |
|---|---|
| QPS and burst factor? | Shard count, replica fanout |
| Filters (tenant, ACL, time)? | Pre-filter vs post-filter latency cliff |
| Embedding dim and model? | 768 vs 1536 changes RAM by 2× |
| Must hybrid lexical run in 200ms? | Often no—vector-only fast path + async lexical merge |
| Acceptable recall@10? | Sets HNSW ef_search and whether you use PQ |
| Multi-region? | Read replicas per region vs single global index |
Step 2 — The sixty-second answer
Shard 50M chunks into ~25–50 vector partitions (1–2M vectors per shard), each with in-memory HNSW (or GPU ANN), and keep scalar- or PQ-compressed vectors for the cold tail on cheaper storage. Query path: embed from cache (5–15ms) → parallel coarse search on all shards (40–80ms) → merge top 100 (5ms) → optional exact re-rank on 20 ids (30–50ms) → return chunk ids. Lexical search and cross-encoder reranking run async or behind a “deep search” button, not on the hot path.
Caching: normalized query → embedding; popular queries → result set; hot shards pinned in RAM.
Recall trades: lower ef_search, IVF probing, skip hybrid on default, two-stage retrieve, smaller k.
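The "merge top 100" step in the query path above is a plain top-k merge over per-shard hit lists. A minimal sketch, assuming each shard returns `(score, chunk_id)` pairs where a higher score is better; the `Hit` alias, `merge_top_k`, and the sample ids are illustrative names, not part of any vector DB API:

```python
import heapq
from typing import Iterable

Hit = tuple[float, str]  # (similarity score, chunk_id); higher score is better


def merge_top_k(shard_results: Iterable[list[Hit]], k: int) -> list[Hit]:
    """Merge per-shard hit lists into one global top-k, best score first."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


# Hypothetical results from three shards
shard_a = [(0.91, "c17"), (0.80, "c02")]
shard_b = [(0.95, "c88"), (0.42, "c31")]
shard_c = [(0.87, "c54")]

top3 = merge_top_k([shard_a, shard_b, shard_c], k=3)
print(top3)
```

Because each shard only returns its own top-k, the merge touches at most `k × shards` items, which is why it fits in a few milliseconds.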
Step 3 — Napkin math (50M chunks)
- 50M × 768-dim × 4 bytes ≈ 150 GB raw vectors (1536-dim → ~300 GB).
- HNSW graph overhead ~1.2–1.5× → plan roughly 180–230 GB (768-dim) or 370–460 GB (1536-dim) of working set if fully in RAM; otherwise use PQ + on-disk vectors with RAM for hot centroids only.
- At 1k QPS, a naive brute-force scan would need 1,000 × 50M = 5 × 10^10 distance computations per second. Shard so each query searches ~2M-vector shards in parallel, not 50M vectors serially.
- 200ms budget example: embed 15ms + ANN 90ms + merge 10ms + metadata 25ms + margin 60ms.
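The sizing arithmetic above is easy to reproduce. A small sketch, assuming float32 vectors (4 bytes per value), decimal gigabytes, and the 1.2–1.5× HNSW overhead range quoted in the list; `raw_vector_gb` is an illustrative helper name:

```python
def raw_vector_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Size of uncompressed vectors in decimal gigabytes."""
    return num_vectors * dim * bytes_per_value / 1e9


N = 50_000_000

for dim in (768, 1536):
    raw = raw_vector_gb(N, dim)
    low, high = raw * 1.2, raw * 1.5  # HNSW graph overhead range from the text
    print(f"dim={dim}: raw {raw:.0f} GB, with HNSW graph {low:.0f}-{high:.0f} GB")

# Brute force at 1k QPS: every query is compared against every vector.
qps = 1_000
print(f"naive distance computations per second: {qps * N:.1e}")
```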
Step 4 — Latency budget (defend every millisecond)
| Stage | Budget | Notes |
|---|---|---|
| API + auth | 5–10ms | Edge JWT validation; no DB hit |
| Query normalize + cache lookup | 2–5ms | Lowercase, strip, hash |
| Query embedding | 5–20ms | GPU batch microservice; cache hit → 0ms |
| ANN search (all shards parallel) | 60–120ms | p95 driver; tune ef_search |
| Cross-shard merge | 5–15ms | Heap merge top-k |
| Fetch chunk metadata | 15–40ms | Redis/doc store by id batch get |
| Optional exact re-score | 0–50ms | Only top 20–40 with full vectors |
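One way to sanity-check the table: the lower bounds sum to 92ms, but the upper bounds sum to 260ms, so every stage cannot hit its worst case inside the SLA. The sketch below copies the ranges from the table and gates the optional re-score on the remaining budget; `can_rescore` and the stage keys are hypothetical names:

```python
SLA_MS = 200

# (stage, low_ms, high_ms) copied from the budget table above
STAGES = [
    ("api_auth", 5, 10),
    ("normalize_cache_lookup", 2, 5),
    ("query_embedding", 5, 20),
    ("ann_search", 60, 120),
    ("cross_shard_merge", 5, 15),
    ("fetch_metadata", 15, 40),
    ("exact_rescore", 0, 50),
]

best_case = sum(low for _, low, _ in STAGES)
worst_case = sum(high for _, _, high in STAGES)
print(f"best case {best_case} ms, worst case {worst_case} ms, SLA {SLA_MS} ms")


def can_rescore(elapsed_ms: float, rescore_cost_ms: float = 50, sla_ms: float = SLA_MS) -> bool:
    """Run the optional exact re-score only if it still fits inside the SLA."""
    return elapsed_ms + rescore_cost_ms <= sla_ms


print(can_rescore(elapsed_ms=140))  # enough headroom left
print(can_rescore(elapsed_ms=170))  # would overrun, so skip the re-score
```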
Step 5 — End-to-end architecture
flowchart LR
Q[User query] --> GW[API gateway]
GW --> QC[Query cache]
QC --> EMB[Embed service]
EMB --> RT[Retrieval orchestrator]
subgraph shards [Sharded ANN - parallel]
S1[Shard 1 HNSW]
S2[Shard 2 HNSW]
SN[Shard N HNSW]
end
RT --> S1
RT --> S2
RT --> SN
S1 --> MERGE[Top-k merge]
S2 --> MERGE
SN --> MERGE
MERGE --> META[Chunk metadata Redis]
META --> FAST[Sub-200ms response]
MERGE -. optional .-> DEEP[Async deep path]
DEEP --> LEX[Lexical index]
DEEP --> RER[Cross-encoder]
FAST --> UI[UI snippets]
RER --> LLM[LLM stream - async]
Step 6 — Vector DB choice (at 50M scale)
No single “best” product—pick against ops model and filter needs:
| Option | Strengths at 50M | Watch-outs |
|---|---|---|
| Milvus / Zilliz | Sharding, IVF+HNSW, disk index, filters | Tune collection params; ops learning curve |
| Qdrant | Fast filtered HNSW, good DX, quantization | Plan cluster RAM for hot collections |
| Weaviate | Hybrid modules, multi-tenancy | Hybrid on hot path can blow 200ms—split paths |
| Pinecone / managed | Low ops, pod sizing | Cost at 50M×1536; vendor lock-in |
| Self-hosted HNSW (Faiss/ScaNN) | Maximum control, lowest $ at scale | You own sharding, replication, upgrades |
| pgvector | Simple if corpus is small | Poor default for 50M at <200ms—use as metadata sidecar only |
Recommendation to state in a design session: managed or Milvus/Qdrant cluster with explicit sharding by tenant or content hash, vectors on fast storage, metadata in Redis/Postgres.
Step 7 — Indexing strategy
Sharding
- Shard key: hash(chunk_id) mod N or tenant_id, so queries with a tenant filter hit 1–3 shards, not 50.
- Target 1–3M vectors per shard for stable HNSW latency.
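The two shard-key options can be sketched as routing functions. This is an illustration under assumptions: `NUM_SHARDS = 32`, the `spread` parameter, and the function names are hypothetical, and a real cluster would normally take placement from the vector DB's own routing rather than hand-rolled hashing.

```python
import hashlib

NUM_SHARDS = 32  # hypothetical; the text suggests roughly 25-50 shards


def stable_hash(key: str) -> int:
    """Deterministic hash; Python's built-in hash() is salted per process."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for_chunk(chunk_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Content-hash placement: even spread, but queries fan out to all shards."""
    return stable_hash(chunk_id) % num_shards


def shards_for_tenant(tenant_id: str, num_shards: int = NUM_SHARDS, spread: int = 2) -> list[int]:
    """Tenant placement: a tenant's chunks live on `spread` consecutive shards."""
    first = stable_hash(tenant_id) % num_shards
    return [(first + i) % num_shards for i in range(spread)]


print(shard_for_chunk("doc-123#chunk-7"))
print(shards_for_tenant("acme"))
```

Content hashing balances load but forces full fan-out; tenant placement keeps filtered queries on a few shards at the cost of possible hot shards for large tenants.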
Index types
| Index | When | Latency | Recall |
|---|---|---|---|
| HNSW | Default hot path | Lowest at fixed RAM | High if ef_search tuned |
| IVF + PQ | Cold/archive tiers | Very fast, less RAM | Lower—use for stage-1 coarse |
| DiskANN / on-disk HNSW | Corpus too big for RAM | Moderate | Good with NVMe |
| Brute force on 20 ids | After ANN shortlist | Cheap exact rerank | Recovers PQ loss |
Parameters you name explicitly
- HNSW: M=16–32, ef_construction=200, ef_search=64–128 (tune down for speed)
- IVF: nlist ≈ sqrt(N_shard), nprobe=8–16 (lower nprobe = faster)
- PQ: m=48 subquantizers for 768-dim (stage-1 only)
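To see what these IVF and PQ numbers imply for one shard, the arithmetic can be worked through directly. The sketch assumes a 2M-vector shard and 8-bit codes (1 byte per subquantizer), which is a common PQ setting but is not stated in the text:

```python
import math

DIM = 768
SHARD_VECTORS = 2_000_000  # assumed shard size

# IVF: nlist ~ sqrt(N_shard)
nlist = round(math.sqrt(SHARD_VECTORS))

# PQ: m subquantizers, assuming 8 bits (1 byte) per subquantizer code
m = 48
sub_dim = DIM // m          # dimensions covered by each subquantizer
pq_bytes = m * 1            # bytes per compressed vector
raw_bytes = DIM * 4         # float32 bytes per raw vector
compression = raw_bytes / pq_bytes

# Fraction of IVF lists scanned per query for the nprobe range above
scan_fraction = {nprobe: nprobe / nlist for nprobe in (8, 16)}

print(f"nlist ~ {nlist}, sub-vector dim {sub_dim}")
print(f"{raw_bytes} B raw vs {pq_bytes} B PQ per vector ({compression:.0f}x smaller)")
for nprobe, frac in scan_fraction.items():
    print(f"nprobe={nprobe}: scans about {frac:.2%} of lists")
```

The small scanned fraction is where the speed comes from, and also why PQ is kept to stage 1 with an exact rerank on the shortlist.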
Embeddings offline
Precompute all 50M embeddings in batch; never embed the corpus at query time. Version embeddings; blue/green collection per model version.
Step 8 — Caching layers (biggest lever for “Google-like”)
| Layer | Key | TTL | Hit effect |
|---|---|---|---|
| L1 — Query result | hash(normalized query + filters + corpus_version) | 5–60 min | Full path < 5ms |
| L2 — Query embedding | same hash | hours | Skip embed GPU (15ms saved) |
| L3 — Hot shard graphs | pin in RAM / GPU | n/a | Avoid disk ANN tail latency |
| L4 — Chunk metadata | chunk_id | long | Batch MGET 20 keys < 3ms in Redis |
| L5 — Suggest / prefix | typed prefix | short | Feels instant like Google Suggest |
Invalidate L1/L2 on corpus_version bump—not wall clock alone.
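A minimal sketch of the L1 key, assuming filters arrive as a flat string dictionary and `corpus_version` is an opaque string bumped on ingest; `normalize_query`, `result_cache_key`, and the `l1:` prefix are illustrative names:

```python
import hashlib
import json
import re


def normalize_query(query: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial variants share a key."""
    return re.sub(r"\s+", " ", query.strip().lower())


def result_cache_key(query: str, filters: dict[str, str], corpus_version: str) -> str:
    """L1 key: changes whenever the query, the filters, or the corpus version changes."""
    payload = json.dumps(
        {"q": normalize_query(query), "f": filters, "v": corpus_version},
        sort_keys=True,  # filter order must not change the key
    )
    return "l1:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


k1 = result_cache_key("  Reset   PASSWORD ", {"tenant": "acme", "lang": "en"}, "v42")
k2 = result_cache_key("reset password", {"lang": "en", "tenant": "acme"}, "v42")
k3 = result_cache_key("reset password", {"lang": "en", "tenant": "acme"}, "v43")
print(k1 == k2, k1 == k3)
```

Putting `corpus_version` in the key means a bump makes old entries unreachable immediately; the TTL then only reclaims memory.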
Step 9 — Async patterns (what not to block on)
- Parallel shard RPCs with 150ms deadline; partial results OK if SLA strict (degrade shard count).
- Return fast path first: top 10 vector hits + titles/snippets to UI in <200ms.
- Fork deep path: lexical BM25, cross-encoder rerank, LLM—client merges when ready (skeleton UI).
- Embed service: continuous batching on GPU; separate pool from ANN nodes.
- Prefetch: on keystroke pause, speculative embed + retrieve before Enter.
- Write path async: new docs → queue → index; never block read path on ingest.
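The first bullet (parallel shard RPCs with a deadline and partial results) can be sketched with `asyncio`. The shard call here is a stand-in that just sleeps; `search_shard`, `fan_out`, and the delays are hypothetical, and a real client would issue gRPC or HTTP calls instead:

```python
import asyncio
import heapq

Hit = tuple[float, str]  # (score, chunk_id)


async def search_shard(shard_id: int, delay_s: float) -> list[Hit]:
    """Stand-in for a shard RPC; sleeps to simulate ANN latency."""
    await asyncio.sleep(delay_s)
    return [(1.0 - 0.01 * shard_id - 0.001 * i, f"s{shard_id}-c{i}") for i in range(3)]


async def fan_out(delays_s: list[float], deadline_s: float, k: int) -> tuple[list[Hit], int]:
    """Query all shards in parallel; merge whatever finished before the deadline."""
    tasks = [asyncio.create_task(search_shard(i, d)) for i, d in enumerate(delays_s)]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    for task in pending:
        task.cancel()  # do not let slow shards hold the request open
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellations settle
    hits = [hit for task in done for hit in task.result()]
    return heapq.nlargest(k, hits), len(pending)


# Shard 2 is slow (500 ms) and misses the 150 ms deadline.
top, timed_out = asyncio.run(fan_out([0.01, 0.02, 0.5], deadline_s=0.15, k=5))
print(top)
print(f"shards timed out: {timed_out}")
```

Returning the timed-out count lets the caller log the degraded recall and alert, rather than silently serving partial results.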
Step 10 — Trading recall for latency (be explicit)
| Trade | Latency win | Recall cost | When acceptable |
|---|---|---|---|
| Lower ef_search (64 vs 256) | Large | Miss long-tail relevant chunks | Consumer search; head queries cached |
| IVF with low nprobe | Large | Coarse clusters wrong | Stage-1 only + exact rerank top 20 |
| PQ compressed vectors | RAM / disk bandwidth | Distance error | With brute-force rerank on shortlist |
| Skip hybrid lexical on default | 50–150ms | Miss exact SKU / error codes | Offer “Exact match” mode |
| Skip cross-encoder rerank | 100–300ms | Wrong order in top 10 | Snippet UI tolerates good-enough |
| Smaller k (10 vs 50) | Merge + metadata fetch | LLM context thinner | Fetch more only for LLM path async |
| Aggressive ACL pre-filter | Smaller search space | Complex filters slow if badly indexed | Partition by tenant at shard key |
| Approximate metadata | Skip store round-trip | Stale titles | Only for suggest, not compliance answers |
Two-stage pattern (say this clearly): Stage 1 = fast ANN (recall 0.85). Stage 2 = exact dot product or tiny reranker on 30 ids (recall back to 0.95) still under 200ms if stage 2 is bounded.
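Stage 2 is small enough to write out. A sketch assuming stage 1 has already produced a shortlist of ids from compressed vectors and the full-precision vectors can be fetched by id; the vectors and ids below are toy data and `exact_rerank` is an illustrative name:

```python
def dot(a: list[float], b: list[float]) -> float:
    """Exact dot product on full-precision vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_rerank(
    query: list[float],
    shortlist_ids: list[str],
    full_vectors: dict[str, list[float]],
    k: int,
) -> list[tuple[float, str]]:
    """Re-score a bounded shortlist exactly and return the top k, best first."""
    scored = [(dot(query, full_vectors[cid]), cid) for cid in shortlist_ids]
    scored.sort(reverse=True)
    return scored[:k]


query = [1.0, 0.0]
full_vectors = {"a": [0.9, 0.1], "b": [0.95, 0.0], "c": [0.2, 0.9]}
stage1_order = ["a", "b", "c"]  # approximate distances ranked "a" ahead of "b"

reranked = exact_rerank(query, stage1_order, full_vectors, k=2)
print([cid for _, cid in reranked])
```

The cost is bounded by the shortlist size, not the corpus. Exact re-scoring can only fix the order of candidates stage 1 already returned, so the shortlist has to be wide enough to contain the true neighbours.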
Step 11 — Filters and multi-tenancy without killing p95
- Bad: retrieve top 500 globally, then filter to 10—wastes ANN and leaks titles in logs.
- Good: shard by tenant_id; HNSW per tenant for enterprise; or bitmap filter inside the engine if native.
- Post-filter only when the filter excludes little (<5% of the corpus); otherwise partition the index.
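A rough way to see the post-filter cost: if a filter keeps a fraction of the candidates, the ANN stage has to over-fetch by about the inverse of that fraction to leave k survivors. The sketch assumes matches are spread uniformly among the nearest neighbours, which is optimistic; `post_filter_fetch_size` and the `safety` padding factor are hypothetical:

```python
import math


def post_filter_fetch_size(k: int, keep_fraction: float, safety: float = 1.5) -> int:
    """Candidates to pull from ANN so that about k survive a post-filter.

    Assumes filter matches are uniformly spread among the nearest neighbours;
    `safety` pads for variance and is a made-up default.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    return math.ceil(k * safety / keep_fraction)


for keep in (1.0, 0.5, 0.05):
    print(f"filter keeps {keep:.0%}: fetch {post_filter_fetch_size(10, keep)} to return 10")
```

A filter that keeps only 5% of candidates needs hundreds of ANN results to return 10, which is the over-fetch the "Bad" bullet describes.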
Step 12 — Failure points
| Failure | Symptom | Mitigation |
|---|---|---|
| Hot shard | One tenant spikes p95 | Isolate noisy tenant; rate limit; dedicated replica |
| Cache stampede | Thundering herd on popular query | Single-flight; stale-while-revalidate |
| Stale cache after ingest | Wrong results feel “fast” | corpus_version in cache key; partial invalidation |
| Shard timeout | Partial recall drop | Merge available shards; alert; don’t block whole query |
| Embedding service backlog | 200ms blown | Separate pool; fallback to smaller distilled embed model |
| Over-tuned for speed | Product complaints on miss | Offline recall@k dashboard; “deep search” path |
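The single-flight mitigation for cache stampedes can be sketched in a few lines of `asyncio`: concurrent requests for the same key share one in-flight backend call. `SingleFlight`, `expensive_search`, and the key format are illustrative, and this in-process version would need a distributed lock or similar to work across API replicas:

```python
import asyncio


class SingleFlight:
    """Collapse concurrent requests for the same key into one backend call."""

    def __init__(self) -> None:
        self._inflight: dict[str, asyncio.Task] = {}

    async def do(self, key: str, fn):
        task = self._inflight.get(key)
        if task is None:
            task = asyncio.ensure_future(fn())
            self._inflight[key] = task
            # Drop the entry once the call finishes so later requests refresh.
            task.add_done_callback(lambda _: self._inflight.pop(key, None))
        # shield: one cancelled waiter must not cancel the shared call
        return await asyncio.shield(task)


call_log: list[str] = []


async def expensive_search() -> list[str]:
    call_log.append("backend hit")
    await asyncio.sleep(0.05)  # simulated retrieval latency
    return ["c1", "c2"]


async def main() -> list[list[str]]:
    sf = SingleFlight()
    return await asyncio.gather(*(sf.do("q:hello", expensive_search) for _ in range(10)))


results = asyncio.run(main())
print(f"{len(results)} callers served by {len(call_log)} backend call(s)")
```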
Step 13 — Observability
- Per-stage histogram: embed, per-shard ANN, merge, metadata.
- Recall@k sampled offline nightly with frozen query set.
- Cache hit ratio L1/L2; cost per 1k queries.
- Track slow query log (>200ms) with filter cardinality and ef_search used.
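The nightly recall check is a small computation once exact ground truth exists for the frozen query set. A sketch assuming recall@k is measured as the overlap between the ANN top-k and the exact (brute-force) top-k; the function names and sample ids are illustrative:

```python
def recall_at_k(approx_ids: list[str], exact_ids: list[str], k: int) -> float:
    """Share of the exact top-k that the ANN top-k also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


def mean_recall(runs: list[tuple[list[str], list[str]]], k: int) -> float:
    """Average recall@k over (ann_result, exact_result) pairs."""
    return sum(recall_at_k(approx, exact, k) for approx, exact in runs) / len(runs)


# Hypothetical frozen query set: (ANN result, brute-force result) per query
frozen_set = [
    (["a", "b", "x"], ["a", "b", "c"]),  # one miss out of three
    (["d", "e", "f"], ["d", "e", "f"]),  # perfect
]

score = mean_recall(frozen_set, k=3)
print(f"mean recall@3 = {score:.3f}")
```

Tracking this number next to the ef_search and nprobe values in use shows what each latency tweak cost in recall.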
Step 14 — How to walk through this in a design session
- 2 min — define 200ms = retrieval only; show budget table.
- 5 min — 50M math → sharding mandatory.
- 8 min — diagram: parallel shards, caches, async deep path.
- 10 min — index choice + ef_search / two-stage recall recovery.
- 5 min — caching layers L1–L5.
- 5 min — recall vs latency table + when hybrid is off hot path.
- Close — “Fast feels instant; perfect is async.”
Step 15 — Goals → knobs
| Goal | Knob |
|---|---|
| Lower p95 | ↓ ef_search, ↓ nprobe, more caching, fewer shards per query |
| Higher recall | ↑ ef_search, two-stage exact rerank, hybrid deep path |
| Lower cost | PQ + disk tier; smaller embed dim; fewer RAM replicas |
| Higher QPS | Read replicas, embed batching, result cache |
The one line to remember
Sub-200ms RAG at 50M chunks is a sharded ANN + cache problem, not an LLM problem: parallel coarse retrieval, aggressive caching, bounded rerank—and push everything that needs perfection to an async deep path users still perceive as instant.