sharpbyte.dev

Real-time RAG with sub-200ms retrieval at 50M chunks

Users want Google-like speed while you search 50 million chunks. The constraint is usually retrieval p95 < 200ms—not the full LLM answer (that streams afterward). This guide walks every decision: vector store, index geometry, caches, async layout, and where you sacrifice recall for latency.

Scenario

Design a real-time RAG system with sub-200ms response time.

Your retrieval corpus has 50 million chunks. Users expect Google-like response speed. Walk through every architectural decision—vector DB choice, indexing strategy, caching layers, async patterns, and where you’d trade recall for latency.

What you should be able to do after reading:

Step 0 — Define what “sub-200ms” means

Clarify with stakeholders before sizing hardware:

| Metric | Typical target | Includes |
| --- | --- | --- |
| Retrieval SLA | p95 < 200ms | Embed query + ANN search + fetch chunk metadata (no LLM) |
| Time to first token | 400ms–1.5s | Retrieval + prompt build + LLM cold start |
| Full answer | 2–8s streamed | Not part of the 200ms bar |

Phrase that lands well: “Google-like” means results feel instant—you return ranked chunk ids in 200ms and stream the narrative while the user is already reading snippets.
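
To make the p95 target concrete, here is a minimal sketch of checking a batch of retrieval latencies against the 200ms bar. It assumes the nearest-rank definition of a percentile; the sample values and the function names `percentile` and `meets_retrieval_sla` are made up for illustration.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct * n / 100 (1-based rank); avoids float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_retrieval_sla(samples_ms, target_ms=200):
    """True when the p95 of the retrieval-only latencies is under the target."""
    return percentile(samples_ms, 95) < target_ms


# Hypothetical retrieval latencies in milliseconds (no LLM time included).
latencies = [80, 95, 110, 120, 130, 135, 140, 150, 160, 450]
print(percentile(latencies, 50), percentile(latencies, 95))
print(meets_retrieval_sla(latencies))
```

With only ten samples a single slow request decides the p95, which is one reason to agree on the measurement window along with the number.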

Step 1 — Clarifying questions

| Question | Drives |
| --- | --- |
| QPS and burst factor? | Shard count, replica fanout |
| Filters (tenant, ACL, time)? | Pre-filter vs post-filter latency cliff |
| Embedding dim and model? | 768 vs 1536 changes RAM by 2× |
| Must hybrid lexical run in 200ms? | Often no; vector-only fast path + async lexical merge |
| Acceptable recall@10? | Sets HNSW ef_search and whether you use PQ |
| Multi-region? | Read replicas per region vs single global index |

Step 2 — The sixty-second answer

Shard 50M chunks into ~25–50 vector partitions (roughly 1–2M vectors per shard), each with in-memory HNSW (or GPU ANN), and use scalar/PQ compression for the cold long-tail tier. Query path: embed the query or reuse a cached embedding (5–15ms) → parallel coarse search on all shards (40–80ms) → merge top 100 (5ms) → optional exact re-rank on 20 ids (30–50ms) → return chunk ids. Lexical and cross-encoder rerank run async or behind a “deep search” button, not on the hot path.

Caching: normalized query → embedding; popular queries → result set; hot shards pinned in RAM. Recall trades: lower ef_search, lower IVF nprobe, skip hybrid by default, two-stage retrieval, smaller k.

Step 3 — Napkin math (50M chunks)
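
A rough sizing sketch for 50M chunks. Every constant here is an assumption chosen for illustration: float32 vectors, 8-bit PQ codes with the m=48 named later in this guide, and HNSW base-layer links approximated as 2×M four-byte neighbour ids per vector (upper layers, metadata and allocator overhead are ignored).

```python
N_CHUNKS = 50_000_000
BYTES_PER_FLOAT32 = 4


def raw_vector_gb(n_vectors, dim):
    """Uncompressed float32 vectors, in decimal gigabytes."""
    return n_vectors * dim * BYTES_PER_FLOAT32 / 1e9


def pq_code_gb(n_vectors, m_subquantizers, bits_per_code=8):
    """Product-quantized codes: m codes per vector, bits_per_code bits each."""
    return n_vectors * m_subquantizers * bits_per_code / 8 / 1e9


def hnsw_link_gb(n_vectors, m_links):
    """Assumed graph overhead: about 2*M neighbour ids (4 bytes each) per vector."""
    return n_vectors * 2 * m_links * 4 / 1e9


print(raw_vector_gb(N_CHUNKS, 768))    # 153.6 GB of raw vectors at 768 dims
print(raw_vector_gb(N_CHUNKS, 1536))   # 307.2 GB at 1536 dims (the 2x from Step 1)
print(raw_vector_gb(2_000_000, 768))   # 6.144 GB for one 2M-vector shard
print(pq_code_gb(N_CHUNKS, 48))        # 2.4 GB if stored only as PQ codes
print(hnsw_link_gb(N_CHUNKS, 16))      # 6.4 GB of graph links at M=16
```

Under these assumptions the raw vectors alone exceed what a single commodity node comfortably holds in RAM, which is the usual argument for sharding.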

Step 4 — Latency budget (defend every millisecond)

| Stage | Budget | Notes |
| --- | --- | --- |
| API + auth | 5–10ms | Edge JWT validation; no DB hit |
| Query normalize + cache lookup | 2–5ms | Lowercase, strip, hash |
| Query embedding | 5–20ms | GPU batch microservice; cache hit → 0ms |
| ANN search (all shards parallel) | 60–120ms | p95 driver; tune ef_search |
| Cross-shard merge | 5–15ms | Heap merge top-k |
| Fetch chunk metadata | 15–40ms | Redis/doc store by id batch get |
| Optional exact re-score | 0–50ms | Only top 20–40 with full vectors |
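
A quick sanity check on the budget, summing the low and high ends of each stage. The stage keys are illustrative names for the rows above. Percentiles do not add linearly, so treat the high-end sum as a conservative bound rather than a predicted p95.

```python
# (low_ms, high_ms) per stage, copied from the budget table.
BUDGET_MS = {
    "api_auth": (5, 10),
    "normalize_cache_lookup": (2, 5),
    "query_embedding": (5, 20),
    "ann_search": (60, 120),
    "cross_shard_merge": (5, 15),
    "fetch_metadata": (15, 40),
    "exact_rescore": (0, 50),
}
SLA_MS = 200


def total(budget, bound):
    """Sum the 'low' or 'high' end of every stage."""
    idx = 0 if bound == "low" else 1
    return sum(stage[idx] for stage in budget.values())


best = total(BUDGET_MS, "low")     # 92
worst = total(BUDGET_MS, "high")   # 260

# Worst case with the optional re-score skipped and the embedding served from cache.
trimmed = dict(BUDGET_MS, exact_rescore=(0, 0), query_embedding=(0, 0))
worst_trimmed = total(trimmed, "high")   # 190

print(best, worst, worst_trimmed, worst_trimmed < SLA_MS)
```

The high ends add up to 260ms, so the stages cannot all sit at their ceilings at once. In this sketch the bound drops to 190ms only when the optional re-score is skipped and the query embedding is a cache hit.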

Step 5 — End-to-end architecture

```mermaid
flowchart LR
  Q[User query] --> GW[API gateway]
  GW --> QC[Query cache]
  QC --> EMB[Embed service]
  EMB --> RT[Retrieval orchestrator]
  subgraph shards [Sharded ANN - parallel]
    S1[Shard 1 HNSW]
    S2[Shard 2 HNSW]
    SN[Shard N HNSW]
  end
  RT --> S1
  RT --> S2
  RT --> SN
  S1 --> MERGE[Top-k merge]
  S2 --> MERGE
  SN --> MERGE
  MERGE --> META[Chunk metadata Redis]
  META --> FAST[Sub-200ms response]
  MERGE -. optional .-> DEEP[Async deep path]
  DEEP --> LEX[Lexical index]
  DEEP --> RER[Cross-encoder]
  FAST --> UI[UI snippets]
  RER --> LLM[LLM stream - async]
```

Step 6 — Vector DB choice (at 50M scale)

No single “best” product—pick against ops model and filter needs:

| Option | Strengths at 50M | Watch-outs |
| --- | --- | --- |
| Milvus / Zilliz | Sharding, IVF and HNSW indexes, disk index, filters | Tune collection params; ops learning curve |
| Qdrant | Fast filtered HNSW, good DX, quantization | Plan cluster RAM for hot collections |
| Weaviate | Hybrid modules, multi-tenancy | Hybrid on hot path can blow 200ms; split paths |
| Pinecone / managed | Low ops, pod sizing | Cost at 50M×1536; vendor lock-in |
| Self-hosted ANN library (Faiss HNSW, ScaNN) | Maximum control, lowest $ at scale | You own sharding, replication, upgrades |
| pgvector | Simple if corpus is small | Poor default for 50M at <200ms; use as metadata sidecar only |

Recommendation to state in a design session: managed or Milvus/Qdrant cluster with explicit sharding by tenant or content hash, vectors on fast storage, metadata in Redis/Postgres.
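
One way to sketch "explicit sharding by tenant or content hash": route on the tenant id when there is one, otherwise on the chunk id, using a stable hash. The function names and the default of 25 shards (50M chunks at about 2M per shard) are assumptions for illustration, not any particular vector database's API.

```python
import hashlib


def shard_for(key: str, n_shards: int) -> int:
    """Stable shard index from a string key (same key always maps to the same shard)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def shard_for_chunk(chunk_id: str, tenant_id=None, n_shards: int = 25) -> int:
    """Tenant-keyed when a tenant is given (queries hit one shard), else content-hashed."""
    key = tenant_id if tenant_id is not None else chunk_id
    return shard_for(key, n_shards)


# Two chunks of the same hypothetical tenant land on the same shard.
print(shard_for_chunk("chunk-001", tenant_id="acme"))
print(shard_for_chunk("chunk-002", tenant_id="acme"))
# Without a tenant, chunks spread by their own id.
print(shard_for_chunk("chunk-001"), shard_for_chunk("chunk-002"))
```

Plain modulo hashing remaps most keys when the shard count changes, so a real deployment would likely want consistent hashing or a fixed number of virtual shards mapped onto nodes.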

Step 7 — Indexing strategy

Sharding

Index types

| Index | When | Latency | Recall |
| --- | --- | --- | --- |
| HNSW | Default hot path | Lowest at fixed RAM | High if ef_search tuned |
| IVF + PQ | Cold/archive tiers | Very fast, less RAM | Lower; use for stage-1 coarse |
| DiskANN / on-disk HNSW | Corpus too big for RAM | Moderate | Good with NVMe |
| Brute force on 20 ids | After ANN shortlist | Cheap exact rerank | Recovers PQ loss |

Parameters you name explicitly

HNSW: M=16–32, ef_construction=200, ef_search=64–128 (tune down for speed)
IVF: nlist ≈ sqrt(N_shard), nprobe=8–16 (lower nprobe = faster)
PQ: m=48 subquantizers for 768-dim (stage-1 only)
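
The derived numbers behind those parameters can be checked with a few lines of arithmetic. This sketch assumes a 2M-vector shard and 768-dim embeddings; the helper names are illustrative.

```python
import math


def ivf_nlist(n_shard_vectors: int) -> int:
    """The nlist ~ sqrt(N_shard) heuristic, rounded to an integer."""
    return round(math.sqrt(n_shard_vectors))


def pq_subvector_dim(dim: int, m: int) -> int:
    """Dimensions covered by each PQ subquantizer; dim must divide evenly by m."""
    if dim % m != 0:
        raise ValueError(f"dim={dim} is not divisible by m={m}")
    return dim // m


def nprobe_fraction(nprobe: int, nlist: int) -> float:
    """Share of inverted lists scanned per query."""
    return nprobe / nlist


nlist = ivf_nlist(2_000_000)
print(nlist)                                   # 1414
print(pq_subvector_dim(768, 48))               # 16
print(round(nprobe_fraction(16, nlist), 4))    # 0.0113
```

So at nprobe=16 a query scans only about 1% of a shard's lists, which is where both the speed and the recall loss of IVF come from.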

Embeddings offline

Precompute all 50M embeddings in batch; never embed the corpus at query time. Version embeddings; blue/green collection per model version.

Step 8 — Caching layers (biggest lever for “Google-like”)

| Layer | Key | TTL | Hit effect |
| --- | --- | --- | --- |
| L1: Query result | hash(normalized query + filters + corpus_version) | 5–60 min | Full path < 5ms |
| L2: Query embedding | same hash | hours | Skip embed GPU (15ms saved) |
| L3: Hot shard graphs | pin in RAM / GPU | n/a | Avoid disk ANN tail latency |
| L4: Chunk metadata | chunk_id | long | Batch MGET 20 keys < 3ms in Redis |
| L5: Suggest / prefix | typed prefix | short | Feels instant like Google Suggest |

Invalidate L1/L2 on corpus_version bump—not wall clock alone.
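
A minimal sketch of the L1 key, `hash(normalized query + filters + corpus_version)`. Whitespace collapsing, the JSON serialization and the `rag:result:` prefix are assumptions added for illustration; the guide itself only specifies lowercase, strip and hash.

```python
import hashlib
import json


def normalize_query(query: str) -> str:
    """Lowercase, strip, and collapse internal whitespace."""
    return " ".join(query.lower().split())


def result_cache_key(query: str, filters: dict, corpus_version: str) -> str:
    """Deterministic key: same normalized query, filters and corpus version give the same key."""
    payload = json.dumps(
        {"q": normalize_query(query), "f": filters, "v": corpus_version},
        sort_keys=True,            # filter order must not change the key
        separators=(",", ":"),
    )
    return "rag:result:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


k1 = result_cache_key("  Reset  PASSWORD ", {"tenant": "acme", "lang": "en"}, "v42")
k2 = result_cache_key("reset password", {"lang": "en", "tenant": "acme"}, "v42")
k3 = result_cache_key("reset password", {"lang": "en", "tenant": "acme"}, "v43")
print(k1 == k2, k1 == k3)
```

Because the corpus version is part of the key, bumping it makes old entries unreachable without an explicit purge; they then age out through their TTL.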

Step 9 — Async patterns (what not to block on)

  1. Parallel shard RPCs with 150ms deadline; partial results OK if SLA strict (degrade shard count).
  2. Return fast path first: top 10 vector hits + titles/snippets to UI in <200ms.
  3. Fork deep path: lexical BM25, cross-encoder rerank, LLM—client merges when ready (skeleton UI).
  4. Embed service: continuous batching on GPU; separate pool from ANN nodes.
  5. Prefetch: on keystroke pause, speculative embed + retrieve before Enter.
  6. Write path async: new docs → queue → index; never block read path on ingest.
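
The first pattern in the list, parallel shard calls under a deadline with partial results, can be sketched with `asyncio`. Here `search_shard` is a stand-in for a real shard RPC (it sleeps and returns made-up scores), and scores are assumed to be similarities where higher is better.

```python
import asyncio
import heapq
import random


async def search_shard(shard_id, query_vec, k, delay_s):
    """Stand-in for a shard RPC: waits delay_s, then returns k fake (score, chunk_id) hits."""
    await asyncio.sleep(delay_s)
    rng = random.Random(shard_id)  # deterministic fake scores; query_vec is unused here
    return [(rng.random(), f"s{shard_id}-c{i}") for i in range(k)]


async def fan_out(query_vec, shard_delays, k=10, deadline_s=0.150):
    """Query all shards in parallel; merge whatever answered before the deadline."""
    tasks = {
        asyncio.create_task(search_shard(sid, query_vec, k, delay)): sid
        for sid, delay in shard_delays.items()
    }
    done, pending = await asyncio.wait(list(tasks), timeout=deadline_s)
    for task in pending:
        task.cancel()  # do not let slow shards hold the response
    await asyncio.gather(*pending, return_exceptions=True)
    hits = [hit for task in done if task.exception() is None for hit in task.result()]
    top_k = heapq.nlargest(k, hits)  # heap merge of per-shard top-k lists
    timed_out = sorted(tasks[task] for task in pending)
    return top_k, timed_out


# Hypothetical shards: two fast, one far slower than the 150ms deadline.
top, timed_out = asyncio.run(
    fan_out([0.1, 0.2], {0: 0.01, 1: 0.02, 2: 0.5}, k=3, deadline_s=0.15)
)
print(len(top), timed_out)
```

Returning the list of timed-out shards lets the caller log the degraded recall or alert on it instead of failing the whole query.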

Step 10 — Trading recall for latency (be explicit)

| Trade | Latency win | Recall cost | When acceptable |
| --- | --- | --- | --- |
| Lower ef_search (64 vs 256) | Large | Miss long-tail relevant chunks | Consumer search; head queries cached |
| IVF with low nprobe | Large | Coarse clusters wrong | Stage-1 only + exact rerank top 20 |
| PQ compressed vectors | RAM / disk bandwidth | Distance error | With brute-force rerank on shortlist |
| Skip hybrid lexical on default | 50–150ms | Miss exact SKU / error codes | Offer “Exact match” mode |
| Skip cross-encoder rerank | 100–300ms | Wrong order in top 10 | Snippet UI tolerates good-enough |
| Smaller k (10 vs 50) | Merge + metadata fetch | LLM context thinner | Fetch more only for LLM path async |
| Aggressive ACL pre-filter | Smaller search space | Complex filters slow if badly indexed | Partition by tenant at shard key |
| Approximate metadata | Skip store round-trip | Stale titles | Only for suggest, not compliance answers |

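Two-stage pattern (say this clearly): Stage 1 = fast ANN that over-fetches a shortlist (for example, recall around 0.85). Stage 2 = exact dot product or a tiny reranker on about 30 ids, which can bring recall back to roughly 0.95 as long as the true neighbours made it into the shortlist. The total stays under 200ms if stage 2 is bounded.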
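
A minimal sketch of stage 2 as an exact dot-product rerank over the shortlist, in plain Python. The vectors, ids and function names are invented for illustration; in practice the full-precision vectors would come from a store keyed by chunk id.

```python
def dot(a, b):
    """Exact inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_rerank(query_vec, shortlist_ids, full_vectors, k=10):
    """Re-score a small ANN shortlist with full-precision vectors and keep the top k."""
    scored = [(dot(query_vec, full_vectors[cid]), cid) for cid in shortlist_ids]
    scored.sort(reverse=True)  # highest similarity first
    return [cid for _, cid in scored[:k]]


# Hypothetical data: stage 1 (compressed distances) returned the ids in a poor order.
full_vectors = {
    "a": [0.9, 0.1],
    "b": [0.2, 0.9],
    "c": [0.5, 0.5],
}
stage1_shortlist = ["b", "c", "a"]
print(exact_rerank([1.0, 0.0], stage1_shortlist, full_vectors, k=2))  # ['a', 'c']
```

The rerank can only reorder what stage 1 returned, so the shortlist has to be larger than the final k for this to recover recall.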

Step 11 — Filters and multi-tenancy without killing p95

Step 12 — Failure points

| Failure | Symptom | Mitigation |
| --- | --- | --- |
| Hot shard | One tenant spikes p95 | Isolate noisy tenant; rate limit; dedicated replica |
| Cache stampede | Thundering herd on popular query | Single-flight; stale-while-revalidate |
| Stale cache after ingest | Wrong results feel “fast” | corpus_version in cache key; partial invalidation |
| Shard timeout | Partial recall drop | Merge available shards; alert; don’t block whole query |
| Embedding service backlog | 200ms blown | Separate pool; fallback to a smaller distilled embed model (only if it shares the index's vector space) |
| Over-tuned for speed | Product complaints on miss | Offline recall@k dashboard; “deep search” path |
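
The single-flight mitigation for cache stampedes can be sketched as follows: concurrent requests for the same key share one in-flight computation instead of each hitting the ANN tier. The `SingleFlight` class and the `demo` harness are illustrative, not taken from a specific library.

```python
import asyncio


class SingleFlight:
    """Coalesce concurrent calls for the same key into one underlying computation."""

    def __init__(self):
        self._in_flight = {}

    async def do(self, key, compute):
        task = self._in_flight.get(key)
        if task is None:
            task = asyncio.ensure_future(compute())
            self._in_flight[key] = task
            # Forget the key once finished so later calls recompute (or hit the cache).
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # shield: one cancelled caller must not cancel the shared computation.
        return await asyncio.shield(task)


async def demo(n_callers=50):
    calls = 0

    async def expensive_retrieval():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for embed + ANN + metadata fetch
        return ["chunk-1", "chunk-2"]

    sf = SingleFlight()
    results = await asyncio.gather(
        *(sf.do("q:popular", expensive_retrieval) for _ in range(n_callers))
    )
    return calls, results


calls, results = asyncio.run(demo())
print(calls, len(results))
```

This only coalesces requests inside one process. Across replicas the same idea would need a shared lock or a stale-while-revalidate policy in the cache itself.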

Step 13 — Observability

Step 14 — How to walk through this in a design session

  1. 2 min — define 200ms = retrieval only; show budget table.
  2. 5 min — 50M math → sharding mandatory.
  3. 8 min — diagram: parallel shards, caches, async deep path.
  4. 10 min — index choice + ef_search / two-stage recall recovery.
  5. 5 min — caching layers L1–L5.
  6. 5 min — recall vs latency table + when hybrid is off hot path.
  7. Close — “Fast feels instant; perfect is async.”

Step 15 — Goals → knobs

| Goal | Knob |
| --- | --- |
| Lower p95 | ↓ ef_search, ↓ nprobe, more caching, fewer shards per query |
| Higher recall | ↑ ef_search, two-stage exact rerank, hybrid deep path |
| Lower cost | PQ + disk tier; smaller embed dim; fewer RAM replicas |
| Higher QPS | Read replicas, embed batching, result cache |

The one line to remember

Sub-200ms RAG at 50M chunks is a sharded ANN + cache problem, not an LLM problem: parallel coarse retrieval, aggressive caching, bounded rerank—and push everything that needs perfection to an async deep path users still perceive as instant.