
RAG system design

Twenty interview-style scenarios on retrieval-augmented generation at real complexity: ingestion at hundreds of thousands of documents, incremental and CDC-shaped freshness, structure-aware chunking, hybrid lexical+vector fusion, staged pipelines, multi-hop and multilingual retrieval, ACL-bound corpora, evaluation discipline, user feedback loops, massive contracts, GraphRAG, honest citations, conflict handling, support stacks, routers, self-healing retrieval, and SQL+RAG hybrids. Answers read like staff-engineering narratives—tradeoffs, failure modes, and what you would measure—not bullet memes.

Interview stance. When they say “design RAG,” narrate end-to-end data ownership: who is accountable when a wrong clause ships? Ground answers in retrieval metrics, freshness SLAs, and enforcement mechanics—not vibe checks.

31. How would you design a production-grade RAG system from scratch? Walk through every component.

In interviews, weak answers jump straight to “we embed and call GPT.” Strong answers treat RAG as a productized data path: authoritative sources, parsing quality, retrieval metrics, grounding rules, and reruns when something drifts. The LLM is the last mile; most incidents come from bad chunks, stale indexes, or missing ACLs—not from the choice of GPT-4 vs GPT-5.

1) Connectors & scheduling. You need repeatable pulls from wikis, blob storage, ticketing systems, or database exports. Each connector should capture cursor or revision state, respect API rate limits, and record last successful sync for operators. Failures should be visible in a dashboard, not only in logs.

2) Raw object store. Keep immutable originals (PDF bytes, HTML snapshots, export files) with a content hash. When your PDF parser improves next quarter, you reprocess from raw without asking teams to re-upload. This also helps legal holds and audit (“what did we index on March 3?”).

3) Parse & normalize. Layout-aware parsing preserves headings, tables, and code blocks—flattening everything to plain text often destroys table structure that users ask about. Run language detection, optional OCR for scans, and policy-aware PII scrubbing before text ever reaches embeddings if required by contract.

4) Chunking as a contract. Define a chunk schema: stable chunk_id, doc_id, source_revision, heading breadcrumb, acl_principal list or tenant tags, and offsets. Chunks are your unit of retrieval, re-embedding, deletion, and citation. If chunk boundaries wander randomly between pipeline versions, your evals become non-comparable.

5) Embeddings & indexes. Pick embedding model + dimension deliberately; changing it is a migration. Store vectors in a service that supports metadata filtering (tenant, product, doc_type). Add BM25 or similar sparse retrieval when users query with SKUs, error codes, or rare jargon that dense models blur together.

6) Query-time stack. Typical path: optional query rewrite or HyDE → hybrid recall (e.g., top 50–200) → hard metadata filters (never post-filter only in the prompt) → cross-encoder or lightweight reranker on 20–50 candidates → MMR / packing into a token budget with numbered evidence slots for citations.
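
A minimal Python sketch of two of these stages, hard filtering enforced in code and numbered evidence packing; the Chunk fields and the len-divided-by-4 token estimate are illustrative assumptions, not any particular vector store's schema.

from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    tenant: str
    text: str
    score: float

def hard_filter(candidates: list[Chunk], tenant: str) -> list[Chunk]:
    # Deterministic business rule enforced in code, never as a prompt suggestion.
    return [c for c in candidates if c.tenant == tenant]

def pack_numbered(shortlist: list[Chunk], token_budget: int) -> dict[int, Chunk]:
    # Greedy packing into numbered evidence slots [1]..[n]; len(text) // 4 is
    # a crude token estimate standing in for a real tokenizer.
    packed: dict[int, Chunk] = {}
    used = 0
    for c in sorted(shortlist, key=lambda c: -c.score):
        cost = len(c.text) // 4
        if used + cost > token_budget:
            continue
        packed[len(packed) + 1] = c
        used += cost
    return packed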

7) Generation & citations. Instruct the model to quote only numbered passages; add a validator that checks each citation exists and (where possible) spot-checks substring overlap. For high-risk domains, reject answers that cite nothing or invent URLs—fail closed to human review or “insufficient evidence.”

8) Evaluation & governance. Offline labeled sets for recall@k/MRR by segment; online satisfaction, citation clicks, and escalation reasons. Version together: prompt_id, embedding_model_id, index_build_id. When quality drops, you can diff exactly what changed.

Production RAG — major components

flowchart LR
  SRC[Sources] --> RAW[Raw store]
  RAW --> PARSE[Parse + chunk]
  PARSE --> EMB[Embed]
  EMB --> VEC[(Vector index)]
  PARSE --> SP[(Sparse index)]
  Q[User query] --> RET[Retrieve + rerank]
  VEC --> RET
  SP --> RET
  RET --> GEN[LLM + citations]

32. How would you design the document ingestion pipeline for a RAG system that processes 100,000 PDFs?

This question tests whether you think like a data engineer, not a notebook author. 100k PDFs implies long wall-clock runs, partial failures, and operators who need progress, replays, and cost forecasts. Assume heterogeneous PDFs: text-native finance reports, slide decks with tiny fonts, and scanned contracts—one code path will not fit.

Queue & work distribution. Write manifest rows (object key, tenant, priority, content_type guess) into a durable queue (SQS, Kafka, Celery+Redis). Workers extend visibility timeouts for big files so messages are not redelivered mid-parse and duplicated by a second worker. Shard work by tenant to isolate noisy neighbors.

Memory & streaming. Stream bytes to disk or temp storage; never hold a several-hundred-MB PDF fully in RAM per worker. Cap per-file CPU time; killer PDFs go to a poison / quarantine bucket with stack traces and a hash so security can scan for malware separately.

Idempotency. Upsert chunks by deterministic ids derived from (doc_id, chunk_index, parser_version) or similar. Retries must not duplicate vectors or leave orphan chunks—use tombstones or transactional delete-then-insert patterns your vector DB supports.
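
A sketch of the deterministic-id idea, assuming SHA-256 over the (doc_id, chunk_index, parser_version) triple; any stable hash works as long as retries recompute the same key.

import hashlib

def chunk_id(doc_id: str, chunk_index: int, parser_version: str) -> str:
    # Same (doc, position, parser) always yields the same id, so a retried
    # worker upserts in place instead of appending a duplicate vector.
    raw = f"{doc_id}:{chunk_index}:{parser_version}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:32]

assert chunk_id("doc-42", 7, "v3") == chunk_id("doc-42", 7, "v3")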

OCR branch. Detect scan vs text (e.g., font operators vs bitmap heuristics). OCR is slow—fan it to GPU OCR workers or a managed service, and cache OCR text keyed by page image hash. Log OCR confidence; low-confidence pages may be excluded from auto-answer flows.

Embedding batches. Embedding dominates variable cost. Batch encode on GPU inference pods; expose metrics for docs embedded per hour and queue lag. Consider nightly acceleration windows when interactive RAG load is low.

Observability. Surface counts: attempted / succeeded / DLQ / avg parse seconds / OCR % / embed dollars spent. Alert on DLQ growth rate, not only absolute depth.

Example. A compliance team onboards 100k policies: days 1–3 are parse+QC sampling; embeddings run in spot GPU pools overnight; customer-visible search flips only after a gold-query regression passes on the new index snapshot.

33. How would you handle incremental updates to the knowledge base in a RAG system without full re-indexing?

Full reindexes are expensive and risky during business hours. Incremental updates hinge on a stable identity story: you must know which logical document changed, not only that some blob landed in S3 with a new name.

Stable doc ids & hashes. Prefer upstream ids (Confluence page id, Git blob, Salesforce attachment id). Compute a content hash after canonical normalization; if unchanged, skip all downstream work—this alone saves enormous embed spend in wiki-style churn.
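
One way to implement the hash short-circuit, assuming NFKC plus whitespace collapse as the canonical form; the right normalization depends on your parser output.

import hashlib
import unicodedata

def canonical_hash(text: str) -> str:
    # Normalize first so cosmetic churn (whitespace, Unicode forms) does not
    # look like a content change and trigger a re-embed.
    norm = " ".join(unicodedata.normalize("NFKC", text).split())
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

def needs_reprocess(doc_id: str, new_text: str, seen: dict[str, str]) -> bool:
    h = canonical_hash(new_text)
    if seen.get(doc_id) == h:
        return False  # unchanged: skip parse, chunk, and embed entirely
    seen[doc_id] = h
    return True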

Chunk diffing. When text changes, re-chunk that document only. Map old chunk_ids to new ones; delete vectors for chunks whose span disappeared. For some stores you delete_by_filter(doc_id=…) then bulk upsert fresh vectors in one job to avoid gaps where queries hit empty sets.

Tombstones & deletes. If a document is removed from the source, propagation must delete or mark inactive in hours, not weeks—otherwise you serve ghost citations. Hook deletes from CMS webhooks as seriously as creates.

Version skew. During re-embed, dual-read or serve the prior snapshot until the new one passes parity checks on a small golden query set. Flip a router flag atomically—users should never see half-old/half-new index unless you intentionally label answers “reindex in progress.”

Periodic sweep. Incremental systems miss edge cases (e.g., failed partial writes). A weekly reconciliation job compares source inventory to vector index cardinality per doc_id and opens tickets for drift.

34. How would you design a RAG system that must stay in sync with a frequently changing data source (e.g., a live database or wiki)?

Start by asking how stale an answer may be. A pricing wiki that changes hourly needs different architecture than an HR handbook updated monthly. Your design should make freshness explicit in metadata and sometimes in the user-visible answer (“according to Confluence revision 4821”).

Ingest triggers. Prefer webhooks or change-data capture from the source over brute-force hourly crawls—CDC reduces load and narrows the diff you must process. When webhooks are flaky, add a slower reconciliation poll as safety net.

Debounce & coalesce. Wikis generate burst edits. Buffer events (e.g., 30–120s) so you re-embed once per page per window instead of 40 times in a minute. Cap writes per second to your vector service to avoid throttling and hot partitions.
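
A toy debouncer showing the coalescing idea; the in-memory dict stands in for whatever durable buffer (Redis, a queue with delayed delivery) you actually run.

import time

class EditDebouncer:
    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s          # the 30–120s window from above
        self.last_edit: dict[str, float] = {}

    def record_edit(self, page_id: str) -> None:
        # 40 edits in a minute collapse to one pending entry per page.
        self.last_edit[page_id] = time.monotonic()

    def drain_ready(self) -> list[str]:
        # Pages quiet for a full window are released for one re-embed each.
        now = time.monotonic()
        ready = [p for p, t in self.last_edit.items() if now - t >= self.window_s]
        for p in ready:
            del self.last_edit[p]
        return ready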

Token economics. Re-embedding whole pages on every typo is wasteful. Use paragraph-level hashing inside a page where possible, or diff extracted text and only reprocess changed subtrees.

Relational sources. If numbers live in Postgres and prose in Notion, consider text-to-SQL or materialized views for metrics and RAG for narrative—this prevents the support bot from quoting revenue from a vector snapshot that lagged 20 minutes behind the ledger.

User-visible lag. Track now() - source_updated_at at serve time. If lag exceeds SLA, downgrade to “search only” or show a warning banner rather than silently issuing confident prose.

Hot source → search index

flowchart LR
  W[Wiki or DB] --> CDC[Webhook / CDC / poll]
  CDC --> DEB[Debounce + batch]
  DEB --> EMB[Re-embed changed docs]
  EMB --> IDX[(Vector index)]

35. How would you design the chunking strategy for a RAG system that handles mixed content: tables, code snippets, prose, and diagrams?

Uniform sliding windows excel at blogs and fail at tables, APIs, and contracts. Interviewers want to hear structure-aware chunking: parse the document into typed blocks, then choose chunk rules per type.

Prose. Split on headings when the parser provides an outline. Use overlap (10–20% of chunk size) so boundary clauses are not orphaned. For legal text, prefer splitting after numbered clauses if detected.

Tables. Keep small tables intact; for wide tables, chunk by row ranges but duplicate column headers in each chunk so embeddings are not meaningless tuples of numbers without context. Consider serializing to Markdown or TSV inside the chunk text.
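
A sketch of row-range chunking with duplicated headers, serialized to Markdown as suggested; rows_per_chunk is a tunable, not a recommendation.

def chunk_table(header: list[str], rows: list[list[str]],
                rows_per_chunk: int = 20) -> list[str]:
    # Every chunk repeats the header row so a slice of rows embeds with its
    # column context instead of as meaningless tuples of numbers.
    head = "| " + " | ".join(header) + " |"
    sep = "|" + "---|" * len(header)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        body = "\n".join("| " + " | ".join(r) + " |"
                         for r in rows[start:start + rows_per_chunk])
        chunks.append("\n".join([head, sep, body]))
    return chunks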

Code. Chunk by function/class when a static parse is available; include file path + language tag in metadata. Avoid cutting mid-function unless windowing demands it—retrieval on half a loop body misleads both model and developer.

Diagrams & figures. If you only embed captions, be honest: image-only slides are poorly searchable unless you run vision captioning or have author-supplied alt text. Store a confidence: low flag in metadata so the answering policy can abstain.

Downstream effect. Bad chunking shows up as “right document, wrong snippet” in evals—measure per-modality retrieval quality rather than aggregate accuracy only.

Example. API reference doc: each endpoint becomes one chunk with summary + parameters list; prose ‘Getting started’ uses heading-based chunks; tables of error codes stay intact for BM25-heavy queries.

36. How would you implement hybrid search (dense + sparse) in a RAG pipeline and combine scores effectively?

Why hybrid? Dense embeddings compress meaning but can miss exact tokens: SKUs, version strings (“v3.2.1-hotfix”), internal acronyms, rare drug names. BM25-style sparse retrieval still wins many head-to-head benchmarks on lexical tasks. Hybrid is not “old tech + new tech”—it is complementary coverage.

Running both lists. Issue parallel ANN + inverted index queries with the same metadata filter set. Filtering after fusion lets junk from other tenants pollute reranking.

Fusion. Reciprocal Rank Fusion (RRF) is popular because it needs no score normalization—robust when score scales differ wildly. Weighted blending works if you calibrate on offline sets: learn weights per vertical (support vs legal) because query style shifts.
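
RRF itself is a few lines; this sketch fuses any number of ranked id lists with the conventional k = 60 constant.

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # score(d) = sum over lists of 1 / (k + rank of d); no score
    # normalization needed, so BM25 and cosine scales can differ freely.
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

A document ranked moderately by both retrievers often beats one that tops a single list, which is exactly the complementary-coverage behavior you want.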

Tuning. Log retrieval lists before rerank; analyze queries where BM25 saved you and keep an append-only defect list of those cases. Re-evaluate when you change embedding model—relative BM25/dense balance shifts.

Latency. Two retrievals add cost; short-circuit trivial cases (e.g., if query is exact phrase in quotes, try sparse-first).

37. How would you design a multi-stage retrieval pipeline: retrieval → filtering → reranking → context assembly?

Stages exist because you cannot afford cross-encoder reranking on 10k chunks per query, and because hard business rules (ACL, language, region) should not be “prompt suggestions.”

Stage 1 — high-recall retrieval. Cast a wide net with hybrid search (top 100–300) optimized for latency. This stage tolerates noise.

Stage 2 — deterministic filters. Drop anything failing ACL, wrong language, retired SKU docs, or outside legal date window. If this stage nukes everything, return a structured “no evidence” to the UI rather than asking GPT to improvise.

Stage 3 — rerank. Cross-encoder or small transformer reranker on dozens of candidates—here you pay GPU but only on a short list. Consider ColBERT-style late interaction if latency budget allows.

Stage 4 — context assembly. Pack into the token budget with MMR to reduce near-duplicates from the same 30-page PDF repeating boilerplate. Number chunks [1]..[n] and require the generator to reference those ids.
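
A compact MMR selection sketch; sim is any pairwise similarity you already have (cosine over stored vectors is typical), lam trades relevance against redundancy, and each candidate dict is assumed to carry a "relevance" score from the reranker.

from typing import Callable

def mmr_select(candidates: list[dict], sim: Callable[[dict, dict], float],
               lam: float = 0.7, max_chunks: int = 8) -> list[dict]:
    # Penalize candidates similar to already-selected chunks so ten copies
    # of the same 30-page PDF's boilerplate do not fill the context window.
    selected: list[dict] = []
    pool = list(candidates)
    while pool and len(selected) < max_chunks:
        def score(c: dict) -> float:
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * c["relevance"] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected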

Failure telemetry. Track drop-off counts per stage (“200 → 80 after ACL → 12 after rerank”). Spikes indicate misconfigured filters or an overly aggressive rewriter.

Multi-stage retrieval

flowchart LR
  Q[Query] --> R1[Recall top K]
  R1 --> F[Metadata filter]
  F --> R2[Rerank top N]
  R2 --> P[Pack context MMR]
  P --> LLM[Generator]

38. How would you handle retrieval for multi-hop questions that require combining information from multiple documents?

Single-pass top-k assumes one hop answers the question. Multi-hop queries look like: “Compare our 2023 travel policy mileage rate with the 2024 policy and note which regions changed.” Each fact may live in different PDFs and different sections—one embedding search rarely lands all required chunks together.

Query decomposition. Use a planner (small LLM or rules) to output sub-questions with an execution graph: independent sub-queries can run in parallel; dependent ones run sequentially with prior answers injected as context. Cap depth (e.g., max 3 rounds) to control cost.

Iterative retrieval. Retrieve → draft partial bullet outline → retrieve again with an expanded query embedding that includes missing entity names discovered in pass one. Useful when you did not know which SKU or region acronym to search until after first hop.

Graph augmentation. If GraphRAG or a knowledge graph exists, hop along relations (policy → region → exception) with budgets on hop depth and breadth.

Consistency checks before generation. If sub-answers disagree, surface conflict rather than stitching—multi-hop amplifies contradiction risk.

Evaluation. Multi-hop needs synthetic or human-labeled suites; standard single-hop benchmarks will give false confidence.

39. How would you implement a RAG system that can handle 20 different languages?

Product clarification. Do users expect answers in their query language while docs are multilingual, or must each tenant stay monolingual? That drives indexing strategy more than model choice.

Embeddings. A strong multilingual embedding model (e.g., sentence-transformers multilingual family or provider equivalents) often beats maintaining 20 separate indexes operationally—at the cost of per-locale nuance. For regulated locales, you might still shard!

Language detection & filters. Detect query language with a fast classifier; apply doc_language == query_language filter when users do not want cross-lingual retrieval. For cross-lingual (query in English, corpus in Spanish), keep retrieval multilingual but force the generator instructions to cite originals and translate faithfully.

Tokenization & normalization. Lowercasing rules differ; CJK segmentation differs; RTL UI may reorder citations—test truncation and highlighting paths.

Evaluation per language. Aggregate accuracy hides weak locales; report recall@k by language bucket and hire native reviewers for top markets.

40. How would you design a RAG architecture that supports both public and private document collections with access control?

The fatal mistake is stuffing both corpora into one index and telling GPT “ignore private docs for external users.” Models do not enforce security; retrieval filters and network boundaries do. Interviewers listen for that sentence.

Metadata modeling. Each chunk carries principals: user ids, group ids, clearance labels, or “public” marker expanded at index time from directory groups. Resolve group membership at query time in the auth service, then pass an OR-of-ANDs filter payload the vector DB understands.
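
The filter payload might look like the following; the $or/$in syntax is a generic illustration, since every vector store exposes its own filter DSL.

def acl_filter(user_id: str, group_ids: list[str]) -> dict:
    # OR-of-ANDs: a chunk is visible if it is public, shared with the user
    # directly, or shared with any group resolved by the auth service.
    return {
        "$or": [
            {"visibility": {"$eq": "public"}},
            {"allowed_users": {"$in": [user_id]}},
            {"allowed_groups": {"$in": group_ids}},
        ]
    }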

Physical isolation. Regulated customers may require separate collections or clusters per tenant so a bug in filter DSL cannot leak across. Trade cost vs blast radius.

Citations & UX. When an answer blends public FAQ with internal engineering note, label each sentence’s scope in structured output so the UI can suppress internal-only URLs for external viewers.

Testing. Automated tests impersonate users from different LDAP groups and assert retrieved chunk ids ⊆ allowed set—treat like row-level security QA.

41. How would you evaluate and improve the retrieval quality of your RAG system in production?

Split the problem: retrieval quality (did we fetch the evidence?) vs generation quality (was the prose faithful?). Teams often optimize only the end-to-end LLM score and waste time tuning prompts when retrieval never surfaced the right PDF.

Offline IR metrics. Build labeled query–document relevance judgments (binary or graded). Track recall@k, nDCG, MRR stratified by product line and language. Add hard negatives (similar but wrong policy version) because real users confuse versions constantly.
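
Both metrics are small enough to keep in the eval harness itself; a sketch assuming binary relevance labels.

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the labeled relevant docs that appear in the top k.
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / max(len(relevant), 1)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant hit; 0 if none retrieved.
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0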

Online proxies. Citation click-through, thumbs, support agent overrides, ‘copy-paste corrected answer’ signals. Beware selection bias—angry users click more.

Error taxonomy. Bucket failures: parser destroyed table, chunk too large, ACL over-filtered, stale doc, reranker inverted order, query drift. Each bucket has different engineering owners.

Improvement loop. Weekly triage: top failure clusters → chunking tweak or synonym map or BM25 boost on title fields. Every change runs the same frozen eval notebook tagged with index_build_id.

42. How would you design a feedback loop that uses user thumbs-up/down signals to improve RAG retrieval?

Thumbs are noisy labels: users may dislike tone while retrieval was perfect, or approve a hallucination. Store rich context: query embedding, retrieved ids + scores, final answer, model + prompt version, user role.

Uses beyond fine-tuning. Train or adapt a reranker with pairwise preference loss (preferred doc vs shown doc). Run contextual bandits on chunking window sizes or hybrid weights—cheap exploration if guardrails exist.

Human runway. Sample thumbs-downs into reviewer queue prioritized by customer tier or failure novelty (new cluster in embedding space).

Gaming & privacy. Rate-limit anonymous feedback; ignore bursts from single IPs. Honor retention—delete feedback rows when customer deletes account.

Closed loop safety. Do not auto-promote pipeline changes from thumbs alone without offline regression—counterfactual regret can rise silently.

43. How would you handle long documents (e.g., 500-page legal contracts) in a RAG system given context window limits?

Long docs defeat naive chunking: either you create thousands of near-duplicate chunks from boilerplate, or you cut mid-clause and lose meaning. Interviewers want a hierarchy: cheap navigation layers before expensive full-text retrieval.

Outline & summary index. First-level retrieval hits a table of contents or LLM-generated section summaries (stored as their own chunks) to decide which hundred pages are relevant before you dive into fine chunks.

Clause-aware chunking. Detect numbering patterns (“1.2.3”) and never split between dependent lines. Keep defined terms (“Unless otherwise stated, ‘Vendor’ means…”) attached to the clause block they govern when parser confidence is high.
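
A sketch of numbering-pattern detection; the regex is a hypothetical starting point, and real contracts need parser-confidence checks before you trust it.

import re

# Split before clause headings like "1.", "4.2", "1.2.3" at line start,
# keeping each heading attached to the clause body it governs.
CLAUSE = re.compile(r"^(?=\d+(?:\.\d+)*[.)]?\s)", re.MULTILINE)

def split_clauses(text: str) -> list[str]:
    return [p.strip() for p in CLAUSE.split(text) if p.strip()]

text = "1. Scope\nThis agreement covers...\n1.1 Definitions\n'Vendor' means..."
assert len(split_clauses(text)) == 2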

Multi-pass UX. Offer: (A) high-level answer referencing sections; (B) “drill into §4.2” follow-up that runs a second retrieval focused narrowly—saves tokens when users accept summaries.

Cost governance. Cap how many chunks from one mammoth contract can appear in a single answer unless user explicitly asks for exhaustive comparison.

Evaluation. Legal teams will compare your answer to Ctrl+F—build internal tools showing retrieved clause text side by side for reviewers.

44. How would you architect a GraphRAG system that enriches retrieval with entity relationships?

GraphRAG helps when answers require relationships—who reports to whom, which service depends on which API, which regulation cross-references another—beyond bag-of-passages.

Extraction. Combine NER/dependency parsing with LLM-assisted relation proposals, validated by rules (confidence thresholds, human QA on new edge types). Expect noise; schedule periodic revalidation.

Storage. Property graph or RDF store plus vector anchors linking nodes back to source snippets (evidence spans). Without evidence links, graph hops become difficult to cite.

Community summaries. Cluster related entities (community detection) and precompute natural-language summaries of those clusters—great for broad questions (“What changed in EU privacy posture across subsidiaries?”).

Query routing. Simple factual questions may stay pure vector; relationship-heavy questions invoke graph expansion with strict hop limits and budgets.

Ops reality. Graph pipelines are expensive; monitor rebuild time, stale relationship rate, and user value—otherwise you built Neo4j fanfiction.

45. How would you design a RAG pipeline that cites its sources in the final response?

Citations are a trust contract. Enterprise users need to click through to the PDF page or wiki revision, not see plausible-but-wrong URLs.

Context packaging. Provide the model numbered snippets with stable ids, titles, deep links, and optional excerpt hashes. Instruct “only cite [n] from this list.”

Post-generation validation. Check every bracket citation is in-range; optionally fuzzy-match key phrases from snippet to answer sentence. Auto-repair loop (second LLM pass) only if metrics show benefit—sometimes better to refuse.
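
The range check is mechanical; a sketch, with fuzzy phrase matching left as the optional second pass described above.

import re

def validate_citations(answer: str, evidence: dict[int, str]) -> list[str]:
    # Every [n] in the answer must point at a real numbered snippet, and an
    # answer that cites nothing fails closed.
    problems = []
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    for n in sorted(cited):
        if n not in evidence:
            problems.append(f"citation [{n}] has no matching snippet")
    if not cited:
        problems.append("answer contains no citations")
    return problems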

Channel constraints. Voice UIs read short titles; still return structured JSON with URLs to the client. SMS may truncate—never drop the correlation id used by support.

Failure UX. If no snippets cleared quality bar, answer with uncertainty and offer escalation—never fabricate footnotes to look thorough.

46. How would you handle conflicting information from multiple retrieved chunks in a RAG system?

Contradictions are expected in enterprise data: two wikis, an outdated PDF, an email thread. Models optimistically harmonize—your job is to stop silent merging when stakes are high.

Detect. Chunk metadata can include effective_date, authority_tier (policy vs blog), and source_system. If two snippets disagree on a fact, rules may auto-prefer higher tier or newer date—but disclose that logic in the answer when material.

Present both. Summarize position A vs position B with provenance (“HR portal 2024-11” vs “New hire PDF 2023-03”). Offer clarifying question if user’s scenario matches edge case.

Fix root cause. Log conflicts to content owners; many conflicts mean your dedupe/index freshness failed upstream.

Policy. In regulated advice, conflicting guidance may require human approval before sending final text.

47. How would you design a RAG system for a customer support product that must handle product documentation, FAQs, and ticket history?

Separate corpora or heavy tagging. FAQs are short authoritative snippets; docs are long; tickets are messy, PII-heavy, and conversational. Blending them in one bucket without tags confuses retrieval.

PII on tickets. Mask names, emails, phone numbers before embed; optionally restrict ticket retrieval to internal agents only while customers see FAQ+docs.

Router / intent. “How do I reset password?” → FAQ fast path; “Why was invoice #4821 declined?” → CRM tool + maybe ticket RAG for similar cases, not pure vectors on prose.

Recency & duplication. Tickets repeat the same bug; dedupe near-identical threads or summarize clusters so retrieval is not ten copies of the same workaround.

Metrics. Track deflection rate, time-to-resolution, and human agent edits—LLM scores alone miss operations reality.

48. How would you implement query routing in a RAG system that has multiple specialized indexes (e.g., HR docs, technical docs, finance docs)?

Specialized indexes keep embeddings focused—finance vocabulary does not drown HR answers. The hard part is routing wrong-domain queries that sound generic (“benefits for contractors working abroad” could be HR+tax).

Router signals. Train a lightweight classifier on query text + user metadata (department, geo). Output a distribution over indexes rather than a single argmax when uncertain.

Parallel retrieval + fusion. Fire searches on the top two indexes weighted by router confidence, then RRF merge. Include user-visible scope toggles (“Finance only”) to override bad routing.

Cross-domain queries. Detect compound intents (“compare HR leave policy with finance payroll deductions”) and run decomposed sub-queries rather than smushing keywords.

Evaluate routing itself. Log predicted domain vs human-labeled intent to catch systematic bias (e.g., always guessing engineering).

49. How would you design a self-healing RAG pipeline that detects low-confidence retrievals and triggers fallback strategies?

Signals of weakness. Top reranker score margin is tiny; vector distances are uniformly mediocre; zero results after filters; query length is extreme; HyDE expansion disagrees with the original embedding search. Combine heuristics to avoid false confidence.

Fallback playbook. (1) Query rewrite / expansion with a cheap model; (2) relax benign filters (language) but never ACL; (3) switch to web search tool if enterprise policy permits; (4) escalate to human analyst queue with retrieved attempts logged.

Bounds. Cap fallback rounds (e.g., max two rewrites) to avoid runaway spend. Each attempt logs branch id for later tuning.

Transparency. Users appreciate “I widened search beyond Acme US docs” more than silent mistakes.

Measurement. Track fallback success rate vs human escalations; high escalation after fallback means rewrite strategy is broken.

50. How would you architect a RAG system that works on top of a relational database (Text-to-SQL + RAG hybrid)?

Pure RAG over table exports ages quickly for metrics; pure Text-to-SQL ignores narrative policy. Hybrid answers “What was Q3 ARR in APAC?” and “What narrative risks did leadership disclose about APAC?” in one assistant.

Router. Decide whether the question needs numbers (SQL), long prose (RAG), or both. A classifier can use schema metadata (“this mentions revenue”) plus query patterns.

SQL safety. Generate against read-only views with RLS, validate AST (SELECT-only, allowlisted tables), parameterized binds for user literals, and automatic LIMIT. Log executed SQL with redacted params.
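
A deliberately crude pre-execution guard to make the shape concrete; a production system should validate a parsed AST with a real SQL parser rather than regexes, and the allowlist below is hypothetical.

import re

ALLOWED_TABLES = {"arr_by_region", "revenue_monthly"}  # read-only views

def guard_sql(sql: str) -> str:
    s = sql.strip().rstrip(";")
    # Single SELECT only; embedded semicolons mean multi-statement input.
    if not re.match(r"(?is)^select\b", s) or ";" in s:
        raise ValueError("only single SELECT statements are allowed")
    tables = set(re.findall(r"(?is)\b(?:from|join)\s+([a-z_][a-z0-9_]*)", s))
    if not tables <= ALLOWED_TABLES:
        raise ValueError(f"table not allowlisted: {tables - ALLOWED_TABLES}")
    if not re.search(r"(?is)\blimit\s+\d+\b", s):
        s += " LIMIT 1000"  # automatic cap on result size
    return s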

Prompt assembly. Present a FACTS section with tabular results as Markdown/JSON and a CONTEXT section with retrieved policy chunks. Tell the model that numbers must match FACTS verbatim.

Failure modes. Model may still rationalize contradictions—add rule: if SQL says X and doc says Y, call it out explicitly or refuse automation for that class of query.

Audit. Finance users need reproducibility: store query id, warehouse snapshot time, and doc revision ids cited.

Hybrid structured + unstructured

flowchart TB
  Q[Question] --> R[Router]
  R --> SQL[Text-to-SQL + guardrails]
  R --> RAG[Doc retrieval]
  SQL --> TAB[Result table]
  RAG --> TXT[Passages]
  TAB --> GEN[LLM grounded answer]
  TXT --> GEN

Recap — this section

Q · Takeaway
31 · Own the full path: connectors → raw bytes → parse contract → dual indexes → filtered retrieve → rerank/pack → grounded gen → versioned eval.
32 · Durable queue, streaming, OCR branch, idempotent chunk ids, batch embed economics, DLQ observability.
33 · Stable ids + content hash short-circuit; per-doc rechunk + vector deletes; webhook deletes; blue/green flip + reconciliation sweeps.
34 · Define freshness SLO; webhook/CDC; debounce bursts; selective re-embed; pair DB+RAG when metrics move; surface lag honestly.
35 · Per-modality rules; headers in table chunks; code at function granularity; honest handling of image-heavy slides.
36 · Parallel dense+sparse with pre-fusion filters; RRF or calibrated weighted fusion; log and dissect failures.
37 · Wide recall → hard filters → expensive rerank on shortlist → MMR packing with citation ids; stage-level metrics.
38 · Decompose; parallelizable vs sequential hops; iterative retrieval with caps; graph when available; verify conflicts.
39 · Clarify cross-lingual needs; multilingual embeddings vs sharded indexes; filter rules; per-locale eval.
40 · Enforce ACL in retrieval API; optional per-tenant indexes; citation scope in UI; automated impersonation tests.
41 · Offline IR + online proxies; stratify metrics; taxonomy drives fixes; regression on golden sets per build.
42 · Rich event schema; reranker/bandit/curation paths; human review; anti-gaming; offline gate before auto-promote.
43 · Hierarchical summaries + clause-aware chunks + multi-pass UX + caps for exhaustive asks.
44 · Extract + validate relations; evidence-linked nodes; community summaries; route queries by complexity; watch ops cost.
45 · Numbered evidence pack + strict instructions + programmatic validation + channel-aware rendering + honest low-evidence path.
46 · Expect conflicts; rank by authority/recency transparently; surface dueling sources; log for content ops; HITL when needed.
47 · Typed corpora; ticket PII masking; router + tools for operational intents; dedupe repetitive tickets; ops metrics.
48 · Soft routing distributions; multi-index search + RRF; user overrides; decomposition for straddling questions; router accuracy metrics.
49 · Multi-signal confidence; bounded fallbacks that never relax ACL; auditable branches; user-visible explanations; tune with telemetry.
50 · Route to SQL and/or RAG; hardened read-only SQL; fused prompt with faithful numeric rules; audit trail.
