Fifteen staff-depth scenarios on making search good enough before generation: hybrid dense+sparse recall and fusion; learned sparse serving; late-interaction gating and compression; cross-encoder cost controls; typed multi-stage pipelines; HyDE with contamination guards; adaptive k policies; hard-negative mining; structured table retrieval; multilingual routing; freshness-aware ranking; multi-index fusion with degradation; session-aware state; retriever-first benchmarks; and disciplined domain-embedding rollouts.
Interview stance. Advanced retrieval is systems engineering: recall/fusion/rerank/pack each have budgets and failure modes. Show you can reason about score semantics (when RRF beats linear blend), latency stacks, and why a fancier retriever isn’t always the win versus chunking or hybrid baselines.
Hybrid search is default—dense-only is a smell unless offline eval proves dense-alone is enough.
Every fusion rule is a versioned product decision with eval evidence, not a notebook tweak.
Late-interaction and giant cross-encoders need coarse-to-fine serving stories, not brute force.
Multi-index systems fail partial—design degradation per corpus, not all-or-nothing errors.
176. How would you design hybrid dense + sparse retrieval for RAG at scale, and how do you fuse scores?
Why hybrid. Dense embeddings excel at paraphrase; BM25/SPLADE-style signals catch SKUs, error codes, rare entity strings, and neologisms dense models smoosh—production RAG usually needs both.
Recall stage. Run each channel with a generous k (in parallel or staggered), union the candidates, dedupe by doc id, and carry both channel scores forward; do not throw away channel identity before fusion.
Fusion recipes. Weighted linear combinations after per-query score normalization (min-max or z-score), RRF (Reciprocal Rank Fusion) when scores are incommensurable, or learned blend tables tuned offline; pick one and version it (see the RRF sketch below).
Latency. Cache BM25 segments; shard posting lists; cap worst-case fan-out with timeouts and partial results.
Observability. Log per-channel hit rate; a silent sparse-channel failure should page someone, because degrading to dense-only changes the answers' DNA.
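A minimal sketch of the RRF recipe above, assuming each channel returns doc ids already sorted best-first; the constant k=60 is the common default, not a tuned value.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion over multiple channels.

    ranked_lists: one list of doc ids per channel, sorted best-first.
    Raw scores are ignored on purpose: RRF only uses ranks, which is
    why it works when channel scores are incommensurable.
    """
    fused = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Dense and sparse disagree on order; fusion keeps both signals.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]
print(rrf_fuse([dense, sparse]))
```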
177. How would you integrate learned sparse retrieval (e.g., SPLADE-style models) into a production stack?
Serving. Dedicated CPU/GPU microservice emitting sparse vectors or top-k posting lists; batch encoders for ingest, low-latency path for queries.
Indexes. Inverted index tuned for max impact scores; prune tails to control index size.
Training ops. Refresh lexical model when vocab drifts (new product codenames); tie to embedding migrations for coordinated releases.
Tradeoffs. Heavier CPU and storage than vanilla BM25; justify with offline recall lifts on keyword-heavy eval slices.
Ops. Dark launch with shadow traffic; compare nDCG deltas before blending scores for users.
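A sketch of the tail-pruning mentioned under Indexes, assuming the sparse encoder hands back a term-to-weight dict at ingest time; the cutoffs are illustrative and would be set from the offline recall evals.

```python
def prune_sparse_vector(term_weights, max_terms=256, min_weight=0.05):
    """Keep only the highest-impact terms so the inverted index stays
    bounded; everything below the cutoff is dropped before indexing."""
    kept = {t: w for t, w in term_weights.items() if w >= min_weight}
    top = sorted(kept.items(), key=lambda x: x[1], reverse=True)[:max_terms]
    return dict(top)

# term_weights would come from a SPLADE-style encoder during ingest.
term_weights = {"error": 1.2, "code": 0.9, "E4021": 1.8, "the": 0.01}
print(prune_sparse_vector(term_weights))
```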
178. How would you decide when multi-vector or late-interaction retrievers (e.g., ColBERT-style) are worth the cost—and how would you serve them?
When. Short documents with multiple claims, legal clauses, or code where a single centroid vector loses precision—measure on golden queries before committing.
Storage & QPS. Token-level vectors balloon RAM; use compression, clustering, or PLAID-style pruning; expect higher dollar-per-query.
Serving path. Two-stage: cheap dense retriever narrows corpus, late interaction reranks hundreds of candidates—not full corpus scans online unless you own the infra story.
Eval. Compare to strong cross-encoder rerank baseline; sometimes reranker upgrades move the same needle cheaper.
Interview note. Say out loud that you would prototype and measure rather than cargo-cult architectures from tweets.
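A toy sketch of the late-interaction scoring used in that second stage, assuming query and document token embeddings are already L2-normalized numpy arrays; real deployments add compression and PLAID-style pruning on top.

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: each query token takes its
    best-matching document token, and the score is the sum."""
    sims = query_tokens @ doc_tokens.T        # (q_len, d_len) cosine sims
    return float(sims.max(axis=1).sum())      # best doc token per query token

def rerank(query_tokens, candidates):
    """candidates: list of (doc_id, doc_token_matrix) from the cheap
    dense recall stage; only these get the expensive scoring."""
    scored = [(doc_id, maxsim_score(query_tokens, toks)) for doc_id, toks in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Tiny synthetic example with 8-dim token embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(12, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(rerank(q, [("doc_a", d)]))
```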
179. How would you trade off cross-encoder reranking quality versus latency and cost in a high-QPS RAG system?
Funnel discipline. Bi-encoder recalls hundreds, cross-encoder scores dozens, LLM sees single digits—tight caps on cross pairs per query.
Mini models. Distilled rerankers or shallow transformers for the first pass; reserve the large cross-encoder for escalations.
Batching. Dynamic batching inside GPU workers with SLA timers so tail latency does not explode.
Caching. Cache rerank results keyed by (query_hash, candidate_id set) for hot FAQs—invalidate on index bump.
Product modes. ‘Fast’ vs ‘accurate’ user toggles map to different cross budgets.
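A sketch of the funnel caps and cache key described above; score_pairs stands in for whichever cross-encoder client you run, and the budget numbers are placeholders.

```python
import hashlib

RERANK_CAP = 50   # max candidate pairs the cross-encoder may see per query
FINAL_K = 5       # what the LLM ultimately receives
_cache = {}       # swap for Redis/memcached in production

def cache_key(query, candidate_ids, index_version):
    ids = ",".join(sorted(candidate_ids))
    raw = f"{index_version}|{query}|{ids}".encode()
    return hashlib.sha256(raw).hexdigest()

def rerank_with_budget(query, candidates, score_pairs, index_version="v1"):
    """candidates: list of (doc_id, text) from recall, best-first.
    score_pairs: callable scoring (query, text) pairs, e.g. a cross-encoder.
    The index_version in the key invalidates the cache on index bumps."""
    capped = candidates[:RERANK_CAP]
    key = cache_key(query, [doc_id for doc_id, _ in capped], index_version)
    if key not in _cache:
        scores = score_pairs([(query, text) for _, text in capped])
        ranked = sorted(zip(capped, scores), key=lambda x: x[1], reverse=True)
        _cache[key] = [doc_id for (doc_id, _text), _score in ranked[:FINAL_K]]
    return _cache[key]
```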
180. How would you architect a multi-stage retrieval pipeline (retrieve → rerank → pack) with clear contracts between stages?
Interfaces. Each stage accepts typed objects: QueryPlan, CandidateChunk lists with scores + provenance—no undocumented dicts crossing team boundaries.
Failure handling. If rerank times out, fall back to pre-rank order with a logged degradation flag; never pass a null result to the downstream LLM.
Observability. Span per stage with in/out counts; SLO per hop.
Extensibility. Plug in a third stage (graph expand, intent router) behind feature flags with compatibility tests.
Testing. Golden fixtures assert stable ordering on frozen candidate sets when the middle (rerank) model updates.
Retrieve funnel
```mermaid
flowchart LR
  Q[Query] --> R1[Recall]
  R1 --> R2[Rerank]
  R2 --> P[Pack context]
  P --> L[LLM]
```
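One way to make the stage contracts concrete, sketched with Python dataclasses; the field names and the (chunks, degraded) return shape are illustrative, not a shared schema.

```python
from dataclasses import dataclass, field

@dataclass
class QueryPlan:
    raw_query: str
    locale: str = "en"
    filters: dict = field(default_factory=dict)

@dataclass
class CandidateChunk:
    doc_id: str
    text: str
    score: float
    provenance: str   # which channel/index produced this chunk

def rerank_stage(plan, candidates, reranker, timeout_s=0.2):
    """Contract: takes and returns a list of CandidateChunk plus a
    degraded flag. On timeout, fall back to the recall order so the
    downstream LLM never sees a null result."""
    try:
        return reranker(plan, candidates, timeout=timeout_s), False
    except TimeoutError:
        return candidates, True   # degraded=True gets logged, not crashed on
```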
181. How would you use query expansion or HyDE (hypothetical document embeddings) responsibly in production RAG?
Goal. Improve recall when user phrasing mismatches corpus—especially technical support with jargon variance.
Risk. Hypothetical text may hallucinate constraints that poison retrieval; always retrieve with expanded AND original query or constrain expansion with templates.
Cost. Extra LLM calls per query; route expansion only to queries with low-confidence lexical matches or ambiguous intents.
Eval. Offline A/B on recall@k with contamination checks—ensure expansion doesn’t pull wrong product lines.
UX honesty. If expansion triggers, optionally show ‘searching related phrasings’ for trust.
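A sketch of the "expanded AND original query" guard, where generate_hypothetical, embed, and search stand in for your LLM, embedding, and index clients; the confidence gate and its threshold are placeholders.

```python
def hyde_retrieve(query, generate_hypothetical, embed, search, k=20, confidence=1.0):
    """Only expand when lexical confidence is low, and always keep the
    original query in the candidate pool so a hallucinated hypothetical
    document cannot fully poison retrieval."""
    queries = [query]
    if confidence < 0.5:                           # illustrative low-confidence gate
        queries.append(generate_hypothetical(query))   # short templated draft answer
    results = {}
    for q in queries:
        for doc_id, score in search(embed(q), k=k):
            results[doc_id] = max(score, results.get(doc_id, float("-inf")))
    return sorted(results.items(), key=lambda x: x[1], reverse=True)[:k]
```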
182. How would you implement adaptive retrieval depth (dynamic k) based on query difficulty or confidence?
Signals. Bi-encoder score margin, query length, ambiguity classifiers, or fast first-pass BM25 saturation.
Policy. Raise k and enable the reranker only when margins are low; keep the snappy path for obvious FAQs.
Guardrails. Hard caps to prevent runaway context or token bombs in agents.
Measurement. Compare cost vs escalation rate; adaptive k should save money without hurting success metrics.
Fallback. If signals misclassify, user-facing ‘expand search’ button increases k deterministically.
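A sketch of that policy, assuming the cheap first pass returns scored candidates best-first; the thresholds and k values are placeholders to tune against the cost and escalation metrics above.

```python
def choose_retrieval_depth(first_pass, margin_threshold=0.15,
                           base_k=5, deep_k=25, hard_cap=50):
    """first_pass: list of (doc_id, score), best-first, from the cheap path.
    A small margin between the top hits signals an ambiguous query, so
    retrieval deepens and the reranker turns on; hard_cap prevents token bombs."""
    if len(first_pass) < 2:
        return min(deep_k, hard_cap), True
    margin = first_pass[0][1] - first_pass[1][1]
    if margin >= margin_threshold:
        return base_k, False                 # confident: snappy path, no reranker
    return min(deep_k, hard_cap), True       # ambiguous: deeper k plus reranker

k, use_reranker = choose_retrieval_depth([("doc_a", 0.91), ("doc_b", 0.88)])
```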
183. How would you mine hard negatives and construct training data to improve retrievers or rerankers?
Sources. Top non-clicked results, cross-encoder false positives, user thumbs-down pairs, and synthetic near-duplicates from back-translation.
Hygiene. Verify negatives are truly irrelevant—borderline pairs teach noise; use consensus labeling.
Balance. Use a curriculum from easy to hard; oversample hard negatives without collapsing diversity.
Safety. Strip PII before human review queues; prevent toxic text in contrastive batches.
Iteration. Freeze evaluation harness before iterating mining rules—otherwise you optimize the miner, not retrieval.
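A sketch of the hygiene step: a mined negative only enters training when a trusted judge (a cross-encoder or consensus label, here a judge_score stand-in) is confident it is worse than the positive, so borderline pairs are dropped rather than taught as noise.

```python
def build_triplets(query, positive, mined_negatives, judge_score, margin=0.3):
    """Keep a mined negative only when the judge scores it clearly below
    the positive by at least `margin`; borderline pairs are discarded."""
    pos_score = judge_score(query, positive)
    triplets = []
    for neg in mined_negatives:
        if judge_score(query, neg) <= pos_score - margin:
            triplets.append((query, positive, neg))
    return triplets
```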
184. How would you design retrieval when the knowledge base is mostly structured tables, not prose documents?
Representation. Render rows as templated natural-language ‘facts,’ maintain primary keys in metadata, or dual-path with SQL/BI tools instead of naive chunking.
Hybrid. Combine text answers with numeric queries executed against warehouse for freshness.
Chunking traps. Wide tables need row-level or cell-level strategies; avoid shuffling unrelated columns into one blob.
Eval. Numeric tolerance checks—retrieval may be ‘right row’ but LLM must not garble units.
Tooling. Expose structured tools to agents with strict schema validation.
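A sketch of the row-as-fact representation, assuming a simple dict-per-row shape; the template wording and metadata fields are illustrative.

```python
def row_to_fact(row, table_name):
    """Render one row as a natural-language 'fact' chunk while keeping
    the table and primary key in metadata, so answers can be traced back
    or re-queried via SQL for fresh numbers."""
    text = (f"In {table_name}, product {row['sku']} is priced at "
            f"{row['price']} {row['currency']} with {row['stock']} units in stock.")
    return {
        "text": text,
        "metadata": {"table": table_name, "primary_key": row["sku"]},
    }

row = {"sku": "A-1042", "price": 19.99, "currency": "USD", "stock": 112}
print(row_to_fact(row, "inventory"))
```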
185. How would you architect multilingual retrieval when queries and documents span many languages?
Embeddings. Choose multilingual models with documented language coverage; monitor per-language calibration, since scores drift for some languages.
Language routing. Detect query language; optionally restrict retrieval to matching doc locales to reduce cross-language false friends.
Tokenization fairness. Byte-pair vocab imbalance can hurt smaller languages; track per-locale latency and quality slices.
Content policy. Moderation and legal rules differ by locale—metadata filters before ranking.
Fallback. Machine-translate the query as a last resort, with confidence disclaimers and human review for regulated content.
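A sketch of the routing step, with detect_language and translate standing in for whichever detector and MT service you trust; the locale filter is the false-friend guard described above.

```python
def route_multilingual_query(query, detect_language, search, translate=None, k=20):
    """Restrict retrieval to documents in the query's locale to avoid
    cross-language false friends; only translate as a last resort when
    the locale has no coverage, and flag that path for the UI."""
    lang = detect_language(query)
    hits = search(query, filters={"locale": lang}, k=k)
    if not hits and translate is not None:
        translated = translate(query, target="en")
        hits = search(translated, filters={"locale": "en"}, k=k)
    return lang, hits
```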
186. How would you incorporate document freshness or business priority into vector retrieval beyond cosine similarity?