Fifteen staff-depth scenarios on making search good enough before generation: hybrid dense+sparse recall and fusion; learned sparse serving; late-interaction gating and compression; cross-encoder cost controls; typed multi-stage pipelines; HyDE with contamination guards; adaptive k policies; hard-negative mining; structured table retrieval; multilingual routing; freshness-aware ranking; multi-index fusion with degradation; session-aware state; retriever-first benchmarks; and disciplined domain-embedding rollouts.
Interview stance. Advanced retrieval is systems engineering: recall/fusion/rerank/pack each have budgets and failure modes. Show you can reason about score semantics (when RRF beats linear blend), latency stacks, and why a fancier retriever isn’t always the win versus chunking or hybrid baselines.
Hybrid search is default—dense-only is a smell unless offline eval proves dense-alone is enough.
Every fusion rule is a versioned product decision with eval evidence, not a notebook tweak.
Late-interaction and giant cross-encoders need coarse-to-fine serving stories, not brute force.
Multi-index systems fail partial—design degradation per corpus, not all-or-nothing errors.
176. How would you design hybrid dense + sparse retrieval for RAG at scale, and how do you fuse scores?
Why hybrid. Dense embeddings excel at paraphrase; BM25/SPLADE-style signals catch SKUs, error codes, rare entity strings, and neologisms dense models smoosh—production RAG usually needs both.
Recall stage. Run each channel with a generous k (in parallel or staggered), union the candidates, dedupe by doc id, and carry both channel scores forward; do not throw away channel identity before fusion.
Fusion recipes. Weighted linear combinations after per-query score normalization (min-max or z-score), RRF (Reciprocal Rank Fusion) when scores are incommensurable, or learned blend tables tuned offline; pick one and version it (see the RRF sketch below).
Latency. Cache BM25 segments; shard posting lists; cap worst-case fan-out with timeouts and partial results.
Observability. Log per-channel hit rate; a silent sparse-channel failure should page someone, because degrading to dense-only changes the answers' DNA.
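A minimal sketch of the RRF recipe above, assuming each channel returns doc ids already sorted best-first; the constant k=60 is the common default, not a tuned value.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion over multiple channels.

    ranked_lists: one list of doc ids per channel, sorted best-first.
    Raw scores are ignored on purpose: RRF only uses ranks, which is
    why it works when channel scores are incommensurable.
    """
    fused = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Dense and sparse disagree on order; fusion keeps both signals.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]
print(rrf_fuse([dense, sparse]))
```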
177. How would you integrate learned sparse retrieval (e.g., SPLADE-style models) into a production stack?
Serving. Dedicated CPU/GPU microservice emitting sparse vectors or top-k posting lists; batch encoders for ingest, low-latency path for queries.
Indexes. Inverted index tuned for max impact scores; prune tails to control index size.
Training ops. Refresh lexical model when vocab drifts (new product codenames); tie to embedding migrations for coordinated releases.
Tradeoffs. Heavier CPU and storage than vanilla BM25; justify with offline recall lifts on keyword-heavy eval slices.
Ops. Dark launch with shadow traffic; compare nDCG deltas before blending scores for users.
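A sketch of the tail-pruning mentioned under Indexes, assuming the sparse encoder hands back a term-to-weight dict at ingest time; the cutoffs are illustrative and would be set from the offline recall evals.

```python
def prune_sparse_vector(term_weights, max_terms=256, min_weight=0.05):
    """Keep only the highest-impact terms so the inverted index stays
    bounded; everything below the cutoff is dropped before indexing."""
    kept = {t: w for t, w in term_weights.items() if w >= min_weight}
    top = sorted(kept.items(), key=lambda x: x[1], reverse=True)[:max_terms]
    return dict(top)

# term_weights would come from a SPLADE-style encoder during ingest.
term_weights = {"error": 1.2, "code": 0.9, "E4021": 1.8, "the": 0.01}
print(prune_sparse_vector(term_weights))
```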
178. How would you decide when multi-vector or late-interaction retrievers (e.g., ColBERT-style) are worth the cost—and how would you serve them?
When. Short documents with multiple claims, legal clauses, or code where a single centroid vector loses precision—measure on golden queries before committing.
Storage & QPS. Token-level vectors balloon RAM; use compression, clustering, or PLAID-style pruning; expect higher dollar-per-query.
Serving path. Two-stage: cheap dense retriever narrows corpus, late interaction reranks hundreds of candidates—not full corpus scans online unless you own the infra story.
Eval. Compare to strong cross-encoder rerank baseline; sometimes reranker upgrades move the same needle cheaper.
Interview note. Say out loud that you would prototype and measure rather than cargo-cult architectures from tweets.
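A toy sketch of the late-interaction scoring used in that second stage, assuming query and document token embeddings are already L2-normalized numpy arrays; real deployments add compression and PLAID-style pruning on top.

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: each query token takes its
    best-matching document token, and the score is the sum."""
    sims = query_tokens @ doc_tokens.T        # (q_len, d_len) cosine sims
    return float(sims.max(axis=1).sum())      # best doc token per query token

def rerank(query_tokens, candidates):
    """candidates: list of (doc_id, doc_token_matrix) from the cheap
    dense recall stage; only these get the expensive scoring."""
    scored = [(doc_id, maxsim_score(query_tokens, toks)) for doc_id, toks in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Tiny synthetic example with 8-dim token embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(12, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(rerank(q, [("doc_a", d)]))
```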
179. How would you trade off cross-encoder reranking quality versus latency and cost in a high-QPS RAG system?
Funnel discipline. Bi-encoder recalls hundreds, cross-encoder scores dozens, LLM sees single digits—tight caps on cross pairs per query.
Mini models. Distilled rerankers or shallow transformers for the first pass; reserve the large cross-encoder for escalations.
Batching. Dynamic batching inside GPU workers with SLA timers so tail latency does not explode.
Caching. Cache rerank results keyed by (query_hash, candidate_id set) for hot FAQs—invalidate on index bump.
Product modes. ‘Fast’ vs ‘accurate’ user toggles map to different cross budgets.
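A sketch of the funnel caps and cache key described above; score_pairs stands in for whichever cross-encoder client you run, and the budget numbers are placeholders.

```python
import hashlib

RERANK_CAP = 50   # max candidate pairs the cross-encoder may see per query
FINAL_K = 5       # what the LLM ultimately receives
_cache = {}       # swap for Redis/memcached in production

def cache_key(query, candidate_ids, index_version):
    ids = ",".join(sorted(candidate_ids))
    raw = f"{index_version}|{query}|{ids}".encode()
    return hashlib.sha256(raw).hexdigest()

def rerank_with_budget(query, candidates, score_pairs, index_version="v1"):
    """candidates: list of (doc_id, text) from recall, best-first.
    score_pairs: callable scoring (query, text) pairs, e.g. a cross-encoder.
    The index_version in the key invalidates the cache on index bumps."""
    capped = candidates[:RERANK_CAP]
    key = cache_key(query, [doc_id for doc_id, _ in capped], index_version)
    if key not in _cache:
        scores = score_pairs([(query, text) for _, text in capped])
        ranked = sorted(zip(capped, scores), key=lambda x: x[1], reverse=True)
        _cache[key] = [doc_id for (doc_id, _text), _score in ranked[:FINAL_K]]
    return _cache[key]
```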
180. How would you architect a multi-stage retrieval pipeline (retrieve → rerank → pack) with clear contracts between stages?
Interfaces. Each stage accepts typed objects: QueryPlan, CandidateChunk lists with scores + provenance—no undocumented dicts crossing team boundaries.
Failure handling. If rerank times out, fall back to pre-rank order with a logged degradation flag; never pass a null result to the downstream LLM.
Observability. Span per stage with in/out counts; SLO per hop.
Extensibility. Plug in a third stage (graph expand, intent router) behind feature flags with compatibility tests.
Testing. Golden fixtures assert stable ordering on frozen candidate sets when the middle (rerank) model updates.
Retrieve funnel
```mermaid
flowchart LR
  Q[Query] --> R1[Recall]
  R1 --> R2[Rerank]
  R2 --> P[Pack context]
  P --> L[LLM]
```
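One way to make the stage contracts concrete, sketched with Python dataclasses; the field names and the (chunks, degraded) return shape are illustrative, not a shared schema.

```python
from dataclasses import dataclass, field

@dataclass
class QueryPlan:
    raw_query: str
    locale: str = "en"
    filters: dict = field(default_factory=dict)

@dataclass
class CandidateChunk:
    doc_id: str
    text: str
    score: float
    provenance: str   # which channel/index produced this chunk

def rerank_stage(plan, candidates, reranker, timeout_s=0.2):
    """Contract: takes and returns a list of CandidateChunk plus a
    degraded flag. On timeout, fall back to the recall order so the
    downstream LLM never sees a null result."""
    try:
        return reranker(plan, candidates, timeout=timeout_s), False
    except TimeoutError:
        return candidates, True   # degraded=True gets logged, not crashed on
```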
181. How would you use query expansion or HyDE (hypothetical document embeddings) responsibly in production RAG?
Goal. Improve recall when user phrasing mismatches corpus—especially technical support with jargon variance.
Risk. Hypothetical text may hallucinate constraints that poison retrieval; always retrieve with expanded AND original query or constrain expansion with templates.
Cost. Extra LLM calls per query; route expansion only to queries with low-confidence lexical matches or ambiguous intents.
Eval. Offline A/B on recall@k with contamination checks—ensure expansion doesn’t pull wrong product lines.
UX honesty. If expansion triggers, optionally show ‘searching related phrasings’ for trust.
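A sketch of the "expanded AND original query" guard, where generate_hypothetical, embed, and search stand in for your LLM, embedding, and index clients; the confidence gate and its threshold are placeholders.

```python
def hyde_retrieve(query, generate_hypothetical, embed, search, k=20, confidence=1.0):
    """Only expand when lexical confidence is low, and always keep the
    original query in the candidate pool so a hallucinated hypothetical
    document cannot fully poison retrieval."""
    queries = [query]
    if confidence < 0.5:                           # illustrative low-confidence gate
        queries.append(generate_hypothetical(query))   # short templated draft answer
    results = {}
    for q in queries:
        for doc_id, score in search(embed(q), k=k):
            results[doc_id] = max(score, results.get(doc_id, float("-inf")))
    return sorted(results.items(), key=lambda x: x[1], reverse=True)[:k]
```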
182. How would you implement adaptive retrieval depth (dynamic k) based on query difficulty or confidence?
Signals. Bi-encoder score margin, query length, ambiguity classifiers, or fast first-pass BM25 saturation.
Policy. Raise k and enable the reranker only when margins are low; keep the snappy path for obvious FAQs.
Guardrails. Hard caps to prevent runaway context or token bombs in agents.
Measurement. Compare cost vs escalation rate; adaptive k should save money without hurting success metrics.
Fallback. If signals misclassify, user-facing ‘expand search’ button increases k deterministically.
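A sketch of that policy, assuming the cheap first pass returns scored candidates best-first; the thresholds and k values are placeholders to tune against the cost and escalation metrics above.

```python
def choose_retrieval_depth(first_pass, margin_threshold=0.15,
                           base_k=5, deep_k=25, hard_cap=50):
    """first_pass: list of (doc_id, score), best-first, from the cheap path.
    A small margin between the top hits signals an ambiguous query, so
    retrieval deepens and the reranker turns on; hard_cap prevents token bombs."""
    if len(first_pass) < 2:
        return min(deep_k, hard_cap), True
    margin = first_pass[0][1] - first_pass[1][1]
    if margin >= margin_threshold:
        return base_k, False                 # confident: snappy path, no reranker
    return min(deep_k, hard_cap), True       # ambiguous: deeper k plus reranker

k, use_reranker = choose_retrieval_depth([("doc_a", 0.91), ("doc_b", 0.88)])
```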
183. How would you mine hard negatives and construct training data to improve retrievers or rerankers?
Sources. Top non-clicked results, cross-encoder false positives, user thumbs-down pairs, and synthetic near-duplicates from back-translation.
Hygiene. Verify negatives are truly irrelevant—borderline pairs teach noise; use consensus labeling.
Balance. Use a curriculum from easy to hard; oversample hard negatives without collapsing diversity.
Safety. Strip PII before human review queues; prevent toxic text in contrastive batches.
Iteration. Freeze evaluation harness before iterating mining rules—otherwise you optimize the miner, not retrieval.
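A sketch of the hygiene step: a mined negative only enters training when a trusted judge (a cross-encoder or consensus label, here a judge_score stand-in) is confident it is worse than the positive, so borderline pairs are dropped rather than taught as noise.

```python
def build_triplets(query, positive, mined_negatives, judge_score, margin=0.3):
    """Keep a mined negative only when the judge scores it clearly below
    the positive by at least `margin`; borderline pairs are discarded."""
    pos_score = judge_score(query, positive)
    triplets = []
    for neg in mined_negatives:
        if judge_score(query, neg) <= pos_score - margin:
            triplets.append((query, positive, neg))
    return triplets
```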
184. How would you design retrieval when the knowledge base is mostly structured tables, not prose documents?
Representation. Render rows as templated natural-language ‘facts,’ maintain primary keys in metadata, or dual-path with SQL/BI tools instead of naive chunking.
Hybrid. Combine text answers with numeric queries executed against warehouse for freshness.
Chunking traps. Wide tables need row-level or cell-level strategies; avoid shuffling unrelated columns into one blob.
Eval. Numeric tolerance checks—retrieval may be ‘right row’ but LLM must not garble units.
Tooling. Expose structured tools to agents with strict schema validation.
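A sketch of the row-as-fact representation, assuming a simple dict-per-row shape; the template wording and metadata fields are illustrative.

```python
def row_to_fact(row, table_name):
    """Render one row as a natural-language 'fact' chunk while keeping
    the table and primary key in metadata, so answers can be traced back
    or re-queried via SQL for fresh numbers."""
    text = (f"In {table_name}, product {row['sku']} is priced at "
            f"{row['price']} {row['currency']} with {row['stock']} units in stock.")
    return {
        "text": text,
        "metadata": {"table": table_name, "primary_key": row["sku"]},
    }

row = {"sku": "A-1042", "price": 19.99, "currency": "USD", "stock": 112}
print(row_to_fact(row, "inventory"))
```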
185. How would you architect multilingual retrieval when queries and documents span many languages?
Embeddings. Choose multilingual models with documented language coverage; monitor per-language calibration, since scores drift for some languages.
Language routing. Detect query language; optionally restrict retrieval to matching doc locales to reduce cross-language false friends.
Tokenization fairness. Byte-pair vocab imbalance can hurt smaller languages; track per-locale latency and quality slices.
Content policy. Moderation and legal rules differ by locale—metadata filters before ranking.
Fallback. Machine-translate the query as a last resort, with confidence disclaimers and human review for regulated content.
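A sketch of the routing step, with detect_language and translate standing in for whichever detector and MT service you trust; the locale filter is the false-friend guard described above.

```python
def route_multilingual_query(query, detect_language, search, translate=None, k=20):
    """Restrict retrieval to documents in the query's locale to avoid
    cross-language false friends; only translate as a last resort when
    the locale has no coverage, and flag that path for the UI."""
    lang = detect_language(query)
    hits = search(query, filters={"locale": lang}, k=k)
    if not hits and translate is not None:
        translated = translate(query, target="en")
        hits = search(translated, filters={"locale": "en"}, k=k)
    return lang, hits
```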
186. How would you incorporate document freshness or business priority into vector retrieval beyond cosine similarity?