In interviews, weak answers jump straight to “we embed and call GPT.” Strong answers treat RAG as a productized data path: authoritative sources, parsing quality, retrieval metrics, grounding rules, and reruns when something drifts. The LLM is the last mile; most incidents come from bad chunks, stale indexes, or missing ACLs—not from the choice of GPT-4 vs GPT-5.
1) Connectors & scheduling. You need repeatable pulls from wikis, blob storage, ticketing systems, or database exports. Each connector should capture cursor or revision state, respect API rate limits, and record last successful sync for operators. Failures should be visible in a dashboard, not only in logs.
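A minimal sketch of the bookkeeping this implies, assuming a hypothetical source API `fetch_changed_pages` and a `store_raw` sink; the point is the persisted cursor and last-success timestamp that operators and dashboards can read, not the specific wiki or ticketing client.

```python
import json
import time
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class SyncState:
    connector: str
    cursor: str | None = None          # source-side revision / change token
    last_success_utc: float | None = None

STATE_DIR = Path("connector_state")    # hypothetical operator-visible state location

def load_state(connector: str) -> SyncState:
    path = STATE_DIR / f"{connector}.json"
    if path.exists():
        return SyncState(**json.loads(path.read_text()))
    return SyncState(connector=connector)

def save_state(state: SyncState) -> None:
    STATE_DIR.mkdir(exist_ok=True)
    (STATE_DIR / f"{state.connector}.json").write_text(json.dumps(asdict(state)))

def run_sync(connector: str, fetch_changed_pages, store_raw) -> None:
    """One incremental pull: resume from the saved cursor, advance it only on success."""
    state = load_state(connector)
    pages, next_cursor = fetch_changed_pages(since=state.cursor)  # assumed source API
    for page in pages:
        store_raw(page)                 # lands in the raw object store
        time.sleep(0.2)                 # crude stand-in for real rate-limit handling
    state.cursor = next_cursor
    state.last_success_utc = time.time()  # what a staleness dashboard would read
    save_state(state)
```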
2) Raw object store. Keep immutable originals (PDF bytes, HTML snapshots, export files) with a content hash. When your PDF parser improves next quarter, you reprocess from raw without asking teams to re-upload. This also helps legal holds and audit (“what did we index on March 3?”).
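One way this can look, sketched with a local directory standing in for blob storage; the SHA-256 content hash is the object key, originals are never overwritten, and an append-only fetch log supports the audit question above.

```python
import hashlib
import json
import time
from pathlib import Path

RAW_ROOT = Path("raw_store")   # hypothetical local stand-in for blob storage

def store_raw(content: bytes, source_uri: str) -> str:
    """Write immutable original bytes keyed by content hash; return the hash."""
    digest = hashlib.sha256(content).hexdigest()
    obj_dir = RAW_ROOT / digest[:2] / digest
    obj_dir.mkdir(parents=True, exist_ok=True)
    blob = obj_dir / "blob"
    if not blob.exists():              # immutable: never overwrite the original bytes
        blob.write_bytes(content)
    # Append-only fetch log answers "what did we index on March 3?"
    with (obj_dir / "fetches.jsonl").open("a") as f:
        f.write(json.dumps({"source_uri": source_uri, "fetched_at": time.time()}) + "\n")
    return digest
```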
3) Parse & normalize. Layout-aware parsing preserves headings, tables, and code blocks—flattening everything to plain text often destroys table structure that users ask about. Run language detection, optional OCR for scans, and, where contracts require it, policy-aware PII scrubbing before text ever reaches embeddings.
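A rough sketch of the normalize step, assuming a layout-aware parser that emits typed blocks (the block shape here is an assumption) and using naive regex scrubbing as a stand-in for a real policy-aware PII service.

```python
import re

# Placeholder patterns; a real deployment would use a policy-aware PII service.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def normalize_blocks(parsed_blocks: list[dict]) -> list[dict]:
    """parsed_blocks: layout-aware parser output, assumed shape
    {"type": "heading"|"paragraph"|"table"|"code", "text": ..., "level": ...}."""
    breadcrumb: list[str] = []
    out = []
    for block in parsed_blocks:
        if block["type"] == "heading":
            level = block.get("level", 1)
            breadcrumb = breadcrumb[: level - 1] + [block["text"]]
            continue
        out.append({
            "breadcrumb": " > ".join(breadcrumb),  # kept for chunk metadata and citations
            "type": block["type"],                 # tables and code blocks stay intact
            "text": scrub_pii(block["text"]),
        })
    return out
```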
4) Chunking as a contract. Define a chunk schema: stable chunk_id, doc_id, source_revision, heading breadcrumb, acl_principal list or tenant tags, and offsets. Chunks are your unit of retrieval, re-embedding, deletion, and citation. If chunk boundaries wander randomly between pipeline versions, your evals become non-comparable.
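A sketch of such a schema, with a deterministic chunk_id derived from doc_id, revision, and offsets so identical rebuilds yield identical IDs; field names beyond those listed above are illustrative.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str          # stable across rebuilds when content and position are unchanged
    doc_id: str
    source_revision: str
    breadcrumb: str        # e.g. "Admin guide > Backups > Restore"
    acl_principals: tuple[str, ...]
    start_offset: int
    end_offset: int
    text: str

def make_chunk(doc_id: str, source_revision: str, breadcrumb: str,
               acl_principals: tuple[str, ...], start: int, end: int, text: str) -> Chunk:
    # Deterministic ID: the same doc, revision, and span always produce the same chunk_id,
    # so re-embeds, deletes, and citations can be diffed between pipeline versions.
    key = f"{doc_id}|{source_revision}|{start}|{end}".encode()
    return Chunk(hashlib.sha256(key).hexdigest()[:16], doc_id, source_revision,
                 breadcrumb, acl_principals, start, end, text)
```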
5) Embeddings & indexes. Pick embedding model + dimension deliberately; changing it is a migration. Store vectors in a service that supports metadata filtering (tenant, product, doc_type). Add BM25 or similar sparse retrieval when users query with SKUs, error codes, or rare jargon that dense models blur together.
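One way to make the migration boundary explicit: pin the embedding model and dimension in an index manifest and refuse mismatched queries at search time. The model name shown is a placeholder.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexManifest:
    index_build_id: str
    embedding_model_id: str   # e.g. "example-embed-v2" (placeholder)
    dimension: int
    metric: str               # "cosine" | "dot" | "l2"

def check_query_compatibility(manifest: IndexManifest,
                              query_model_id: str, query_dim: int) -> None:
    """Refuse to search with vectors from a different embedding model or dimension;
    changing either is a re-embed migration, not a config tweak."""
    if manifest.embedding_model_id != query_model_id:
        raise ValueError(f"index built with {manifest.embedding_model_id}, "
                         f"query embedded with {query_model_id}")
    if manifest.dimension != query_dim:
        raise ValueError(f"dimension mismatch: index {manifest.dimension}, query {query_dim}")
```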
6) Query-time stack. Typical path: optional query rewrite or HyDE → hybrid recall (e.g., top 50–200) → hard metadata filters (never post-filter only in the prompt) → cross-encoder or lightweight reranker on 20–50 candidates → MMR / packing into a token budget with numbered evidence slots for citations.
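A sketch of the recall-and-filter portion of that path, using reciprocal rank fusion (one common way to merge dense and sparse result lists) and applying tenant/ACL constraints as hard filters before anything reaches the reranker.

```python
def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score = sum of 1/(k + rank) over each list a chunk appears in."""
    scores: dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hard_filter(candidate_ids: list[str], metadata: dict[str, dict],
                tenant: str, allowed_principals: set[str]) -> list[str]:
    """Enforce tenant and ACL constraints in the retrieval layer, never only in the prompt."""
    kept = []
    for cid in candidate_ids:
        meta = metadata[cid]
        if meta["tenant"] != tenant:
            continue
        if not allowed_principals & set(meta["acl_principals"]):
            continue
        kept.append(cid)
    return kept

# Typical use: fuse ~50-200 recalled ids, hard-filter, then hand the top 20-50 to a reranker.
```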
7) Generation & citations. Instruct the model to quote only from the numbered passages; add a validator that checks cited slots exist and (where possible) spot-checks substring overlap. For high-risk domains, reject answers that cite nothing or invent URLs—fail closed to human review or “insufficient evidence.”
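A minimal validator along those lines: it fails closed when citations are missing, point at nonexistent slots, or quote text that does not appear in the cited passage. The evidence-slot format is assumed to be numbered passages keyed by integer.

```python
import re

def validate_citations(answer: str, evidence: dict[int, str],
                       min_overlap_chars: int = 20) -> bool:
    """Fail closed: every [n] must name a real evidence slot, at least one citation must exist,
    and quoted spans should appear in the cited passage."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    if not cited or not cited.issubset(evidence.keys()):
        return False
    # Spot-check: any long quoted string followed by a citation should overlap its passage.
    pattern = r'"([^"]{' + str(min_overlap_chars) + r',})"\s*\[(\d+)\]'
    for quote, n in re.findall(pattern, answer):
        if quote.lower() not in evidence[int(n)].lower():
            return False
    return True

# Example: returns True because [2] exists and the quote appears in passage 2.
# validate_citations('The limit is "500 requests per minute" [2]',
#                    {1: "Unrelated text.", 2: "Rate limit: 500 requests per minute."})
```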
8) Evaluation & governance. Offline labeled sets for recall@k/MRR by segment; online satisfaction, citation clicks, and escalation reasons. Version together: prompt_id, embedding_model_id, index_build_id. When quality drops, you can diff exactly what changed.
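A small offline-eval sketch: recall@k and MRR aggregated per segment, with the run tagged by the version bundle above (the IDs shown are placeholders, and the example shape is an assumption).

```python
from collections import defaultdict

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    return len(set(ranked_ids[:k]) & relevant_ids) / max(len(relevant_ids), 1)

def mrr(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0

def evaluate_by_segment(examples: list[dict], k: int = 10) -> dict[str, dict[str, float]]:
    """examples: [{"segment": ..., "ranked_ids": [...], "relevant_ids": {...}}] (assumed shape)."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["segment"]].append(ex)
    report = {}
    for segment, exs in buckets.items():
        report[segment] = {
            f"recall@{k}": sum(recall_at_k(e["ranked_ids"], e["relevant_ids"], k) for e in exs) / len(exs),
            "mrr": sum(mrr(e["ranked_ids"], e["relevant_ids"]) for e in exs) / len(exs),
        }
    return report

# Tag each eval run with the full version bundle so regressions can be diffed (placeholder IDs):
RUN_VERSIONS = {"prompt_id": "p-014", "embedding_model_id": "example-embed-v2", "index_build_id": "2024-06-01"}
```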
Production RAG — major components
flowchart LR
SRC[Sources] --> RAW[Raw store]
RAW --> PARSE[Parse and chunk]
PARSE --> EMB[Embed]
EMB --> VEC[(Vector index)]
PARSE --> SP[(Sparse index)]
Q[User query] --> RET[Retrieve and rerank]
VEC --> RET
SP --> RET
RET --> GEN[LLM + citations]