Production-grade RAG for 10 million internal documents
You are designing a company-wide RAG system from scratch: about 10 million items across PDFs, Confluence, Slack, and database records. The panel wants the full path—ingest → chunk → embed → store → retrieve → rerank → generate—and where it breaks in production.
Scenario
Design a production-grade RAG pipeline from scratch.
Your company has 10 million internal documents across PDFs, Confluence pages, Slack threads, and database records. Design the full pipeline—ingestion, chunking, embedding, storage, retrieval, re-ranking, and generation.
Where are the failure points, and how do you handle them?
What you should be able to do after reading:
- Open with clarifying questions, then a one-minute architecture story before diving into boxes.
- Describe a unified document model and ACL-aware indexes—not four siloed search tools.
- Walk each stage with tradeoffs: batch vs streaming ingest, hybrid retrieval, rerank budget, citation-first generation.
- Name failure points with concrete mitigations (staleness, poison docs, permission leaks, index drift).
Step 0 — How to open the session (first five minutes)
Strong candidates do not draw microservices immediately. They frame the problem:
- Confirm users and risk. All employees? Support bots only? Are Slack DMs in scope or public channels only?
- Confirm freshness. “Good enough” lag: Confluence in 15 minutes, Slack in 1 minute, warehouse tables hourly?
- Confirm security. Retrieval must enforce document ACLs—no “search then filter” that leaks titles in logs.
- Confirm SLAs. p95 query latency (e.g. 3–8s end-to-end), availability, and budget for GPU rerank + LLM.
- Then present the macro pipeline and napkin math so scale feels grounded.
Step 1 — Clarifying questions (show senior judgment)
| Topic | Questions you ask | Why it matters |
|---|---|---|
| Scope | Read-only Q&A or also actions (create ticket, post reply)? | Changes orchestration and audit requirements |
| Slack | Threads vs channels; retention; exclude bots and memes? | Noise dominates retrieval if you ingest everything |
| Database rows | Expose raw rows or only approved “fact sheets”? | Schema changes break naive chunking |
| Duplicates | Same policy PDF in SharePoint and Confluence? | Dedup and canonical URL strategy |
| Compliance | PII redaction at ingest? EU residency? | Drives region pinning and delete propagation |
| Eval | Human labels per domain or synthetic only? | You cannot tune rerank without domain gold sets |
Step 2 — The sixty-second answer
I would build one ingestion plane that normalizes every source into a versioned Document + Chunk model in Postgres, fans out to OpenSearch (lexical) and a vector index with the same chunk ids and ACL metadata, runs hybrid retrieval with ACL filters applied inside the index, reranks with a cross-encoder, and generates with mandatory citations and query/audit logging. Ingest is async and idempotent; embeddings are versioned so we can reindex without downtime.
Failure handling is explicit: connector circuit breakers, dead-letter queues, poison-document quarantine, embedding backlog autoscaling, stale-content markers in the UI, and weekly eval gates before rolling new embed models.
Step 3 — Requirements (functional and non-functional)
Functional
| Capability | Behavior |
|---|---|
| Unified search | One query bar across sources; filters by type, space, date, author |
| Grounded answers | LLM uses only retrieved chunks; citations with deep links back to source |
| Freshness | Near-real-time for Slack; minutes for wiki; scheduled for warehouse exports |
| Access control | User sees only chunks their identity can read in the source system |
| Deletes | Removed Confluence page or retracted Slack message disappears from index |
| Admin | Reindex, blocklist noisy spaces, sample queries for debugging |
Non-functional (what “production-grade” means here)
| Dimension | Target (example you can defend) |
|---|---|
| Scale | ~10M logical documents, ~80–150M chunks after chunking |
| Query volume | 2k–10k RAG queries/day, bursts at quarter-end |
| Latency | p95 < 6s: retrieve 400ms, rerank 1.5s, LLM TTFT < 2s |
| Availability | 99.9% on read path; ingest can lag with clear UI badge |
| Cost | Cap $/query via rerank pool size, cache, smaller generator for drafts |
| Audit | Log query, retrieved chunk ids, model version, user id (not raw secrets) |
Step 4 — Napkin math (ground the design)
Assume average mix after filtering noise:
- 10M logical docs → ~120M chunks (PDFs and wikis chunk smaller; Slack threads vary).
- Embedding 120M chunks × 768-dim × 4 bytes (float32) ≈ 370 GB raw vectors (plus index overhead → plan 0.5–1 TB vector tier).
- OpenSearch lexical index often 2–3× source text size → hundreds of GB; shard by `source_type`.
- Ingest throughput: initial backfill 10M docs in 2–4 weeks with 50–100 parallel embed workers; steady state millions of chunk updates/day from Slack alone.
- Embedding cost one-time backfill: order of tens of thousands of dollars on managed embed APIs—justify GPU self-host for steady state.
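The vector-tier arithmetic is easy to check on the spot. The sketch below uses only the inputs from the bullets above (120M chunks, 768 dimensions, 4-byte floats); the function name is illustrative.

```python
# Back-of-envelope sizing. Every input is an assumption taken from the bullets above.
CHUNKS = 120_000_000
DIMS = 768
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(chunks: int, dims: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw vector payload in bytes, before any ANN index overhead."""
    return chunks * dims * bytes_per_value


raw = raw_vector_bytes(CHUNKS, DIMS)
raw_gb = raw / 1e9        # decimal gigabytes
raw_gib = raw / 2**30     # binary gibibytes

print(f"raw vectors: {raw_gb:.0f} GB ({raw_gib:.0f} GiB)")
# prints: raw vectors: 369 GB (343 GiB)
```

Either reading lands on the same order of magnitude as the estimate above, which is all napkin math needs to do; index overhead and replicas are what push the plan toward the 0.5–1 TB tier.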
Phrase that lands well: “Ten million documents is not one vector database problem—it is a data integration and permissions problem with search attached.”
Step 5 — Unified document model (glue across four sources)
Every source connector writes the same envelope:
Document {
  doc_id, source: pdf|confluence|slack|db,
  source_uri, canonical_id,
  title, author, created_at, updated_at,
  acl: { allow_groups[], allow_users[], deny[] },
  content_hash, language, tags[],
  status: active|deleted|quarantined
}
Chunk {
  chunk_id, doc_id, seq,
  text, token_count,
  struct: { page?, table_id?, thread_ts?, row_pk? },
  embed_model_version
}
Postgres (or similar) is the system of record. Search indexes are disposable projections you rebuild from Postgres + object storage blobs.
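How the `acl` field is evaluated is worth stating out loud. The sketch below is one possible reading, not a prescribed one: it assumes `deny` may hold either user ids or group names, that deny overrides allow, and that an empty ACL means no access (deny-by-default). The `Acl` class and `can_read` function are illustrative names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Acl:
    allow_groups: FrozenSet[str] = frozenset()
    allow_users: FrozenSet[str] = frozenset()
    deny: FrozenSet[str] = frozenset()  # assumption: user ids or group names


def can_read(acl: Acl, user_id: str, groups: Set[str]) -> bool:
    """Deny wins over allow; an empty ACL grants nothing (deny-by-default)."""
    principals = {user_id} | set(groups)
    if principals & acl.deny:
        return False
    if user_id in acl.allow_users:
        return True
    return bool(set(groups) & acl.allow_groups)


policy_acl = Acl(
    allow_groups=frozenset({"eng"}),
    allow_users=frozenset({"u1"}),
    deny=frozenset({"contractors"}),
)
print(can_read(policy_acl, "u2", {"eng"}))                 # True
print(can_read(policy_acl, "u2", {"eng", "contractors"}))  # False
print(can_read(Acl(), "u1", {"eng"}))                      # False
```

In the real system this predicate would be expressed as an index-side filter rather than run in application code, but the precedence rules should match.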
Step 6 — End-to-end architecture
flowchart TB
  subgraph sources [Sources]
    PDF[PDF / SharePoint]
    CONF[Confluence]
    SL[Slack]
    DB[(Warehouse / OLTP)]
  end
  subgraph ingest [Ingestion plane]
    CONN[Connectors + CDC]
    NORM[Normalize + ACL map]
    PARSE[Parse / OCR / thread assembly]
    CHUNK[Chunker]
    QUEUE[Kafka ingest topics]
  end
  subgraph index [Indexing plane]
    LEX[(OpenSearch)]
    VEC[(Vector DB)]
    BLOB[(Object storage raw)]
  end
  subgraph serve [Serving plane]
    API[RAG API]
    RET[Hybrid retrieve + ACL]
    RER[Reranker]
    LLM[LLM gateway]
  end
  PDF --> CONN
  CONF --> CONN
  SL --> CONN
  DB --> CONN
  CONN --> NORM --> PARSE --> CHUNK --> QUEUE
  CHUNK --> BLOB
  QUEUE --> LEX
  QUEUE --> VEC
  API --> RET
  RET --> LEX
  RET --> VEC
  RET --> RER --> LLM
Step 7 — Ingestion by source (where designs differ)
PDFs
- Connector watches object storage / ECM; emits `content_hash` on change.
- Layout-aware parse (tables, headers); OCR queue for scans with confidence score.
- Failure: OCR garbage → quarantine bucket; do not index until confidence > threshold.
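The quarantine rule can be sketched as a small gate in front of the chunker. This is an illustration only: the threshold value, the `ParsedPage` shape, and the choice to gate on the worst page are all assumptions, not something the design above fixes.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold; in practice it would be tuned on a labeled sample.
OCR_CONFIDENCE_THRESHOLD = 0.85


@dataclass
class ParsedPage:
    text: str
    ocr_confidence: Optional[float]  # None for born-digital pages (no OCR ran)


def route_document(pages: List[ParsedPage],
                   threshold: float = OCR_CONFIDENCE_THRESHOLD) -> str:
    """Return the Document.status to write: 'quarantined' or 'active'."""
    ocr_scores = [p.ocr_confidence for p in pages if p.ocr_confidence is not None]
    # Gate on the worst page: index only when every OCR page is above the threshold.
    if ocr_scores and min(ocr_scores) <= threshold:
        return "quarantined"
    return "active"


scan = [ParsedPage("Travel policy ...", 0.97), ParsedPage("Tr@v3l p0l1cy", 0.41)]
print(route_document(scan))  # quarantined
```

Gating on the minimum is the conservative choice; a mean or per-page quarantine would index more content at the cost of letting some garbage through.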
Confluence
- Poll REST or webhook on `page_updated`; store space key and page version.
- Strip macros; expand includes once to avoid duplicate chunks.
- Map Confluence restrictions → `acl` groups; re-sync ACL nightly even if body unchanged.
- Failure: permission drift → user sees “access denied” at source link; chunk already evicted from their index view.
Slack
- Ingest threads as one document when possible (parent + replies), not every message as isolated chunk.
- Exclude DMs unless product explicitly allows; drop bot spam via allowlist.
- High churn: use incremental doc ids per thread; tombstone on `message_deleted` events.
- Failure: toxic/noisy channels dominate retrieval → admin blocklist + downrank `source=slack` by default.
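Thread assembly and tombstoning can be sketched as a fold over message events. The event shape here (`type`, `channel`, `ts`, `thread_ts`, `text`) is a simplified, hypothetical one for illustration and is not the actual Slack Events API payload; the `slack:<channel>:<root_ts>` id scheme is likewise an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def assemble_threads(events: List[dict]) -> Dict[str, dict]:
    """Fold message events into one document per thread (parent + replies)."""
    threads: Dict[str, Dict[str, str]] = defaultdict(dict)  # doc_id -> {ts: text}
    for ev in events:
        # A message without thread_ts is its own thread root.
        root_ts = ev.get("thread_ts") or ev["ts"]
        doc_id = f"slack:{ev['channel']}:{root_ts}"
        if ev["type"] == "message":
            threads[doc_id][ev["ts"]] = ev["text"]
        elif ev["type"] == "message_deleted":
            threads[doc_id].pop(ev["ts"], None)

    docs: Dict[str, dict] = {}
    for doc_id, messages in threads.items():
        if not messages:
            # Every message is gone: emit a tombstone so indexes drop the doc.
            docs[doc_id] = {"status": "deleted", "text": ""}
        else:
            ordered = [messages[ts] for ts in sorted(messages, key=float)]
            docs[doc_id] = {"status": "active", "text": "\n".join(ordered)}
    return docs


events = [
    {"type": "message", "channel": "C1", "ts": "1.0", "text": "How do I rotate the API key?"},
    {"type": "message", "channel": "C1", "ts": "2.0", "thread_ts": "1.0", "text": "wrong answer"},
    {"type": "message", "channel": "C1", "ts": "3.0", "thread_ts": "1.0", "text": "Use the vault CLI."},
    {"type": "message_deleted", "channel": "C1", "ts": "2.0", "thread_ts": "1.0"},
]
print(assemble_threads(events)["slack:C1:1.0"]["text"])
```

The retracted reply never reaches the chunker, and a thread whose messages are all deleted produces a tombstone rather than silently disappearing from the event stream.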
Database records
- Do not dump raw SQL rows as prose; emit templated fact sheets instead: “Customer ACME | tier=Gold | ARR=$1.2M”.
- CDC from warehouse; primary key in `canonical_id` for upserts.
- Failure: schema migration changes column names → mapping layer versioned; old chunks deleted by `doc_id` sweep.
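A versioned mapping layer for fact sheets might look like the sketch below. The mapping structure, the column names (`tier_cd`, `arr_fmt`), and the `table:pk` form of `canonical_id` are hypothetical; the point is that a column rename changes only the mapping, not the rendered text.

```python
from typing import Any, Dict

# Hypothetical mapping: physical column -> stable label shown in the fact sheet.
CUSTOMER_MAPPING_V2: Dict[str, Any] = {
    "version": 2,
    "table": "customers",
    "pk": "id",
    "entity": "Customer",
    "name_column": "name",
    "columns": {"tier_cd": "tier", "arr_fmt": "ARR"},
}


def render_fact_sheet(row: Dict[str, Any], mapping: Dict[str, Any]) -> Dict[str, Any]:
    """Render one warehouse row as a short, embeddable fact sheet."""
    title = f"{mapping['entity']} {row[mapping['name_column']]}"
    facts = [
        f"{label}={row[col]}"
        for col, label in mapping["columns"].items()
        if row.get(col) is not None  # skip NULLs instead of embedding "None"
    ]
    return {
        "canonical_id": f"{mapping['table']}:{row[mapping['pk']]}",
        "text": " | ".join([title, *facts]),
        "mapping_version": mapping["version"],
    }


row = {"id": 42, "name": "ACME", "tier_cd": "Gold", "arr_fmt": "$1.2M"}
print(render_fact_sheet(row, CUSTOMER_MAPPING_V2)["text"])
# prints: Customer ACME | tier=Gold | ARR=$1.2M
```

Storing `mapping_version` with each chunk is what makes the sweep after a schema migration possible: anything rendered by an older mapping can be found and replaced.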
Step 8 — Chunking (one size does not fit all)
| Source | Strategy | Typical size |
|---|---|---|
| PDF / long wiki | Structure-aware: headings + paragraphs; tables as row or mini-table chunks | 300–600 tokens |
| Confluence | Respect heading hierarchy; keep code blocks intact | 250–500 tokens |
| Slack thread | Whole thread up to cap; split only if > 2k tokens with overlap | 1 thread = 1–N chunks |
| DB fact sheet | One row or small group per chunk | 50–150 tokens |
Use parent–child linking: small child chunks for retrieval, parent summary (auto-generated) for LLM context expansion after hit.
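A structure-aware chunker for the PDF and wiki rows can be sketched as follows. It is deliberately minimal: token counts are approximated by whitespace splitting (a real pipeline would use the embedding model's tokenizer), the input is assumed to be already parsed into `(heading, paragraphs)` pairs, and overlap and table handling are left out.

```python
from typing import Dict, List, Tuple


def chunk_sections(sections: List[Tuple[str, List[str]]],
                   max_tokens: int = 400) -> List[Dict[str, object]]:
    """Pack whole paragraphs into chunks, never crossing a heading boundary."""
    chunks: List[Dict[str, object]] = []

    def flush(heading: str, buf: List[str], used: int) -> None:
        chunks.append({"heading": heading, "text": "\n\n".join(buf), "token_count": used})

    for heading, paragraphs in sections:
        buf: List[str] = []
        used = 0
        for para in paragraphs:
            n = len(para.split())  # crude token estimate (assumption)
            if buf and used + n > max_tokens:
                flush(heading, buf, used)
                buf, used = [], 0
            # A single paragraph longer than max_tokens becomes its own
            # oversize chunk rather than being cut mid-sentence.
            buf.append(para)
            used += n
        if buf:
            flush(heading, buf, used)

    for seq, chunk in enumerate(chunks):
        chunk["seq"] = seq
    return chunks


doc = [("PTO policy", ["Employees accrue leave monthly.", "Unused leave carries over."]),
       ("Expenses", ["Submit receipts within 30 days."])]
for c in chunk_sections(doc, max_tokens=6):
    print(c["seq"], c["heading"], c["token_count"])
```

Keeping the heading on every chunk gives the parent–child step something to join on when it expands a hit into its surrounding section.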
Step 9 — Embedding and storage
Embedding pipeline
- Version everything: `embed_model_version` on each chunk; dual-write during model upgrades.
- Batch workers consume Kafka; GPU pool for throughput; rate-limit per tenant.
- Idempotent: same `chunk_id + version` overwrites, never duplicates.
- Store raw text in object storage; vectors in vector DB (sharded by `source_type` or tenant).
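The idempotency and dual-write points come down to the key the vector store uses. The sketch below uses an in-memory dictionary as a stand-in for a vector DB, and derives `chunk_id` deterministically with `uuid.uuid5`; both choices are assumptions for illustration.

```python
import uuid
from typing import Dict, List, Optional, Tuple


def make_chunk_id(doc_id: str, seq: int) -> str:
    """Deterministic id: re-processing the same doc yields the same chunk ids."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}#{seq}"))


class InMemoryVectorStore:
    """Stand-in for a vector DB; only the keying scheme matters here."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], List[float]] = {}

    def upsert(self, chunk_id: str, embed_model_version: str, vector: List[float]) -> None:
        # Same (chunk_id, version) overwrites in place; a replayed Kafka
        # message therefore cannot create a duplicate.
        self._rows[(chunk_id, embed_model_version)] = list(vector)

    def count(self, embed_model_version: Optional[str] = None) -> int:
        if embed_model_version is None:
            return len(self._rows)
        return sum(1 for (_, version) in self._rows if version == embed_model_version)


store = InMemoryVectorStore()
cid = make_chunk_id("confluence:12345", 0)
store.upsert(cid, "embed-v1", [0.1, 0.2])
store.upsert(cid, "embed-v1", [0.1, 0.2])   # replay: overwrites, no duplicate
store.upsert(cid, "embed-v2", [0.3, 0.4])   # dual-write during a model upgrade
print(store.count("embed-v1"), store.count("embed-v2"), store.count())  # 1 1 2
```

Because both versions coexist under different keys, the serving path can keep reading the old version until the new one passes its eval gate.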
Why also OpenSearch
Slack handles, ticket ids, error codes, internal acronyms, and exact policy numbers still need lexical match. Production RAG at this scale is hybrid or recall suffers.
Step 10 — Retrieval, re-ranking, generation
Retrieval
- Parse query; apply optional filters (source, date, space).
- Run BM25 and vector search in parallel with ACL filter pushed into both (filter-first, never score-then-filter).
- Fuse with RRF; take top 50–80 candidates.
- Per-source caps so Slack does not swamp PDF policy (e.g. max 30% slack hits).
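Fusion and per-source caps can be sketched in a few lines. The inputs are assumed to be ranked lists of chunk ids that each engine has already ACL-filtered; `k=60` is the constant commonly used with reciprocal rank fusion, and the percentage-based cap helper is an illustrative name and shape.

```python
from typing import Dict, List, Tuple


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by id so output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def cap_per_source(fused_ids: List[str], source_of: Dict[str, str],
                   limit: int, max_pct: Dict[str, int]) -> List[str]:
    """Walk the fused list in order, skipping hits from sources over their cap."""
    kept: List[str] = []
    counts: Dict[str, int] = {}
    for chunk_id in fused_ids:
        src = source_of[chunk_id]
        cap = limit * max_pct.get(src, 100) // 100  # integer math, no float drift
        if counts.get(src, 0) >= cap:
            continue
        kept.append(chunk_id)
        counts[src] = counts.get(src, 0) + 1
        if len(kept) == limit:
            break
    return kept


bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "a", "d"]
fused = [cid for cid, _ in rrf_fuse([bm25_hits, vector_hits])]
sources = {"a": "slack", "b": "slack", "c": "pdf", "d": "pdf"}
print(fused)                                                            # ['a', 'b', 'c', 'd']
print(cap_per_source(fused, sources, limit=10, max_pct={"slack": 10}))  # ['a', 'c', 'd']
```

RRF needs only ranks, so BM25 scores and cosine similarities never have to be put on a common scale.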
Re-ranking
Cross-encoder on the top 40 fused candidates (the rerank budget; anything below rank 40 is dropped) → keep 12–16 chunks. Cache rerank results for repeated enterprise queries (“PTO policy”, “expense report”).
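The rerank budget and cache can be sketched as below. The cross-encoder is passed in as a plain callable (`score_fn`), which is a placeholder for whatever model is deployed. The cache key is an assumption worth calling out: it includes the candidate chunk ids, so two users whose ACL-filtered candidate sets differ never share a cached result.

```python
import hashlib
from typing import Callable, Dict, List, Optional


def rerank(query: str,
           candidates: List[Dict[str, str]],
           score_fn: Callable[[str, str], float],
           pool: int = 40,
           keep: int = 12,
           cache: Optional[Dict[str, List[str]]] = None) -> List[str]:
    """Score the top `pool` candidates and return the best `keep` chunk ids."""
    pool_items = candidates[:pool]
    key_parts = [query.strip().lower(), *sorted(c["chunk_id"] for c in pool_items)]
    key = hashlib.sha256("\x1f".join(key_parts).encode("utf-8")).hexdigest()
    if cache is not None and key in cache:
        return cache[key]

    # sorted() is stable, so ties keep their fused-retrieval order.
    scored = sorted(pool_items, key=lambda c: score_fn(query, c["text"]), reverse=True)
    result = [c["chunk_id"] for c in scored[:keep]]
    if cache is not None:
        cache[key] = result
    return result


def toy_overlap_score(query: str, text: str) -> float:
    """Toy scorer for the demo only; a real system would call a cross-encoder."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


hits = [{"chunk_id": "c1", "text": "expense report deadlines"},
        {"chunk_id": "c2", "text": "pto policy accrual days"},
        {"chunk_id": "c3", "text": "pto request form"}]
print(rerank("PTO policy", hits, toy_overlap_score, keep=2, cache={}))  # ['c2', 'c3']
```

Capping `pool` is the main cost knob: cross-encoder latency grows roughly linearly with the number of (query, chunk) pairs scored.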
Generation
- Prompt with chunk text + citation ids (`[1]` → deep link).
- Stream tokens; show citations inline in UI.
- If retrieval score spread is flat, answer: “I found related material but am not confident”—abstain beats hallucination.
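The citation prompt and the abstain check can be sketched together. The thresholds are hypothetical and depend entirely on the reranker's score scale, and the prompt wording is one possible phrasing rather than a tested template.

```python
from typing import Dict, List

# Hypothetical thresholds; calibrate against the reranker's actual score range.
MIN_TOP_SCORE = 0.30
MIN_SPREAD = 0.15


def should_abstain(scores: List[float],
                   min_top: float = MIN_TOP_SCORE,
                   min_spread: float = MIN_SPREAD) -> bool:
    """Abstain when nothing was retrieved, nothing scored well, or nothing stood out."""
    if not scores:
        return True
    top = max(scores)
    if top < min_top:
        return True
    # A flat spread across several chunks means no chunk clearly answers the query.
    return len(scores) > 1 and (top - min(scores)) < min_spread


def build_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Number each chunk so the model can cite [n] and the UI can deep-link it."""
    sources = "\n".join(
        f"[{i}] {c['text']} (source: {c['source_uri']})"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered sources. Cite each claim as [n]. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


print(should_abstain([0.52, 0.50, 0.49]))  # True: flat spread
print(should_abstain([0.90, 0.40, 0.10]))  # False: one chunk clearly stands out
```

Running the abstain check before calling the LLM also saves the generation cost on queries that would produce a low-confidence answer anyway.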
Step 11 — Failure points and mitigations (core of the question)
| Stage | Failure | What breaks | Mitigation |
|---|---|---|---|
| Connector | API rate limit / token expiry | Stale wiki; silent gaps | Exponential backoff; checkpoint cursors; alert on lag > SLA |
| Parse | OCR / PDF layout fail | Missing tables | Quarantine; human QA sample; fallback “text-only” banner |
| ACL map | Wrong group mapping | Leak or over-deny | Nightly ACL reconcile; deny-by-default; security review on mapper |
| Chunking | Split mid-table / mid-code | Wrong facts | Structure-aware chunker; tests on golden pages |
| Embed queue | Backlog after outage | Users see old content | Autoscale workers; priority lane for high-trust sources |
| Vector index | Shard hot spot | p95 latency spikes | Shard by source; rehearse failover; replicate read views |
| Lexical index | Cluster yellow/red | Degraded recall | Multi-AZ; frozen tier for old docs; circuit break to vector-only with warning |
| Retrieval | Filter-after-search | ACL leak in logs | Filter-first in engine; never log pre-filter hits |
| Retrieval | Slack noise | Joke answers | Channel blocklist; downrank; require min lexical score |
| Rerank | GPU pool saturated | Timeouts | Queue + smaller reranker fallback; cap candidates |
| LLM | Context overflow | Truncated policy | Token budgeter; summarize low-score chunks first |
| LLM | Hallucination | Wrong guidance | Citations required; verifier on key entities; human feedback loop |
| Ops | Reindex bad embed model | Company-wide wrong answers | Blue/green index; rollback pointer; eval gate before cutover |
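The connector row (backoff plus checkpoint cursors) is the mitigation most often hand-waved, so a sketch helps. Everything here is illustrative: `fetch_page` stands in for a source API call returning `(items, next_cursor)`, `RateLimited` is a hypothetical exception, and the checkpoint is a plain dict rather than a durable store.

```python
import random
import time
from typing import Callable, List, Optional, Tuple

Page = Tuple[List[str], Optional[str]]  # (items, next_cursor)


class RateLimited(Exception):
    """Raised by the (hypothetical) source client on HTTP 429 or similar."""


def fetch_with_backoff(fetch_page: Callable[[Optional[str]], Page],
                       cursor: Optional[str],
                       max_attempts: int = 5,
                       base: float = 1.0,
                       cap: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> Page:
    """Retry with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch_page(cursor)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up; the caller alerts on ingest lag
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError("unreachable")


def sync(fetch_page: Callable[[Optional[str]], Page], checkpoint: dict, **kwargs) -> List[str]:
    """Resume from the saved cursor; advance it only after a page is handled."""
    cursor = checkpoint.get("cursor")
    collected: List[str] = []
    while True:
        items, next_cursor = fetch_with_backoff(fetch_page, cursor, **kwargs)
        collected.extend(items)
        if next_cursor is None:
            break
        checkpoint["cursor"] = cursor = next_cursor
    return collected
```

A crash mid-sync then resumes from the last committed cursor instead of restarting the crawl, which is why downstream writes need to be idempotent.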
Step 12 — Observability and operations
- Metrics: ingest lag per connector, chunks/min, embed queue depth, retrieval recall@k on gold set, rerank latency, TTFT, citation click-through.
- Tracing: one trace id per user query across retrieve → rerank → generate.
- Feedback: thumbs down stores query + chunk ids → weekly hard negatives for reranker fine-tune.
- Runbooks: “OpenSearch red”, “embed backlog > 24h”, “Confluence ACL sync failed”.
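The recall@k metric on a gold set is simple enough to define precisely. In this sketch, the gold set is assumed to map each query to the set of chunk ids a human marked relevant, and `run` maps each query to the ranked ids the system returned; both shapes are illustrative.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        raise ValueError("gold set entry has no relevant chunks")
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)


def mean_recall_at_k(gold: Dict[str, Set[str]], run: Dict[str, List[str]], k: int) -> float:
    """Average recall@k over all gold queries; a missing query counts as zero."""
    return sum(recall_at_k(run.get(q, []), rel, k) for q, rel in gold.items()) / len(gold)


gold = {"pto policy": {"a", "d"}, "expense report": {"x"}}
run = {"pto policy": ["a", "b", "d"], "expense report": ["y", "x"]}
print(mean_recall_at_k(gold, run, k=2))  # 0.75
```

Tracking this number per source type, not just overall, makes it much easier to see when a new embedding model helps wiki pages but hurts Slack threads.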
Step 13 — How to walk through this in a design session
A crisp order that fits a 45-minute panel:
- 2 min — clarifiers + security/freshness.
- 3 min — napkin math and unified document model.
- 10 min — draw three planes: ingest, index, serve (diagram above).
- 10 min — deep-dive one source (usually Slack or ACL) and one read path.
- 8 min — failure table + mitigations.
- 5 min — eval, rollout (shadow mode → % traffic), cost knobs.
- 2 min — close with tradeoffs you would defer (fine-tune LLM, agent tools, multi-hop).
Step 14 — Goals → knobs
| Goal | Knob |
|---|---|
| Higher recall | Larger k, hybrid search, parent-child expansion |
| Lower latency | Smaller rerank pool; cache; pre-warm frequent queries |
| Lower cost | Cheaper embed model; fewer chunks per doc; tiered storage for old Slack |
| Safer answers | Stricter abstain threshold; mandatory citations; blocklist sources |
| Fresher Slack | Dedicated fast lane consumer; smaller batches |
Step 15 — Close the loop
Whiteboard: three planes, ACL on the retrieval arrow, dual indexes feeding one chunk id.
Out loud: one Confluence page update from webhook → re-chunk → re-embed → visible answer change.
If pushed on MVP: Phase 1 PDF+Confluence hybrid RAG; Phase 2 Slack with blocklists; Phase 3 DB facts; always ACL and audit from day one.
The one line to remember
At ten million documents, production RAG is a permissioned search platform with an LLM on top—ingest and ACL correctness are the product; embeddings and reranking are how you make answers readable, grounded, and fast enough to trust.