Production-grade RAG for 10 million internal documents

You are designing a company-wide RAG system from scratch: about 10 million items across PDFs, Confluence, Slack, and database records. The panel wants the full path—ingest → chunk → embed → store → retrieve → rerank → generate—and where it breaks in production.

Scenario

Design a production-grade RAG pipeline from scratch.

Your company has 10 million internal documents across PDFs, Confluence pages, Slack threads, and database records. Design the full pipeline—ingestion, chunking, embedding, storage, retrieval, re-ranking, and generation.

Where are the failure points, and how do you handle them?

What you should be able to do after reading:

Open with clarifying questions, then a one-minute architecture story before diving into boxes.
Describe a unified document model and ACL-aware indexes—not four siloed search tools.
Walk each stage with tradeoffs: batch vs streaming ingest, hybrid retrieval, rerank budget, citation-first generation.
Name failure points with concrete mitigations (staleness, poison docs, permission leaks, index drift).

Step 0 — How to open the session (first five minutes)

Strong candidates do not draw microservices immediately. They frame the problem:

Confirm users and risk. All employees? Support bots only? Are Slack DMs in scope or public channels only?
Confirm freshness. “Good enough” lag: Confluence in 15 minutes, Slack in 1 minute, warehouse tables hourly?
Confirm security. Retrieval must enforce document ACLs—no “search then filter” that leaks titles in logs.
Confirm SLAs. p95 query latency (e.g. 3–8s end-to-end), availability, and budget for GPU rerank + LLM.
Then present the macro pipeline and napkin math so scale feels grounded.

Step 1 — Clarifying questions (show senior judgment)

Topic	Questions you ask	Why it matters
Scope	Read-only Q&A or also actions (create ticket, post reply)?	Changes orchestration and audit requirements
Slack	Threads vs channels; retention; exclude bots and memes?	Noise dominates retrieval if you ingest everything
Database rows	Expose raw rows or only approved “fact sheets”?	Schema changes break naive chunking
Duplicates	Same policy PDF in SharePoint and Confluence?	Dedup and canonical URL strategy
Compliance	PII redaction at ingest? EU residency?	Drives region pinning and delete propagation
Eval	Human labels per domain or synthetic only?	You cannot tune rerank without domain gold sets

Step 2 — The sixty-second answer

I would build one ingestion plane that normalizes every source into a versioned Document + Chunk model in Postgres and fans out to OpenSearch (lexical) and a vector index that share the same chunk ids and ACL metadata. The read path runs hybrid retrieval with ACL filters applied inside the index, reranks with a cross-encoder, and generates with mandatory citations and query/audit logging. Ingest is async and idempotent; embeddings are versioned so we can reindex without downtime.

Failure handling is explicit: connector circuit breakers, dead-letter queues, poison-document quarantine, embedding backlog autoscaling, stale-content markers in the UI, and weekly eval gates before rolling new embed models.

Step 3 — Requirements (functional and non-functional)

Functional

Capability	Behavior
Unified search	One query bar across sources; filters by type, space, date, author
Grounded answers	LLM uses only retrieved chunks; citations with deep links back to source
Freshness	Near-real-time for Slack; minutes for wiki; scheduled for warehouse exports
Access control	User sees only chunks their identity can read in the source system
Deletes	Removed Confluence page or retracted Slack message disappears from index
Admin	Reindex, blocklist noisy spaces, sample queries for debugging

Non-functional (what “production-grade” means here)

Dimension	Target (example you can defend)
Scale	~10M logical documents, ~80–150M chunks after chunking
Query volume	2k–10k RAG queries/day, bursts at quarter-end
Latency	p95 < 6s: retrieve 400ms, rerank 1.5s, LLM TTFT < 2s
Availability	99.9% on read path; ingest can lag with clear UI badge
Cost	Cap $/query via rerank pool size, cache, smaller generator for drafts
Audit	Log query, retrieved chunk ids, model version, user id (not raw secrets)

Step 4 — Napkin math (ground the design)

Assume average mix after filtering noise:

10M logical docs → ~120M chunks, about 12 chunks per document on average (long PDFs and wikis split into many chunks; Slack threads vary).
Embedding 120M chunks × 768-dim × 4 bytes ≈ 370 GB raw vectors (about 343 GiB; plus index overhead → plan 0.5–1 TB vector tier).
OpenSearch lexical index often 2–3× source text size → hundreds of GB; shard by source_type.
Ingest throughput: initial backfill 10M docs in 2–4 weeks with 50–100 parallel embed workers; steady state millions of chunk updates/day from Slack alone.
One-time backfill embedding cost: roughly thousands to tens of thousands of dollars on managed embed APIs, depending on per-token pricing and average chunk length; weigh that against self-hosted GPUs for steady state.
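
The arithmetic above can be checked in a few lines. This is a sketch only: the chunk count, dimensionality, and float width come from the list above, while the per-worker throughput of 2 chunks/s is a hypothetical figure chosen for illustration.

```python
# Napkin math for the vector tier; every input is an assumption.
CHUNKS = 120_000_000      # ~120M chunks from ~10M logical docs
DIMS = 768                # embedding dimensionality
BYTES_PER_FLOAT = 4       # float32

raw_bytes = CHUNKS * DIMS * BYTES_PER_FLOAT
raw_gb = raw_bytes / 1e9          # decimal gigabytes
raw_gib = raw_bytes / 2**30       # binary gibibytes


def backfill_days(chunks: int, workers: int, chunks_per_sec_per_worker: float) -> float:
    """Days to embed `chunks` given a hypothetical per-worker throughput."""
    return chunks / (workers * chunks_per_sec_per_worker) / 86_400


print(f"raw vectors: {raw_gb:.0f} GB ({raw_gib:.0f} GiB)")
print(f"backfill, 50 workers x 2 chunks/s: {backfill_days(CHUNKS, 50, 2.0):.1f} days")
```

Under these assumptions the backfill lands at roughly two weeks, the low end of the 2–4 week range; parse and OCR time, retries, and rate limits would push it higher.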

Phrase that lands well: “Ten million documents is not one vector database problem—it is a data integration and permissions problem with search attached.”

Step 5 — Unified document model (glue across four sources)

Every source connector writes the same envelope:

Document {
  doc_id, source: pdf|confluence|slack|db,
  source_uri, canonical_id,
  title, author, created_at, updated_at,
  acl: { allow_groups[], allow_users[], deny[] },
  content_hash, language, tags[],
  status: active|deleted|quarantined
}
Chunk {
  chunk_id, doc_id, seq,
  text, token_count,
  struct: { page?, table_id?, thread_ts?, row_pk? },
  embed_model_version
}

Postgres (or similar) is the system of record. Search indexes are disposable projections you rebuild from Postgres + object storage blobs.
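
One way to express the envelope in code is a pair of dataclasses plus a deny-by-default ACL check. This is a minimal sketch, not the definitive schema: it keeps only a subset of the fields, and it assumes the deny list may contain either user ids or group names, which the envelope above does not specify.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ACL:
    allow_groups: frozenset = frozenset()
    allow_users: frozenset = frozenset()
    deny: frozenset = frozenset()   # assumption: holds user ids and/or group names

    def permits(self, user_id: str, groups: set) -> bool:
        """Deny wins; otherwise the user needs an explicit allow (deny-by-default)."""
        principals = {user_id} | set(groups)
        if principals & self.deny:
            return False
        return user_id in self.allow_users or bool(set(groups) & self.allow_groups)


@dataclass
class Document:
    doc_id: str
    source: str              # "pdf" | "confluence" | "slack" | "db"
    source_uri: str
    canonical_id: str
    title: str
    acl: ACL
    content_hash: str
    status: str = "active"   # "active" | "deleted" | "quarantined"


@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    seq: int
    text: str
    token_count: int
    embed_model_version: str
    struct: dict = field(default_factory=dict)   # page, table_id, thread_ts, row_pk
```

In the serving path this check would normally live inside the index as a filter; the `permits` method here is mainly useful for tests and for reconciling ACLs offline.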

Step 6 — End-to-end architecture

flowchart TB
  subgraph sources [Sources]
    PDF[PDF / SharePoint]
    CONF[Confluence]
    SL[Slack]
    DB[(Warehouse / OLTP)]
  end
  subgraph ingest [Ingestion plane]
    CONN[Connectors + CDC]
    NORM[Normalize + ACL map]
    PARSE[Parse / OCR / thread assembly]
    CHUNK[Chunker]
    QUEUE[Kafka ingest topics]
  end
  subgraph index [Indexing plane]
    LEX[(OpenSearch)]
    VEC[(Vector DB)]
    BLOB[(Object storage raw)]
  end
  subgraph serve [Serving plane]
    API[RAG API]
    RET[Hybrid retrieve + ACL]
    RER[Reranker]
    LLM[LLM gateway]
  end
  PDF --> CONN
  CONF --> CONN
  SL --> CONN
  DB --> CONN
  CONN --> NORM --> PARSE --> CHUNK --> QUEUE
  CHUNK --> BLOB
  QUEUE --> LEX
  QUEUE --> VEC
  API --> RET
  RET --> LEX
  RET --> VEC
  RET --> RER --> LLM

Step 7 — Ingestion by source (where designs differ)

PDFs

Connector watches object storage / ECM; emits content_hash on change.
Layout-aware parse (tables, headers); OCR queue for scans with confidence score.
Failure: OCR garbage → quarantine bucket; do not index until confidence > threshold.
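
The hash check and quarantine rule above could be combined into one routing function. This is a sketch under stated assumptions: `route_pdf`, the `seen_hashes` dict, and the 0.85 threshold are all hypothetical, and a real pipeline would persist hashes in the system of record rather than in memory.

```python
import hashlib


def route_pdf(doc_id, raw_bytes, ocr_confidence, seen_hashes, threshold=0.85):
    """Return 'skip', 'quarantine', or 'index' for one PDF.

    ocr_confidence is None for digital-native PDFs that needed no OCR.
    The 0.85 threshold is an illustrative assumption, not a recommended value.
    """
    content_hash = hashlib.sha256(raw_bytes).hexdigest()
    if seen_hashes.get(doc_id) == content_hash:
        return "skip"                       # unchanged bytes: nothing to redo
    seen_hashes[doc_id] = content_hash
    if ocr_confidence is not None and ocr_confidence <= threshold:
        return "quarantine"                 # hold out of the index for review
    return "index"
```

Recording the hash even for quarantined files avoids re-running OCR on the same bad bytes at every poll.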

Confluence

Poll REST or webhook on page_updated; store space key and page version.
Strip macros; expand includes once to avoid duplicate chunks.
Map Confluence restrictions → acl groups; re-sync ACL nightly even if body unchanged.
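Failure: permission drift → the index still serves a chunk after the source revokes access, and the user only hits “access denied” at the source link. The ACL re-sync should evict the chunk from their index view before that happens.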

Slack

Ingest threads as one document when possible (parent + replies), not every message as isolated chunk.
Exclude DMs unless product explicitly allows; drop bot spam via allowlist.
High churn: keep one stable doc id per thread and update it incrementally as replies arrive; tombstone on message_deleted events.
Failure: toxic/noisy channels dominate retrieval → admin blocklist + downrank source=slack by default.
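
Thread assembly along these lines might look like the sketch below. The message shape is an assumption loosely modeled on Slack events (`ts`, `thread_ts`, `bot_id`), and the `deleted` flag stands in for whatever tombstone the connector records when it sees a delete event.

```python
from collections import defaultdict


def assemble_threads(messages, allowed_bots=frozenset()):
    """Group Slack-like message dicts into one document per thread.

    Assumed message shape (hypothetical):
    {"ts": str, "thread_ts": str | None, "user": str, "text": str,
     "bot_id": str | None, "deleted": bool}
    """
    threads = defaultdict(list)
    for m in messages:
        if m.get("deleted"):
            continue                            # tombstoned message: drop it
        bot = m.get("bot_id")
        if bot and bot not in allowed_bots:
            continue                            # bot is not on the allowlist
        root = m.get("thread_ts") or m["ts"]    # a parent message is its own root
        threads[root].append(m)

    docs = {}
    for root, msgs in threads.items():
        msgs.sort(key=lambda m: float(m["ts"]))  # chronological order
        docs[root] = "\n".join(f'{m["user"]}: {m["text"]}' for m in msgs)
    return docs
```

Keying each document by the thread root keeps the doc id stable as replies arrive, so an update replaces the thread document instead of adding a new one.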

Database records

Do not dump raw SQL rows as prose; emit templated fact sheets such as “Customer ACME | tier=Gold | ARR=$1.2M”.
CDC from warehouse; primary key in canonical_id for upserts.
Failure: schema migration changes column names → mapping layer versioned; old chunks deleted by doc_id sweep.
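
A templated fact sheet with a versioned mapping layer could be sketched as follows. The mapping structure, table name, and column names are hypothetical; the point is that a schema migration changes the mapping version, not the chunking code.

```python
import hashlib

# Hypothetical versioned mapping: column name -> label shown in the fact sheet.
MAPPING_V2 = {
    "version": 2,
    "table": "customers",
    "pk": "customer_id",
    "entity": "Customer",
    "title_col": "name",
    "fields": {"tier": "tier", "arr_usd": "ARR"},
}


def fact_sheet(row: dict, mapping: dict) -> dict:
    """Render one row as a templated fact sheet plus the keys needed for upserts."""
    parts = [f"{mapping['entity']} {row[mapping['title_col']]}"]
    for col, label in mapping["fields"].items():
        if row.get(col) is not None:          # skip nulls instead of printing "None"
            parts.append(f"{label}={row[col]}")
    text = " | ".join(parts)
    return {
        "canonical_id": f"db:{mapping['table']}:{row[mapping['pk']]}",
        "mapping_version": mapping["version"],
        "text": text,
        "content_hash": hashlib.sha256(text.encode()).hexdigest(),
    }
```

Storing `mapping_version` next to each chunk makes the cleanup sweep after a migration a simple query for chunks built with an older version.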

Step 8 — Chunking (one size does not fit all)

Source	Strategy	Typical size
PDF / long wiki	Structure-aware: headings + paragraphs; tables as row or mini-table chunks	300–600 tokens
Confluence	Respect heading hierarchy; keep code blocks intact	250–500 tokens
Slack thread	Whole thread up to cap; split only if > 2k tokens with overlap	1 thread = 1–N chunks
DB fact sheet	One row or small group per chunk	50–150 tokens

Use parent–child linking: small child chunks for retrieval, parent summary (auto-generated) for LLM context expansion after hit.
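
As an illustration of the heading-respecting strategy in the table, here is a minimal chunker that never lets a chunk cross a heading. It assumes headings start with `#` and approximates token counts by whitespace-separated words; a real pipeline would use the embedding model's tokenizer and handle tables and code blocks separately.

```python
def chunk_by_headings(lines, max_tokens=400):
    """Split wiki-like text into chunks that never cross a heading.

    Assumptions: headings start with '#', and token count is approximated
    by whitespace-separated words (a real tokenizer would differ).
    """
    chunks, current, heading, size = [], [], "", 0

    def flush():
        nonlocal current, size
        if current:
            chunks.append({"heading": heading, "text": "\n".join(current), "tokens": size})
        current, size = [], 0

    for line in lines:
        if line.startswith("#"):
            flush()                               # close the previous section
            heading = line.lstrip("# ").strip()
            continue
        n = len(line.split())
        if size + n > max_tokens and current:     # budget exceeded: start a new chunk
            flush()
        current.append(line)
        size += n
    flush()
    return chunks
```

Each chunk carries its heading, which can double as the parent key when expanding context after a hit.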

Step 9 — Embedding and storage

Embedding pipeline

Version everything: embed_model_version on each chunk; dual-write during model upgrades.
Batch workers consume Kafka; GPU pool for throughput; rate-limit per tenant.
Idempotent: same chunk_id + version overwrites, never duplicates.
Store raw text in object storage; vectors in vector DB (sharded by source_type or tenant).
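
The idempotency and dual-write rules can be shown with an in-memory stand-in. `VectorStore` and `fake_embed` are hypothetical placeholders, not a real vector DB client or embedding model; the only thing the sketch demonstrates is the `(chunk_id, version)` key.

```python
class VectorStore:
    """In-memory stand-in for a vector DB; the real client API is not specified here."""

    def __init__(self):
        self.rows = {}

    def upsert(self, chunk_id: str, version: str, vector: list) -> None:
        # Keyed by (chunk_id, version): a replay overwrites, never duplicates.
        self.rows[(chunk_id, version)] = vector


def process_batch(store, batch, embed, version):
    """Embed a batch of {'chunk_id', 'text'} messages and upsert the results."""
    vectors = embed([m["text"] for m in batch])
    for msg, vec in zip(batch, vectors):
        store.upsert(msg["chunk_id"], version, vec)


def fake_embed(texts):
    """Placeholder embedder (hypothetical): maps each text to a 2-dim vector."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]
```

Replaying the same Kafka batch leaves the row count unchanged, while running it again under a new version writes a parallel set of rows, which is what a dual-write during a model upgrade needs.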

Why also OpenSearch

Slack handles, ticket ids, error codes, internal acronyms, and exact policy numbers still need lexical match. Production RAG at this scale is hybrid or recall suffers.

Step 10 — Retrieval, re-ranking, generation

Retrieval

Parse query; apply optional filters (source, date, space).
Run BM25 and vector search in parallel with ACL filter pushed into both (filter-first, never score-then-filter).
Fuse with RRF; take top 50–80 candidates.
Per-source caps so Slack does not swamp PDF policy (e.g. max 30% Slack hits).
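
Fusion and the per-source cap can be sketched in plain Python. The inputs are assumed to be ranked chunk-id lists that each engine has already ACL-filtered; `k=60` is the conventional RRF constant, and the function names and the `source_of` lookup are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by id so output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


def cap_per_source(ranked, source_of, limit, max_share=0.3, capped_source="slack"):
    """Keep up to `limit` ids while holding `capped_source` to max_share of the limit."""
    max_capped = int(limit * max_share)
    out, used = [], 0
    for chunk_id in ranked:
        if len(out) == limit:
            break
        if source_of[chunk_id] == capped_source:
            if used >= max_capped:
                continue                    # over the cap: skip this hit
            used += 1
        out.append(chunk_id)
    return out
```

RRF only needs ranks, so it avoids normalizing BM25 scores against cosine similarities, which live on different scales.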

Re-ranking

Cross-encoder on the top 40 fused candidates (a tighter budget than the 50–80 retrieved) → keep 12–16 chunks. Cache rerank results for repeated enterprise queries (“PTO policy”, “expense report”).
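
A capped, cached reranker might be structured like this. `score_fn` is a hypothetical stand-in for a cross-encoder, and the cache key deliberately includes the candidate ids so that users with different ACL views never share a cached result.

```python
class CachedReranker:
    """Caps the candidate pool and caches results per (query, candidate set).

    `score_fn(query, text) -> float` is a hypothetical stand-in for a cross-encoder.
    """

    def __init__(self, score_fn, pool_size=40, keep=12):
        self.score_fn, self.pool_size, self.keep = score_fn, pool_size, keep
        self.cache, self.calls = {}, 0

    def rerank(self, query, candidates):
        pool = candidates[: self.pool_size]            # hard cap on model work
        key = (
            " ".join(query.lower().split()),           # normalized query text
            tuple(sorted(c["chunk_id"] for c in pool)),
        )
        if key not in self.cache:
            self.calls += len(pool)                    # count scoring calls
            scored = sorted(pool, key=lambda c: -self.score_fn(query, c["text"]))
            self.cache[key] = [c["chunk_id"] for c in scored[: self.keep]]
        return self.cache[key]
```

The dict here grows without bound; a production cache would need a TTL and invalidation whenever the underlying chunks are reindexed.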

Generation

Prompt with chunk text + citation ids ([1] → deep link).
Stream tokens; show citations inline in UI.
If retrieval score spread is flat, answer: “I found related material but am not confident”—abstain beats hallucination.
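
The abstain rule and the citation prompt could be sketched as below. The thresholds, the median-based spread test, and the prompt wording are illustrative assumptions; they would need tuning against a labeled set before anyone relied on them.

```python
def should_abstain(scores, min_top=0.5, min_gap=0.1):
    """Abstain when the best rerank score is weak or barely above the pool median.

    Thresholds are illustrative assumptions; tune them on a labeled set.
    """
    if not scores:
        return True
    ordered = sorted(scores, reverse=True)
    median = ordered[len(ordered) // 2]
    flat = len(ordered) > 1 and (ordered[0] - median) < min_gap
    return ordered[0] < min_top or flat


def build_prompt(question, chunks):
    """Number each chunk so the model can cite it as [n]; returns prompt and link map."""
    context = "\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, start=1))
    links = {i: c["source_uri"] for i, c in enumerate(chunks, start=1)}
    prompt = (
        "Answer using only the numbered sources and cite them as [n]. "
        "If they do not answer the question, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return prompt, links
```

The `links` map lets the UI turn each `[n]` in the streamed answer into a deep link without trusting the model to reproduce URLs.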

Step 11 — Failure points and mitigations (core of the question)

Stage	Failure	What breaks	Mitigation
Connector	API rate limit / token expiry	Stale wiki; silent gaps	Exponential backoff; checkpoint cursors; alert on lag > SLA
Parse	OCR / PDF layout fail	Missing tables	Quarantine; human QA sample; fallback “text-only” banner
ACL map	Wrong group mapping	Leak or over-deny	Nightly ACL reconcile; deny-by-default; security review on mapper
Chunking	Split mid-table / mid-code	Wrong facts	Structure-aware chunker; tests on golden pages
Embed queue	Backlog after outage	Users see old content	Autoscale workers; priority lane for high-trust sources
Vector index	Shard hot spot	p95 latency spikes	Shard by source; rehearse failover; replicate read views
Lexical index	Cluster yellow/red	Degraded recall	Multi-AZ; frozen tier for old docs; circuit break to vector-only with warning
Retrieval	Filter-after-search	ACL leak in logs	Filter-first in engine; never log pre-filter hits
Retrieval	Slack noise	Joke answers	Channel blocklist; downrank; require min lexical score
Rerank	GPU pool saturated	Timeouts	Queue + smaller reranker fallback; cap candidates
LLM	Context overflow	Truncated policy	Token budgeter; summarize low-score chunks first
LLM	Hallucination	Wrong guidance	Citations required; verifier on key entities; human feedback loop
Ops	Reindex bad embed model	Company-wide wrong answers	Blue/green index; rollback pointer; eval gate before cutover
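
The connector row's mitigation (backoff plus checkpoint cursors, with a dead-letter list for exhausted retries) can be sketched as follows. `fetch_page` is a hypothetical callable, and `sleep` is injectable so the example runs instantly; real use would pass `time.sleep`.

```python
import random


def backoff_delays(max_retries=5, base=1.0, cap=60.0, rng=random.random):
    """Exponential backoff with full jitter: delay_n in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(max_retries)]


def sync_connector(fetch_page, cursor, dead_letters, max_retries=3, sleep=lambda s: None):
    """Pull pages until exhausted, checkpointing the cursor after each success.

    `fetch_page(cursor)` is a hypothetical callable returning (items, next_cursor)
    and raising on rate limits or expired tokens.
    """
    items = []
    while cursor is not None:
        # One initial attempt plus max_retries retries; None marks the last try.
        for delay in backoff_delays(max_retries) + [None]:
            try:
                page, next_cursor = fetch_page(cursor)
                break
            except Exception as exc:
                if delay is None:              # retries exhausted: park and stop
                    dead_letters.append((cursor, repr(exc)))
                    return items, cursor       # resume later from this checkpoint
                sleep(delay)
        items.extend(page)
        cursor = next_cursor
    return items, None
```

Returning the unfinished cursor is what turns a silent gap into a measurable lag: the next run resumes from the checkpoint, and the dead-letter entry gives the alert something concrete to point at.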

Step 12 — Observability and operations

Metrics: ingest lag per connector, chunks/min, embed queue depth, retrieval recall@k on gold set, rerank latency, TTFT, citation click-through.
Tracing: one trace id per user query across retrieve → rerank → generate.
Feedback: thumbs down stores query + chunk ids → weekly hard negatives for reranker fine-tune.
Runbooks: “OpenSearch red”, “embed backlog > 24h”, “Confluence ACL sync failed”.
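
The recall@k metric and a simple eval gate could be computed like this. The gold-set format (a list of query and relevant-chunk-id pairs), the `search` callable, and the 0.8 threshold are assumptions for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def eval_gate(gold, search, k=10, threshold=0.8):
    """Mean recall@k over a gold set; returns (mean, passed)."""
    scores = [recall_at_k(search(q), rel, k) for q, rel in gold]
    mean = sum(scores) / len(scores)
    return mean, mean >= threshold
```

Running the same gate against the current and the candidate index gives a like-for-like number to compare before any cutover.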

Step 13 — How to walk through this in a design session

A crisp order that fits a 45-minute panel:

2 min — clarifiers + security/freshness.
3 min — napkin math and unified document model.
10 min — draw three planes: ingest, index, serve (diagram above).
10 min — deep-dive one source (usually Slack or ACL) and one read path.
8 min — failure table + mitigations.
5 min — eval, rollout (shadow mode → % traffic), cost knobs.
2 min — close with tradeoffs you would defer (fine-tune LLM, agent tools, multi-hop).

Step 14 — Goals → knobs

Goal	Knob
Higher recall	Larger k, hybrid search, parent-child expansion
Lower latency	Smaller rerank pool; cache; pre-warm frequent queries
Lower cost	Cheaper embed model; fewer chunks per doc; tiered storage for old Slack
Safer answers	Stricter abstain threshold; mandatory citations; blocklist sources
Fresher Slack	Dedicated fast lane consumer; smaller batches

Step 15 — Close the loop

Whiteboard: three planes, ACL on the retrieval arrow, both indexes keyed by the same chunk id.

Out loud: one Confluence page update from webhook → re-chunk → re-embed → visible answer change.

If pushed on MVP: Phase 1 PDF+Confluence hybrid RAG; Phase 2 Slack with blocklists; Phase 3 DB facts; always ACL and audit from day one.

The one line to remember

At ten million documents, production RAG is a permissioned search platform with an LLM on top—ingest and ACL correctness are the product; embeddings and reranking are how you make answers readable, grounded, and fast enough to trust.