sharpbyte.dev

Production-grade RAG for 10 million internal documents

You are designing a company-wide RAG system from scratch: about 10 million items across PDFs, Confluence, Slack, and database records. The panel wants the full path—ingest → chunk → embed → store → retrieve → rerank → generate—and where it breaks in production.

Scenario

Design a production-grade RAG pipeline from scratch.

Your company has 10 million internal documents across PDFs, Confluence pages, Slack threads, and database records. Design the full pipeline—ingestion, chunking, embedding, storage, retrieval, re-ranking, and generation.

Where are the failure points, and how do you handle them?

What you should be able to do after reading:

Step 0 — How to open the session (first five minutes)

Strong candidates do not draw microservices immediately. They frame the problem:

  1. Confirm users and risk. All employees? Support bots only? Are Slack DMs in scope or public channels only?
  2. Confirm freshness. “Good enough” lag: Confluence in 15 minutes, Slack in 1 minute, warehouse tables hourly?
  3. Confirm security. Retrieval must enforce document ACLs—no “search then filter” that leaks titles in logs.
  4. Confirm SLAs. p95 query latency (e.g. 3–8s end-to-end), availability, and budget for GPU rerank + LLM.
  5. Then present the macro pipeline and napkin math so scale feels grounded.

Step 1 — Clarifying questions (show senior judgment)

| Topic | Questions you ask | Why it matters |
| --- | --- | --- |
| Scope | Read-only Q&A or also actions (create ticket, post reply)? | Changes orchestration and audit requirements |
| Slack | Threads vs channels; retention; exclude bots and memes? | Noise dominates retrieval if you ingest everything |
| Database rows | Expose raw rows or only approved “fact sheets”? | Schema changes break naive chunking |
| Duplicates | Same policy PDF in SharePoint and Confluence? | Dedup and canonical URL strategy |
| Compliance | PII redaction at ingest? EU residency? | Drives region pinning and delete propagation |
| Eval | Human labels per domain or synthetic only? | You cannot tune rerank without domain gold sets |

Step 2 — The sixty-second answer

I would build one ingestion plane that normalizes every source into a versioned Document + Chunk model in Postgres, fans out to OpenSearch (lexical) and a vector index with the same chunk ids and ACL metadata, runs hybrid retrieval with ACL filters applied inside the index, reranks with a cross-encoder, and generates with mandatory citations and query/audit logging. Ingest is async and idempotent; embeddings are versioned so we can reindex without downtime.

Failure handling is explicit: connector circuit breakers, dead-letter queues, poison-document quarantine, embedding backlog autoscaling, stale-content markers in the UI, and weekly eval gates before rolling new embed models.
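The eval gate and rollback idea can be sketched as a pointer swap guarded by a metric check. This is a minimal sketch, not a prescribed implementation: `IndexPointer`, `hit_rate_at_k`, `maybe_cutover`, the index names, and the 0.01 regression tolerance are all hypothetical, and a real gate would likely use per-domain gold sets and more than one metric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexPointer:
    """Hypothetical alias record: which index version serves reads."""
    live: str
    previous: Optional[str] = None


def hit_rate_at_k(retrieved: dict, gold: dict, k: int) -> float:
    """Fraction of gold queries with at least one relevant chunk in the top k."""
    hits = 0
    for query, relevant in gold.items():
        if relevant & set(retrieved.get(query, [])[:k]):
            hits += 1
    return hits / len(gold)


def maybe_cutover(pointer: IndexPointer, candidate: str,
                  candidate_score: float, live_score: float,
                  max_regression: float = 0.01) -> bool:
    """Swap the live pointer only if the candidate does not regress too far."""
    if candidate_score + max_regression < live_score:
        return False  # gate failed: keep serving the current index
    pointer.previous, pointer.live = pointer.live, candidate
    return True


def rollback(pointer: IndexPointer) -> None:
    """Point reads back at the previous index, if one was recorded."""
    if pointer.previous is not None:
        pointer.live, pointer.previous = pointer.previous, pointer.live


gold = {"q1": {"a"}, "q2": {"b"}}
live_results = {"q1": ["a", "x"], "q2": ["y", "b"]}
candidate_results = {"q1": ["a"], "q2": ["z", "y"]}

pointer = IndexPointer(live="chunks_v1")
swapped = maybe_cutover(
    pointer, "chunks_v2",
    candidate_score=hit_rate_at_k(candidate_results, gold, k=2),
    live_score=hit_rate_at_k(live_results, gold, k=2),
)
print(swapped, pointer.live)
```

Because both index versions stay online until the pointer moves, a failed gate costs nothing on the read path and a bad cutover is undone by one pointer swap.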

Step 3 — Requirements (functional and non-functional)

Functional

| Capability | Behavior |
| --- | --- |
| Unified search | One query bar across sources; filters by type, space, date, author |
| Grounded answers | LLM uses only retrieved chunks; citations with deep links back to source |
| Freshness | Near-real-time for Slack; minutes for wiki; scheduled for warehouse exports |
| Access control | User sees only chunks their identity can read in the source system |
| Deletes | Removed Confluence page or retracted Slack message disappears from index |
| Admin | Reindex, blocklist noisy spaces, sample queries for debugging |

Non-functional (what “production-grade” means here)

| Dimension | Target (example you can defend) |
| --- | --- |
| Scale | ~10M logical documents, ~80–150M chunks after chunking |
| Query volume | 2k–10k RAG queries/day, bursts at quarter-end |
| Latency | p95 < 6s: retrieve 400ms, rerank 1.5s, LLM TTFT < 2s |
| Availability | 99.9% on read path; ingest can lag with clear UI badge |
| Cost | Cap $/query via rerank pool size, cache, smaller generator for drafts |
| Audit | Log query, retrieved chunk ids, model version, user id (not raw secrets) |

Step 4 — Napkin math (ground the design)

Assume average mix after filtering noise:

Phrase that lands well: “Ten million documents is not one vector database problem—it is a data integration and permissions problem with search attached.”
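A rough way to put numbers behind the napkin math is sketched below. The chunk range (80–150M) and the 10k queries/day ceiling come from the requirements; the embedding dimension, float32 storage, 8 active hours, and the 5x burst factor are illustrative assumptions, not figures from this article.

```python
# Napkin math sketch. EMBED_DIM, BYTES_PER_FLOAT, active_hours and burst_factor
# are assumptions chosen for illustration; swap in your own numbers.
EMBED_DIM = 1024        # assumption: embedding dimensionality
BYTES_PER_FLOAT = 4     # assumption: float32 vectors, no quantization


def raw_vector_bytes(num_chunks: int, dim: int = EMBED_DIM) -> int:
    """Bytes for raw vectors only (no index overhead, no replicas)."""
    return num_chunks * dim * BYTES_PER_FLOAT


def peak_qps(queries_per_day: int, burst_factor: float,
             active_hours: float = 8.0) -> float:
    """Average QPS over working hours, scaled by a burst multiplier."""
    return queries_per_day / (active_hours * 3600) * burst_factor


for chunks in (80_000_000, 150_000_000):
    gb = raw_vector_bytes(chunks) / 1e9
    print(f"{chunks / 1e6:.0f}M chunks -> {gb:.1f} GB of raw float32 vectors")

print(f"peak ~{peak_qps(10_000, burst_factor=5):.2f} QPS")
```

Under these assumptions the raw vectors alone land in the hundreds of gigabytes while the query rate stays in single digits per second, which is consistent with treating ingest, index size, and permissions as the hard part rather than read throughput.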

Step 5 — Unified document model (glue across four sources)

Every source connector writes the same envelope:

Document {
  doc_id, source: pdf|confluence|slack|db,
  source_uri, canonical_id,
  title, author, created_at, updated_at,
  acl: { allow_groups[], allow_users[], deny[] },
  content_hash, language, tags[],
  status: active|deleted|quarantined
}
Chunk {
  chunk_id, doc_id, seq,
  text, token_count,
  struct: { page?, table_id?, thread_ts?, row_pk? },
  embed_model_version
}
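One way to make the envelope concrete is a pair of typed records plus a deny-by-default ACL check. This is a minimal sketch with a trimmed field list: the `permits` semantics (a `deny` entry may name a user or a group, and deny always wins) and the `make_chunk_id` helper are assumptions for illustration, not part of the schema above.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    allow_groups: frozenset = frozenset()
    allow_users: frozenset = frozenset()
    deny: frozenset = frozenset()   # assumption: may hold user or group names

    def permits(self, user: str, groups: set) -> bool:
        # Deny wins; with no matching allow entry the answer is no access.
        if user in self.deny or self.deny & groups:
            return False
        return user in self.allow_users or bool(self.allow_groups & groups)


@dataclass
class Document:
    doc_id: str
    source: str            # "pdf" | "confluence" | "slack" | "db"
    source_uri: str
    title: str
    acl: Acl
    content_hash: str
    status: str = "active"  # "active" | "deleted" | "quarantined"


@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    seq: int
    text: str
    embed_model_version: str


def make_chunk_id(doc_id: str, seq: int) -> str:
    """Hypothetical deterministic id so both indexes agree on the same key."""
    return hashlib.sha256(f"{doc_id}:{seq}".encode("utf-8")).hexdigest()[:16]


acl = Acl(allow_groups=frozenset({"eng"}), deny=frozenset({"contractors"}))
print(acl.permits("alice", {"eng"}))                 # allowed via group
print(acl.permits("bob", {"eng", "contractors"}))    # denied: deny wins
```

Keeping the chunk id independent of the embedding model version means a reindex with a new model can reuse the same ids in both the lexical and vector indexes.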

Postgres (or similar) is the system of record. Search indexes are disposable projections you rebuild from Postgres + object storage blobs.
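Treating the indexes as projections implies that ingest decisions are made against the system of record. A minimal sketch of an idempotent ingest decision follows; the in-memory `records` dict stands in for the Postgres table, and `plan_ingest` and its return values are hypothetical names chosen for illustration.

```python
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingest(records: dict, doc_id: str, text: Optional[str],
                embed_model_version: str) -> str:
    """Decide what to do with an incoming event: 'skip', 'reindex' or 'delete'.

    Replaying the same event yields 'skip', so retries and duplicate
    deliveries from the queue are harmless.
    """
    existing = records.get(doc_id)

    if text is None:  # the source reports the document as removed
        if existing is None or existing["status"] == "deleted":
            return "skip"
        existing["status"] = "deleted"
        return "delete"

    new_hash = content_hash(text)
    unchanged = (
        existing is not None
        and existing["status"] == "active"
        and existing["content_hash"] == new_hash
        and existing["embed_model_version"] == embed_model_version
    )
    if unchanged:
        return "skip"

    # New content or a new embedding model: record it and re-project to indexes.
    records[doc_id] = {
        "content_hash": new_hash,
        "embed_model_version": embed_model_version,
        "status": "active",
    }
    return "reindex"


records = {}
print(plan_ingest(records, "d1", "hello", "v1"))  # first sight of the doc
print(plan_ingest(records, "d1", "hello", "v1"))  # replayed event
```

The same check covers model upgrades: bumping `embed_model_version` turns every unchanged document into a `reindex` without any special-case code.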

Step 6 — End-to-end architecture

flowchart TB
  subgraph sources [Sources]
    PDF[PDF / SharePoint]
    CONF[Confluence]
    SL[Slack]
    DB[(Warehouse / OLTP)]
  end
  subgraph ingest [Ingestion plane]
    CONN[Connectors + CDC]
    NORM[Normalize + ACL map]
    PARSE[Parse / OCR / thread assembly]
    CHUNK[Chunker]
    QUEUE[Kafka ingest topics]
  end
  subgraph index [Indexing plane]
    LEX[(OpenSearch)]
    VEC[(Vector DB)]
    BLOB[(Object storage raw)]
  end
  subgraph serve [Serving plane]
    API[RAG API]
    RET[Hybrid retrieve + ACL]
    RER[Reranker]
    LLM[LLM gateway]
  end
  PDF --> CONN
  CONF --> CONN
  SL --> CONN
  DB --> CONN
  CONN --> NORM --> PARSE --> CHUNK --> QUEUE
  CHUNK --> BLOB
  QUEUE --> LEX
  QUEUE --> VEC
  API --> RET
  RET --> LEX
  RET --> VEC
  RET --> RER --> LLM
    

Step 7 — Ingestion by source (where designs differ)

PDFs

Confluence

Slack

Database records

Step 8 — Chunking (one size does not fit all)

| Source | Strategy | Typical size |
| --- | --- | --- |
| PDF / long wiki | Structure-aware: headings + paragraphs; tables as row or mini-table chunks | 300–600 tokens |
| Confluence | Respect heading hierarchy; keep code blocks intact | 250–500 tokens |
| Slack thread | Whole thread up to cap; split only if > 2k tokens with overlap | 1 thread = 1–N chunks |
| DB fact sheet | One row or small group per chunk | 50–150 tokens |

Use parent–child linking: index small child chunks for retrieval, and after a hit expand the LLM context with the auto-generated summary of the chunk's parent.
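A structure-aware chunker with parent links can be sketched as follows. This is a simplified illustration under stated assumptions: headings are lines starting with `#` separated from body text by blank lines, whitespace-separated words stand in for model tokens, and `Child` and `chunk_by_headings` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Child:
    parent_id: int   # index of the heading section this chunk came from
    heading: str
    text: str


def approx_tokens(text: str) -> int:
    # Assumption: word count approximates the tokenizer's count.
    return len(text.split())


def chunk_by_headings(doc: str, max_tokens: int = 400) -> list:
    """Pack whole paragraphs into chunks without crossing a heading boundary."""
    sections = [("", [])]  # (heading, paragraphs); first entry holds any preamble
    for block in doc.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            sections.append((block.lstrip("# ").strip(), []))
        else:
            sections[-1][1].append(block)

    children = []
    for parent_id, (heading, paragraphs) in enumerate(sections):
        buffer, size = [], 0
        for para in paragraphs:
            n = approx_tokens(para)
            # Flush before overflowing; an oversized paragraph stays whole.
            if buffer and size + n > max_tokens:
                children.append(Child(parent_id, heading, "\n\n".join(buffer)))
                buffer, size = [], 0
            buffer.append(para)
            size += n
        if buffer:
            children.append(Child(parent_id, heading, "\n\n".join(buffer)))
    return children


sample = "# A\n\none two three\n\nfour five six\n\n# B\n\nseven"
for child in chunk_by_headings(sample, max_tokens=4):
    print(child.parent_id, child.heading, repr(child.text))
```

The `parent_id` on each child is what lets the serving path look up the parent section (or its summary) after a child chunk is retrieved.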

Step 9 — Embedding and storage

Embedding pipeline

Why also OpenSearch

Slack handles, ticket ids, error codes, internal acronyms, and exact policy numbers still need lexical match. Production RAG at this scale is hybrid or recall suffers.

Step 10 — Retrieval, re-ranking, generation

Retrieval

  1. Parse query; apply optional filters (source, date, space).
  2. Run BM25 and vector search in parallel with ACL filter pushed into both (filter-first, never score-then-filter).
  3. Fuse with RRF; take top 50–80 candidates.
  4. Apply per-source caps so Slack does not swamp PDF policy (e.g. at most 30% Slack hits).
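The fusion and capping steps above can be sketched as below. The sketch assumes both input rankings were already ACL-filtered inside the engines, so nothing here sees a chunk the user cannot read. The constant `k=60` is a commonly used RRF default rather than a value from this article, and `rrf_fuse` and `apply_source_caps` are hypothetical helper names.

```python
from collections import defaultdict


def rrf_fuse(rankings: list, k: int = 60) -> list:
    """Reciprocal rank fusion: score(chunk) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Ties broken by chunk id so the order is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


def apply_source_caps(ranked: list, source_of: dict,
                      top_n: int, caps: dict) -> list:
    """Walk the fused order, skipping chunks whose source already hit its cap."""
    counts = defaultdict(int)
    kept = []
    for chunk_id in ranked:
        src = source_of[chunk_id]
        limit = int(top_n * caps[src]) if src in caps else top_n
        if counts[src] >= limit:
            continue
        kept.append(chunk_id)
        counts[src] += 1
        if len(kept) == top_n:
            break
    return kept


bm25_hits = ["s1", "p1", "s2", "s3"]     # lexical ranking (ACL-filtered)
vector_hits = ["s1", "s2", "p2", "p1"]   # vector ranking (ACL-filtered)
source_of = {"s1": "slack", "s2": "slack", "s3": "slack",
             "p1": "pdf", "p2": "pdf"}

fused = rrf_fuse([bm25_hits, vector_hits])
print(fused)
print(apply_source_caps(fused, source_of, top_n=4, caps={"slack": 0.25}))
```

RRF only needs ranks, not comparable scores, which is why it is a convenient way to combine BM25 and vector results without score normalization.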

Re-ranking

Run a cross-encoder on the top 40 candidates that remain after fusion and per-source caps, and keep the best 12–16 chunks. Cache rerank results for repeated enterprise queries (“PTO policy”, “expense report”).
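A rerank cache can be sketched as a per-(query, chunk) score store in front of the scorer. In this sketch `overlap_score` is a toy stand-in for a real cross-encoder, `CachedReranker` is a hypothetical name, and the cache is an unbounded dict; a production cache would need eviction and invalidation when chunk text changes.

```python
from typing import Callable


def overlap_score(query: str, text: str) -> float:
    """Toy scorer standing in for a cross-encoder: share of query words found."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / max(1, len(q))


class CachedReranker:
    def __init__(self, score_fn: Callable[[str, str], float], model_version: str):
        self.score_fn = score_fn
        self.model_version = model_version
        self.cache = {}   # (model_version, normalized query, chunk_id) -> score
        self.misses = 0

    def _key(self, query: str, chunk_id: str) -> tuple:
        normalized = " ".join(query.lower().split())
        return (self.model_version, normalized, chunk_id)

    def rerank(self, query: str, candidates: dict, keep: int = 12) -> list:
        """candidates maps chunk_id -> text; returns the best `keep` chunk ids."""
        scored = []
        for chunk_id, text in candidates.items():
            key = self._key(query, chunk_id)
            if key not in self.cache:
                self.cache[key] = self.score_fn(query, text)
                self.misses += 1
            scored.append((self.cache[key], chunk_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [chunk_id for _, chunk_id in scored[:keep]]


reranker = CachedReranker(overlap_score, model_version="v1")
candidates = {
    "a": "pto policy for employees",
    "b": "expense report template",
    "c": "pto accrual",
}
print(reranker.rerank("PTO policy", candidates, keep=2))
print(reranker.rerank("pto   POLICY", candidates, keep=2), reranker.misses)
```

Caching scores per chunk rather than whole result lists means the cache never widens what a user can see: the ACL-filtered candidate set still decides which chunks are scored at all.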

Generation

Step 11 — Failure points and mitigations (core of the question)

| Stage | Failure | What breaks | Mitigation |
| --- | --- | --- | --- |
| Connector | API rate limit / token expiry | Stale wiki; silent gaps | Exponential backoff; checkpoint cursors; alert on lag > SLA |
| Parse | OCR / PDF layout fail | Missing tables | Quarantine; human QA sample; fallback “text-only” banner |
| ACL map | Wrong group mapping | Leak or over-deny | Nightly ACL reconcile; deny-by-default; security review on mapper |
| Chunking | Split mid-table / mid-code | Wrong facts | Structure-aware chunker; tests on golden pages |
| Embed queue | Backlog after outage | Users see old content | Autoscale workers; priority lane for high-trust sources |
| Vector index | Shard hot spot | p95 latency spikes | Shard by source; rehearse failover; replicate read views |
| Lexical index | Cluster yellow/red | Degraded recall | Multi-AZ; frozen tier for old docs; circuit break to vector-only with warning |
| Retrieval | Filter-after-search | ACL leak in logs | Filter-first in engine; never log pre-filter hits |
| Retrieval | Slack noise | Joke answers | Channel blocklist; downrank; require min lexical score |
| Rerank | GPU pool saturated | Timeouts | Queue + smaller reranker fallback; cap candidates |
| LLM | Context overflow | Truncated policy | Token budgeter; summarize low-score chunks first |
| LLM | Hallucination | Wrong guidance | Citations required; verifier on key entities; human feedback loop |
| Ops | Reindex bad embed model | Company-wide wrong answers | Blue/green index; rollback pointer; eval gate before cutover |
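The connector mitigation (exponential backoff plus checkpoint cursors) can be sketched as a sync loop. This is an illustration under assumptions: `RateLimited`, `fetch_page`, `handle`, `load_cursor`, and `save_cursor` are hypothetical callables, an empty page is taken to mean "caught up", and the full-jitter variant of backoff is one common choice rather than the article's prescribed one.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a connector raises when the source API throttles."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Full-jitter backoff: random delay up to min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def sync_with_checkpoint(fetch_page, handle, load_cursor, save_cursor,
                         max_attempts: int = 5, sleep=time.sleep) -> int:
    """Pull pages from the saved cursor; returns the number of pages handled."""
    cursor = load_cursor()
    pages = 0
    attempt = 0
    while True:
        try:
            items, next_cursor = fetch_page(cursor)
        except RateLimited:
            attempt += 1
            if attempt >= max_attempts:
                raise  # surface to the caller so it can alert or trip a breaker
            sleep(backoff_delay(attempt))
            continue
        attempt = 0
        if not items:
            return pages  # caught up; cursor already points at the resume position
        handle(items)
        save_cursor(next_cursor)  # checkpoint only after the page was handled
        cursor = next_cursor
        pages += 1


# Usage with in-memory fakes: page 1 is throttled once, then succeeds.
state = {"cursor": 0, "fail_once": True}
seen = []


def fake_fetch(cursor):
    if cursor == 1 and state["fail_once"]:
        state["fail_once"] = False
        raise RateLimited()
    data = {0: (["a"], 1), 1: (["b"], 2)}
    return data.get(cursor, ([], cursor))


handled = sync_with_checkpoint(
    fake_fetch, seen.extend,
    load_cursor=lambda: state["cursor"],
    save_cursor=lambda c: state.__setitem__("cursor", c),
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(handled, seen, state["cursor"])
```

Saving the cursor after handling gives at-least-once delivery, so the downstream ingest step has to tolerate replays of the last page after a crash.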

Step 12 — Observability and operations

Step 13 — How to walk through this in a design session

A crisp order that fits a 45-minute panel:

  1. 2 min — clarifiers + security/freshness.
  2. 3 min — napkin math and unified document model.
  3. 10 min — draw three planes: ingest, index, serve (diagram above).
  4. 10 min — deep-dive one source (usually Slack or ACL) and one read path.
  5. 8 min — failure table + mitigations.
  6. 5 min — eval, rollout (shadow mode → % traffic), cost knobs.
  7. 2 min — close with tradeoffs you would defer (fine-tune LLM, agent tools, multi-hop).

Step 14 — Goals → knobs

| Goal | Knob |
| --- | --- |
| Higher recall | Larger k, hybrid search, parent-child expansion |
| Lower latency | Smaller rerank pool; cache; pre-warm frequent queries |
| Lower cost | Cheaper embed model; fewer chunks per doc; tiered storage for old Slack |
| Safer answers | Stricter abstain threshold; mandatory citations; blocklist sources |
| Fresher Slack | Dedicated fast lane consumer; smaller batches |

Step 15 — Close the loop

Whiteboard: three planes, ACL on the retrieval arrow, dual indexes feeding one chunk id.

Out loud: one Confluence page update from webhook → re-chunk → re-embed → visible answer change.

If pushed on MVP: Phase 1 PDF+Confluence hybrid RAG; Phase 2 Slack with blocklists; Phase 3 DB facts; always ACL and audit from day one.

The one line to remember

At ten million documents, production RAG is a permissioned search platform with an LLM on top—ingest and ACL correctness are the product; embeddings and reranking are how you make answers readable, grounded, and fast enough to trust.