sharpbyte.dev
Interview ready · Design · Section 5

Data pipelines & knowledge

Fifteen staff-depth scenarios on turning messy enterprise content into trustworthy model inputs: real-time ingestion with ordered jobs and honest failure classes; composite versioning and blue/green embedding migrations; 10M-doc scale patterns; dedupe that respects ACLs; connector fleets for Confluence/SharePoint/Notion/S3; pre-LLM ACL enforcement; PDF extraction with staging warehouses; DLQ and security for parsers; audit-grade lineage; staleness automation; non-disruptive vector metadata evolution; CDC with tombstones; data-quality layers (PII, boilerplate, spam); and knowledge graphs that complement—not replace—vector search.

Interview stance. Knowledge pipelines are ETL under chaos: vendors rotate OAuth, PDFs fight you, and Legal still owns ACLs. Panels want end-to-end SLOs from author edit → chunk in index → honest staleness in the UI—not a green checkmark on “batch finished.”

71. How would you design a real-time document ingestion pipeline that feeds a vector database for a RAG system?

Treat real-time ingestion as bounded-latency ETL, not magic: normalize upstream events (S3 ObjectCreated, Kafka doc topics, SaaS webhooks, SCIM-driven group changes) into a single ingestion job envelope carrying tenant_id, source revision, content hash, and ACL hints, so downstream stages never have to guess identity.
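The envelope can be sketched as a small value type; everything here (field names, the S3 helper) is illustrative, not a fixed schema:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass(frozen=True)
class IngestionJob:
    """Normalized envelope every source event is mapped into (illustrative)."""
    tenant_id: str
    source: str           # e.g. "s3", "sharepoint", "confluence"
    doc_id: str
    source_revision: str  # upstream version identifier
    content_sha256: str   # hash of normalized bytes, used for dedupe/skip
    acl_principals: tuple = field(default_factory=tuple)  # hints resolved at ingest

def envelope_from_s3_event(tenant_id: str, bucket: str, key: str,
                           version_id: str, body: bytes) -> IngestionJob:
    # Hash the bytes so byte-identical re-uploads can be skipped downstream.
    digest = hashlib.sha256(body).hexdigest()
    return IngestionJob(tenant_id=tenant_id, source="s3",
                        doc_id=f"{bucket}/{key}",
                        source_revision=version_id,
                        content_sha256=digest)
```

The point is that every connector emits the same shape, so dedupe, ACL resolution, and lineage never special-case the source.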

Queue discipline. Use partitioned streams per tenant or shard to preserve ordering for the same document id; cap depth and shed with priority classes so one noisy customer cannot starve SLA tenants. Dead-letter with replay after connector fixes.

Pipeline stages. Fetch → virus scan / size caps → parse (possibly multi-format) → dedupe → chunk → embed batch → upsert vectors + row metadata. Emit stage timestamps to a single “edit→searchable” lag histogram your dashboard actually reads.

Backpressure. When embed workers saturate, slow producers via consumer lag or push back on webhooks (retry with Retry-After). Prefer degrading batch breadth before dropping correctness (never silently skip ACL writes).

Failure semantics. Distinguish transient vendor 429/5xx from permanent auth revocation; surface connector health in an ops UI so support can say “SharePoint token expired” not “search feels weird.”

Stream ingestion

flowchart LR
  EV[Upload event] --> Q[Queue]
  Q --> W[Parse embed]
  W --> V[(Vector DB)]
  W --> META[(Metadata)]

72. How would you handle versioning of documents and embeddings in a production knowledge base?

Version is not a single integer—it is a composite key: (embedding_model_id, model_revision, chunking_policy_id, normalization_version) plus the source system’s document revision. Anything less and you cannot explain why Tuesday’s answer differs from Monday’s.

Immutable chunks. When content changes, write new chunk rows; tombstone or supersede old vectors by stable ids so retrievals never alias conflicting revisions. Keep a mapping table from doc_rev → active chunk set for fast rollback.
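A toy version of the doc_rev → active chunk set mapping, assuming supersession rather than mutation (real state would live in a table, not memory):

```python
from collections import defaultdict

class ChunkCatalog:
    """Doc revision -> active chunk ids; old chunk sets are superseded,
    never mutated, so rollback is a pointer flip (illustrative sketch)."""
    def __init__(self):
        self.active = {}                      # doc_id -> (doc_rev, chunk_ids)
        self.superseded = defaultdict(list)   # doc_id -> prior (doc_rev, chunk_ids)

    def publish(self, doc_id, doc_rev, chunk_ids):
        # New content writes new chunk rows; the old set is kept for rollback.
        if doc_id in self.active:
            self.superseded[doc_id].append(self.active[doc_id])
        self.active[doc_id] = (doc_rev, tuple(chunk_ids))

    def rollback(self, doc_id):
        prev = self.superseded[doc_id].pop()
        self.active[doc_id] = prev
        return prev
```

Because chunks are immutable, a retrieval mid-rollback sees either the old set or the new one, never a blend of conflicting revisions.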

Blue/green indexes. Build collection kb_2026_05_vnext offline, run offline eval (Recall@k, contamination checks), flip routing in one flag, keep previous collection read-only for instant rollback if live metrics regress.

Compatibility window. During migration, dual-write or serve blended retrieval only with explicit provenance in the answer (“indexed with legacy embedder until June 1”). Close the window or you inherit permanent tech debt.

Interview sharp edge. Call out that re-embedding without re-chunking can move boundaries and change semantics—sometimes you must version chunking too, not only vectors.

73. How would you design a pipeline to process, chunk, embed, and index 10 million documents efficiently?

Work breakdown. Store canonical pointers in a job table sharded by source; workers claim batches with least-assigned scheduling. Avoid central coordinator hot spots—use lease timeouts and idempotent stages so crashes only lose seconds of work.

Embedding throughput. Batch to GPU inference (dynamic batching within latency budgets), pin similar sequence lengths together, and cache embeddings for byte-identical chunks resurfacing across tenants after legal holds lift.
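The byte-identical embed cache can be as simple as keying on a content hash; `embed_fn` below is a stand-in for the real batched GPU call:

```python
import hashlib

class EmbedCache:
    """Cache embeddings for byte-identical chunks so repeated content across
    tenants never hits the GPU twice (sketch; embed_fn is a stand-in)."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}   # sha256 hex -> embedding vector
        self.hits = 0

    def embed(self, chunk: bytes):
        key = hashlib.sha256(chunk).hexdigest()
        if key in self.cache:
            self.hits += 1
        else:
            self.cache[key] = self.embed_fn(chunk)
        return self.cache[key]
```

The hit counter doubles as a cheap telemetry signal for how much duplicate content the corpus carries.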

Storage layout. Raw files in object storage; chunk metadata in something query-friendly (columnar warehouse or wide KV). Bulk-load vectors with vendor-native import paths before falling back to per-vector HTTP upserts.

Observability at scale. Hourly random QA samples (parse coverage, language detection, empty chunk rate), alarm on embed error spikes, and cost meters (USD per 1k docs) so finance sees the burn curve during the multi-day run.

Resumability. Checkpoint per partition; support draining and resuming after code deploys. Long jobs will ship bugfixes mid-flight—design merge-safe state transitions.

74. How would you detect and handle duplicate or near-duplicate documents in an ingestion pipeline?

Exact dedupe. Cryptographic hash over normalized bytes (strip BOM, normalize line endings) keyed by tenant+path. Collisions across tenants may still be allowed; within a library, pick authoritative source based on path depth, modified time, or manual “golden” label.

Near-dupes. SimHash/MinHash shingles for cheap clustering; optional embedding distance for refereeing borderline pairs. Tune thresholds per content type—legal contracts tolerate fewer false merges than internal wikis.
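A minimal SimHash over word shingles shows the cheap-clustering idea; the 64-bit width and blake2b hash are arbitrary choices here, and real thresholds are tuned per content type:

```python
import hashlib

def simhash(text: str, shingle: int = 3) -> int:
    """64-bit SimHash over word shingles; near-duplicates land within a
    small Hamming distance of each other."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + shingle])
             for i in range(max(1, len(words) - shingle + 1))]
    weights = [0] * 64
    for g in grams:
        h = int.from_bytes(hashlib.blake2b(g.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")
```

Unrelated documents sit near a Hamming distance of 32 (random agreement); near-dupes land far below that, which is where the per-content-type threshold goes.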

Operational policy. Decide whether duplicates merge into one logical doc id (single embedding budget) or stay separate with cross-links. Always preserve audit metadata listing suppressed copies.

Cost angle. Dedupe before embed when possible; recompute hashes incrementally when connectors send “content unchanged” signals to skip GPU work.

Gotcha. Permission differences: two byte-identical PDFs with different ACLs are not the same document—hash the content once, but carry each copy’s ACL metadata (including the differences) explicitly.

75. How would you design a pipeline to keep embeddings up to date when the embedding model is upgraded?

Treat upgrades like database migrations: plan, measure, roll forward with rollback, communicate to customers relying on semantic search SLAs.

Dual index bake-off. Shadow-index representative tenants; run retrieval eval suites (gold questions + human grades) comparing nDCG/Recall@k and drift on sensitive queries before any traffic moves.

Phased re-embed. Wave by tenant tier or geography; persist cursor + per-doc status. Rate-limit API spend; pause waves automatically if error budgets burn.
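A resumable wave with a persisted cursor and an error budget might look like this sketch (state would live in a job table, not a return value):

```python
def reembed_wave(doc_ids, embed, cursor=0, error_budget=3):
    """Resumable re-embed pass: advances a cursor and pauses the wave when
    the error budget burns, leaving the cursor on the failed doc."""
    errors = 0
    while cursor < len(doc_ids):
        try:
            embed(doc_ids[cursor])
        except Exception:
            errors += 1
            if errors >= error_budget:
                # Pause without advancing: resume retries this doc.
                return {"status": "paused", "cursor": cursor, "errors": errors}
        cursor += 1
    return {"status": "done", "cursor": cursor, "errors": errors}
```

Resuming is just calling again with the persisted cursor after the connector or vendor issue is fixed.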

Query router. During overlap, route retrieval to the index matching the user’s contract (“legacy until Q3”) or blend with explicit ranking fusion—never silently merge incompatible vector spaces without calibration.

Documentation + support. Publish embedding model cards (dims, language coverage, expiry) and runbooks for rollback. Interviewers like hearing you coordinate with PM/legal on forward notice.

76. How would you architect a multi-source ingestion pipeline that pulls from Confluence, SharePoint, Notion, and S3?

Each connector implements the same contract: enumerate (pagination + cursors), fetch binary + metadata, normalize into a canonical HTML/Markdown-ish IR with attachment sidecars, map ACLs into your internal principal ids, emit provenance for lineage.

Rate limits & politeness. Token buckets per vendor + per tenant OAuth app; exponential backoff on 429/503 with jitter; global concurrency cap so you never trip enterprise-wide bans during initial backfills.
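A token bucket plus full-jitter backoff covers the politeness mechanics; rates and caps below are placeholders, not vendor numbers:

```python
import random
import time

class TokenBucket:
    """Per-vendor throttle (sketch): refill `rate` tokens/sec up to
    `capacity`; callers take one token per request."""
    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate, self.capacity, self.now = rate, capacity, now
        self.tokens, self.last = capacity, now()

    def try_acquire(self) -> bool:
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def backoff_with_jitter(attempt: int, base=0.5, cap=60.0) -> float:
    """Full-jitter exponential backoff for 429/503 responses."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Layer two buckets (per vendor, per tenant OAuth app) and a global concurrency cap on top, per the paragraph above.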

State store. Central table of external ids, delta tokens, last successful sync, consecutive failures, and content ETags so incremental sync is cheap and auditable. Dashboard per connector for SRE ownership.

Unified schema ≠ unified behavior. Confluence macros, SharePoint versioning, and Notion databases need adapters—hide quirks behind tests that replay recorded fixtures so refactors do not break prod silently.

Human workflows. Some sources require admin consent rotation; build self-service reconnect flows instead of paging platform every ninety days.

Example. SharePoint uses delta tokens; Confluence has different pagination—tests mock each vendor’s quirks.

77. How would you implement document-level access control in a knowledge base used by a RAG system?

Source of truth. IAM lives in IdP + SaaS (SharePoint ACLs, Confluence restrictions). At ingest, resolve groups to stable internal principal ids; store allow:principals or boolean expressions on every chunk row—never “search first, pray later.”

Query path. Gateway applies the intersection of user principals, impersonation scope, and data residency flags as mandatory vector/metadata filters before any LLM sees text. Log filter predicates (redacted) for debugging leakage incidents.
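One way to sketch the mandatory pre-retrieval filter; the field names and predicate shape are illustrative, not any vendor's API:

```python
def build_retrieval_filter(user_principals, impersonation_scope, residency_regions):
    """Mandatory pre-retrieval filter (sketch): the metadata predicate ANDed
    onto every vector query before any text reaches the LLM."""
    allowed = set(user_principals)
    if impersonation_scope is not None:
        # Impersonation can only narrow, never widen, the principal set.
        allowed &= set(impersonation_scope)
    if not allowed:
        raise PermissionError("fail closed: empty principal set")
    return {
        "must": [
            {"field": "allow_principals", "match_any": sorted(allowed)},
            {"field": "residency_region", "match_any": sorted(residency_regions)},
        ]
    }
```

Raising on an empty principal set is the fail-closed behavior strict mode demands: no filter, no retrieval.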

Latency vs safety. Precompute expanded group membership periodically for hot paths; fall back to live resolution for high-assurance modes. If ACL expansion fails while the org is in “strict” mode, fail closed and reject the query.

Edge cases. Broken group sync is a Sev2: if HR doc ACLs drift, you may silently over-share. Monitor staleness (now - membership_cache_updated_at) and block ingest when source auth tokens die.

Interview clarity. Mention public vs internal corpora separation, optional physical isolation for highly confidential shards, and that embeddings alone are not secret—metadata enforcement still matters.

78. How would you design a pipeline that extracts structured data from unstructured PDFs at scale?

OCR and layout. Run discriminators for digital vs scanned; invoke OCR with language packs when needed. Use layout-aware parsers (blocks, reading order) so tables do not arrive as scrambled lines.

Two-track strategy. Heuristics + rules for clean forms; vision or large models for messy bundles—route by confidence scores and doc archetype (invoice vs contract vs spec sheet).

Structured landing zone. Emit typed rows (line items, parties, dates) into a staging lake with source bbox citations for human review and downstream SQL—not straight into vectors if finance depends on decimals.

Quality gates. Sampled HITL for low-confidence fields, adversarial tests on rotated pages and multi-column layouts, schema validators (non-null keys, currency codes).

Cost control. Cache by page hash; short-circuit unchanged pages on revision 2+; batch expensive model calls overnight for bulk backfills.

79. How would you handle parsing failures, corrupt files, and unsupported file types in an ingestion pipeline?

Taxonomy of failure. UNSUPPORTED_MIME, TIMEOUT, DECODE_ERROR, MALWARE_QUARANTINE, VENDOR_5XX—each maps to user-facing copy, retry policy, and ops playbooks.

Retries. Retry transient errors with capped attempts and jitter; never spin indefinitely on a 12 GB “presentation” masquerading as a PDF. Enforce hard per-stage CPU and wall-clock ceilings.
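The taxonomy-to-policy mapping can be made explicit in code so retry behavior is never ad hoc (the classes come from the taxonomy above; the actions and limits are examples):

```python
import enum

class FailureClass(enum.Enum):
    # (next action, max retry attempts)
    UNSUPPORTED_MIME   = ("skip", 0)
    TIMEOUT            = ("retry", 3)
    DECODE_ERROR       = ("dlq", 0)
    MALWARE_QUARANTINE = ("quarantine", 0)
    VENDOR_5XX         = ("retry", 5)

    def __init__(self, action, max_attempts):
        self.action, self.max_attempts = action, max_attempts

def next_step(failure: FailureClass, attempt: int) -> str:
    """Bounded retries for transient classes; everything else goes straight
    to its terminal action (DLQ, skip, quarantine)."""
    if failure.action == "retry" and attempt < failure.max_attempts:
        return "retry"
    return "dlq" if failure.action == "retry" else failure.action
```

Each terminal action then maps to user-facing copy and an ops playbook, as described above.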

DLQ & UX. Surface failures in admin consoles with downloadable forensics (redacted); optionally email document owners with fix guidance (‘password-protected file’).

Security. Virus scan before extractors; sandbox parsers that invoke native libraries; block macros and dangerous OLE.

Analytics. Track failure rate by connector and template—often reveals a bot uploading camera photos of monitors, which is a product training problem not an LLM problem.

80. How would you design a data lineage system that tracks which source document a generated answer came from?

Minimum viable trace. Persist retrieval_trace_id with chunk ids, embedding collection version, doc revision, and generator model. Join in warehouse to answer ‘what evidence existed at T?’ for audits.
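A minimal append-only trace writer, assuming a JSON-lines sink; the field names are illustrative:

```python
import json
import time
import uuid

def record_retrieval_trace(chunk_ids, collection_version, doc_revs, model_id, sink):
    """Append-only retrieval trace (sketch): enough structure to answer
    'what evidence existed at time T?' from the warehouse."""
    trace = {
        "retrieval_trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "chunk_ids": list(chunk_ids),
        "embedding_collection": collection_version,
        "doc_revisions": dict(doc_revs),
        "generator_model": model_id,
    }
    sink.append(json.dumps(trace, sort_keys=True))
    return trace["retrieval_trace_id"]
```

The returned id is the short opaque handle users can paste into a support ticket without leaking internals.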

User-facing handles. Short opaque ids in footnotes users can give Support without leaking internals; deep links for admins show highlighted passages.

Immutable logs. Retrieval logs should be append-only for compliance; if PII must be redacted later, store hashes parallel to display-safe snippets.

Cross-system links. Map chunk ids to CRM case ids or git SHAs when tech docs are the source so investigations span teams.

Interview trap. If you only log final prompts, you cannot prove which chunk justified a claim after templates change—log structured retrieval, not screenshots of chat.

81. How would you design an automated pipeline that identifies outdated or stale knowledge in the vector store?

Signals. Upstream Last-Modified ahead of index timestamp; scheduled TTL for policies and prices; HRIS-driven owner departures; decay when clicks/rewards drop despite high raw scores.
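Those signals can be blended into a single refresh-priority score; the weighting below is a deliberately crude sketch, not a tuned model:

```python
def staleness_score(age_days, ttl_days, traffic_weight, owner_active=True):
    """Blend freshness signals into one refresh priority (sketch): docs past
    their TTL sort by how far past and how much traffic they get; docs whose
    owner has departed get a bump."""
    over_ttl = max(0.0, age_days / ttl_days - 1.0)
    score = over_ttl * traffic_weight
    if not owner_active:
        score *= 1.5
    return score
```

Price lists would carry short TTLs and high traffic weights, so they jump the refresh queue ahead of blog posts, as the automation paragraph below describes.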

Automation. Refresh queues prioritized by criticality (price lists before blog posts); batch delete tombstones when sources vanish.

Human nudges. Weekly digest to doc owners with “most-stale high-traffic” lists; gamify fixes inside the internal tooling teams already use.

RAG UX. Surfacing ‘last indexed 180d ago’ warnings for finance/legal answers improves trust and lowers hallucination blame.

Metrics. Track % corpus fresher than N days, mean time-to-refresh after edit, and incidents tied to stale snippets.

82. How would you manage schema changes in a vector database without downtime?

Compatibility layers. Treat metadata fields like API schemas—additive changes default-on, breaking changes behind dual-read/dual-write windows unless you can rebuild overnight.

Blue/green collections. New payload fields? Backfill asynchronously, verify parity on sampled keys, then atomically flip router. Keep previous collection attached for rollback during bake period.

Feature flags. Clients opt into filters using new fields only after backfill completion % crosses threshold; server rejects inconsistent combinations early to avoid ‘empty search’ mysteries.

Operational hygiene. Document SLO for backfill duration; chaos-test flip/rollback monthly so muscle memory exists before SEC asks.

Nuance. Some vendors require full reindex for index-type changes—say that aloud; panels reward folks who read vendor release notes.

83. How would you build a change data capture (CDC) pipeline that syncs a relational database to a vector store in near-real-time?

Capture. Debezium or native logical replication into a stream; normalize events into upsert/delete operations keyed by primary key with monotonic offsets per table partition.

Text projection. Deterministic SQL→sentence templates (‘Customer {{name}} status {{status}}’) so embeddings stay stable unless business fields truly change—avoid dumping random column order.
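A deterministic projection means irrelevant column changes never perturb the embedding. A toy version, with hypothetical field names:

```python
def project_row(row: dict, template_fields=("name", "status")) -> str:
    """Deterministic row -> sentence projection (sketch): fixed field order
    and formatting, so the output (and hence the embedding) only changes
    when a listed business field changes."""
    parts = [f"{f}={row[f]}" for f in template_fields]
    return "Customer " + " ".join(parts)
```

Comparing the old and new projection strings before embedding is also the cheapest way to skip GPU work on no-op CDC events.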

Ordering & tombstones. Apply per-key serial processing; deletes must propagate as vector tombstones or you leak PII of churned users. Handle out-of-order snapshots during initial sync.

Lag observability. Expose cdc_lag_ms and stall detectors; RAG can gate answers with freshness banners when lag > business threshold.

Interview tie-in. Same reasoning as search indexes—vectors are just another derived view of OLTP.

CDC outbox pattern

flowchart LR
  DB[(OLTP DB)] --> CDC[CDC log]
  CDC --> TR[Transform to text]
  TR --> EMB[Embed]
  EMB --> V[(Vector index)]

84. How would you design a data quality layer in the LLM data pipeline (PII removal, deduplication, relevance filtering)?

PII & secrets. Stack regexes, NER, and org-specific dictionaries (employee IDs, project codenames); redact consistently in logs and training exports so eval reflects production.
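A layered redactor might stack generic regexes with an org-specific lexicon; the patterns and the codename below are illustrative only, and a real stack would add NER:

```python
import re

# Generic detectors (sketch); a production stack layers NER on top and runs
# the same redaction over logs and training exports.
GENERIC = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]

def redact(text: str, org_terms=("PROJECT-TITAN",)) -> str:
    """Regex layer first, then the org lexicon (employee IDs, codenames)."""
    for pattern, label in GENERIC:
        text = pattern.sub(label, text)
    for term in org_terms:
        text = text.replace(term, "<CODENAME>")
    return text
```

Keeping the lexicon as data rather than code is what lets on-prem-only customers swap in their own backend behind the same interface.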

Dedup & boilerplate. Strip chrome (nav bars, footers) via DOM rules for HTML and heuristics for PDFs; drop null chunks; merge micro-fragments created by over-aggressive splits.

Toxic or off-topic. Lightweight classifiers to downrank spam uploads and ‘template-only’ pages before they burn embedding budget or pollute retrieval.

Measurement. Sample human-labeled quality sets; track precision/recall of PII detector on tricky locales; alert when new PowerPoint theme trips boilerplate rules.

Contractual. Some customers require on-prem detection only—design swappable backends behind one interface.

85. How would you architect a knowledge graph that complements a vector search index in an enterprise knowledge management system?

Division of labor. Graph holds authoritative entities and typed edges (owns, depends_on, located_at); vectors handle fuzzy wording and long-tail paraphrase. Ingestion links graph nodes to evidence chunks with provenance.

Query orchestration. Retrieve vector seeds, expand 1–k hops with ACL-aware graph queries, pack both subgraph summary and literal quotes for the LLM—reduces ‘confident wrong relation’ failures.
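The orchestration loop in miniature: vector seeds, then ACL-gated hop expansion. All callables here are stand-ins for the real index and graph clients:

```python
def hybrid_retrieve(query_vec, vector_search, graph_neighbors, user_acl, hops=1):
    """Vector seeds + ACL-aware graph expansion (sketch): fuzzy recall from
    the index, structure from the graph, both packed for the LLM."""
    seeds = vector_search(query_vec)          # [(node_id, score), ...]
    frontier = {n for n, _ in seeds}
    context = set(frontier)
    for _ in range(hops):
        frontier = {nb for n in frontier
                    for nb in graph_neighbors(n)
                    if user_acl(nb)}          # never expand past ACLs
        context |= frontier
    return sorted(context)
```

Note the ACL check runs inside the expansion, not after it: a forbidden node must not even contribute its neighborhood.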

Maintenance. Graph schema migration is painful—version edge types, run conformance checks, and detect orphans when documents delete.

Scale. Materialize hot traversals; cache neighborhood summaries for recurring executive questions; federate queries when the graph is sharded by region.

Interview candor. GraphRAG is not free: entity resolution mistakes propagate; invest in human-reviewed golden entities for critical domains.

Recap — this section

Q · Takeaway
71 · Normalized job envelope; ordered partitions; stage-level lag; backpressure without silent ACL loss; classified connector failures.
72 · Composite versions; immutable chunk supersession; blue/green with offline eval; finite dual-serve windows; chunk policy versioning.
73 · Sharded idempotent jobs; GPU batching + byte-level embed cache; bulk vector load; statistical QA + cost telemetry; partition checkpoints.
74 · Normalized hashing; fuzzy clustering with domain thresholds; explicit merge vs link policies; ACL-derived equivalence.
75 · Eval-gated dual indexes; phased cursor-driven re-embed; router/fusion discipline; customer-visible rollback story.
76 · Canonical connector interface; per-vendor throttling + cursor store; IR normalization; fixture-tested adapters; admin UX for OAuth pain.
77 · Authoritative external IAM mirrored on chunks; mandatory pre-LLM filters; membership freshness SLOs; fail-closed strict modes; separate tiers for secret corpora.
78 · Layout + OCR routing; archetype-specific extractors; bbox-grounded staging tables; sampled HITL; page-level caching.
79 · Typed error taxonomy; bounded retries and CPU caps; owner-visible DLQ; sandboxed parsing; failure dashboards by source.
80 · Structured retrieval traces + collection versions; support-grade references; append-only logs; cross-refs to business objects.
81 · Multi-signal freshness scoring; prioritized re-ingest; owner workflows; candid staleness UX; operational KPIs.
82 · Additive-first metadata; blue/green reindex; flag-gated client filters; rehearsed rollback; honesty about vendor limits.
83 · Keyed ordering + deletes as first-class; stable text projections; lag metrics; OLTP parity mindset.
84 · Layered PII handling with org lexicon; boilerplate stripping; spam classifiers; labeled monitoring; deployment modes per tenant.
85 · KG for structure, vectors for fuzz; hybrid packs; schema governance; cached traversals; entity-resolution realism.
