sharpbyte.dev

RAG with a knowledge base that changes in real time

You are building an AI assistant for a stock trading desk. 8-K filings, headlines, earnings transcripts, and regulatory notices arrive continuously—often many per minute. A trader asking “What just happened on NVDA?” must not get yesterday’s index. This guide designs ingestion, indexing, and cache invalidation so retrieval always reflects the latest truth.

Scenario

Design a RAG system where the knowledge base changes in real time.

You’re building an AI assistant for a stock trading desk. Market filings, news, earnings calls, and regulatory notices come in every minute. Your retrieval index must always reflect the latest information with no stale responses.

How do you design the ingestion, indexing, and cache invalidation pipeline?

What you should be able to do after reading:

Step 0 — Define “no stale responses”

Source typeTarget ingest→searchableWhy
Breaking headline / alert< 30 secondsDesk trades on seconds
Regulatory notice (halt, rule change)< 60 secondsCompliance and risk
8-K / filing metadata + summary1–3 minutesFast headline chunk first, full parse async
Earnings call transcript chunk2–5 minutes (streaming)Transcript arrives in partial segments
Full PDF filing body5–15 minutesHeavy OCR/tables; must not block headline path

Phrase that lands well: “Real-time RAG is not one pipeline—it is a fast lane for what traders need now and a slow lane for depth, both tied to the same document id and version.”

Step 1 — Clarifying questions

QuestionDrives
Licensed feeds vs scraped news?Connector design, dedup keys, legal retention
Material non-public information boundary?What must never enter the shared index
Query scoped to ticker / sector / desk?Partitioning and invalidation granularity
Do answers need point-in-time (“as of 9:31am”)?Versioned retrieval, not only “latest”
Corrections when headline was wrong?Tombstone + supersede events

Step 2 — The sixty-second answer

Event stream in, versioned documents out. Every item becomes an immutable MarketDocument event (publish, update, supersede, delete) on Kafka. A fast-path indexer embeds headline/summary chunks into a hot tier (per-ticker partitions) within seconds; a slow-path worker completes full filing parse and upserts the same doc_id with higher content_version.

Retrieval uses corpus_generation per partition plus published_at decay. Cache keys include ticker + max_event_seq so any new event for NVDA invalidates NVDA query caches automatically. The API returns as_of_timestamp and source timestamps so the UI can warn if slow-path body is still indexing. Stale answers are prevented by read-after-publish consistency on the hot tier, not by hoping TTL expires.

Step 3 — Architecture overview

flowchart TB
  subgraph feeds [Market feeds]
    EDGAR[Filings EDGAR]
    NEWS[News wire]
    EARN[Earnings stream]
    REG[Regulatory feed]
  end
  subgraph ingest [Ingestion]
    NORM[Normalize + entitlements]
    DEDUP[Dedup + entity link]
    BUS[Event bus - Kafka]
  end
  subgraph ledger [System of record]
    DOC[Document ledger Postgres]
    BLOB[Raw artifact store]
  end
  subgraph index [Indexing planes]
    HOT[Hot vector + lexical per ticker]
    WARM[Warm sharded ANN]
    EMB[Embed workers - priority queues]
  end
  subgraph serve [Serving]
    RET[Retrieval - filter by as_of]
    CACHE[Query cache - generation keyed]
    API[Desk assistant API]
  end
  EDGAR --> NORM
  NEWS --> NORM
  EARN --> NORM
  REG --> NORM
  NORM --> DEDUP --> BUS
  BUS --> DOC
  BUS --> BLOB
  BUS --> EMB
  EMB --> HOT
  EMB --> WARM
  DOC --> RET
  HOT --> RET
  WARM --> RET
  RET --> CACHE --> API
  BUS --> INV[Invalidation publisher]
  INV --> CACHE
    

Step 4 — Ingestion pipeline (by source)

Common envelope

MarketDocumentEvent {
  event_id, event_type: publish|patch|supersede|delete,
  doc_id, content_version,
  source: filing|news|earnings|regulatory,
  tickers[], sectors[],
  published_at, received_at,
  materiality: headline|summary|full_body,
  acl: desk_entitlements[],
  payload_ref  // S3 pointer or inline headline
}

Per source behavior

SourceIngest patternFast laneSlow lane
News wirePush webhook or streaming socketHeadline + lede paragraph chunkFull article, related tickers
8-K / filingsPoll EDGAR + vendor pushItem headline, exhibit list metadataPDF parse, tables, exhibits
Earnings callTranscript stream (partial JSON)“Call started” + guidance bullets as they shipSpeaker-labeled full transcript
RegulatoryLow volume, high priorityEntire notice inline (often short)Cross-links to affected symbols

Dedup and entity linking

Step 5 — Indexing strategy (incremental, not batch rebuild)

Two-tier indexes

TierContentsUpdate modeLatency to searchable
HotLast 24–72h per ticker, headlines + alertsUpsert/delete in place; small HNSW per tickerSeconds
WarmMonths of chunks, sharded ANNIncremental upsert; periodic compactionMinutes for heavy PDFs

Upsert contract

  1. Write event to ledger (source of truth).
  2. Enqueue embed job with priority: regulatory > headline > full body.
  3. On embed complete, atomic upsert vector + lexical row with doc_id, chunk_id, content_version, published_at.
  4. Publish IndexCommitted{partition, max_seq} to invalidation bus.

Supersession and deletes

Step 6 — Cache invalidation (where stale answers actually come from)

Even a fresh index fails if the API cache serves yesterday’s retrieval set.

Cache key design

cache_key = hash(
  normalized_query,
  ticker_filter,
  time_window,
  desk_id,
  hot_partition_generation,  // bumps on any commit to that partition
  warm_corpus_generation       // bumps less often
)

Invalidation triggers

TriggerScopeAction
IndexCommitted for ticker TAll queries mentioning TBump hot_partition_generation[T]; purge matching keys
Regulatory global noticeMarket-wideBump global generation; flush L1 cache
Embed model upgradeEntire warm tierNew warm_corpus_generation; blue/green index
User session “watchlist”Tickers A,B,CSubscribe invalidation fanout for those partitions

Patterns that work

Step 7 — Retrieval: always “latest” without lying

  1. Detect tickers and time intent in query (“today”, “just now”, “since open”).
  2. Search hot tier first with published_at > now - window.
  3. Merge warm results with recency boost; cap stale chunks older than window unless user asks for history.
  4. Attach published_at per citation; UI shows “2m ago via Reuters”.
  5. If hot tier empty but event known in ledger (indexing), respond: “Story indexing—here is headline from ledger” (optional) or abstain—never silent stale.

Step 8 — Generation guardrails for trading

Step 9 — Consistency model (be honest in the design)

GuaranteeWhat users get
After IndexCommitted on hot tierRead-your-writes for that ticker partition
During slow-path parseHeadline-accurate, body may deepen on refresh
Cross-ticker queriesEventually consistent within seconds per partition
Global regulatoryTarget all partitions notified within 60s

Step 10 — Failure points and mitigations

FailureStale symptomMitigation
Embed backlogHeadlines in ledger but not searchablePriority queue; shed full-body jobs; scale GPU
Cache generation not bumpedFast wrong answerIndex commit must publish invalidation; integration test
Duplicate doc_idsContradictory chunksDedup at ingest; supersede on merge
Clock skew on published_atWrong recency orderUTC from source; NTP; use received_at as tiebreak
Hot tier eviction too aggressiveMiss “this morning” storyTTL by time + ticker activity, not fixed count
Partial transcriptAnswer from incomplete callMark completeness=partial in metadata; UI badge
Wrong ticker linkIrrelevant NVDA news on AMD queryEntity linker confidence threshold; human feedback

Step 11 — Observability

Step 12 — How to walk through this in a design session

  1. 3 min — freshness SLA table by source; fast vs slow lane.
  2. 5 min — event bus + document ledger as source of truth.
  3. 7 min — hot/warm indexing and upsert/supersede semantics.
  4. 8 min — cache invalidation via generation counters (ticker-scoped).
  5. 5 min — retrieval recency + “indexing in progress” UX.
  6. Close — failures where cache outruns index.

Step 13 — Goals → knobs

GoalKnob
Fresher headlinesMore hot-tier replicas; higher embed priority
Deeper filing answersInvest in slow-path OCR; accept minutes delay
Lower costSmaller hot window; aggressive warm compaction
Safer complianceStricter MNPI filters at ingest; no shared cache across desks

The one line to remember

Live trading-desk RAG is an event-driven indexing product: version every document, update hot partitions in seconds, invalidate caches by ticker generation not luck, and tell the user when depth is still landing—never cache a stale retrieval set on top of a fresh market.