RAG with a knowledge base that changes in real time

You are building an AI assistant for a stock trading desk. 8-K filings, headlines, earnings transcripts, and regulatory notices arrive continuously—often many per minute. A trader asking “What just happened on NVDA?” must not get yesterday’s index. This guide designs ingestion, indexing, and cache invalidation so retrieval always reflects the latest truth.

Scenario

Design a RAG system where the knowledge base changes in real time.

You’re building an AI assistant for a stock trading desk. Market filings, news, earnings calls, and regulatory notices come in every minute. Your retrieval index must always reflect the latest information with no stale responses.

How do you design the ingestion, indexing, and cache invalidation pipeline?

What you should be able to do after reading:

Separate freshness SLA by source type (headline vs full 10-K parse).
Use an event-sourced document ledger plus incremental index updates, not nightly rebuilds.
Invalidate caches by ticker + corpus generation, not TTL alone.
Handle corrections, duplicates, and supersession so “latest” does not mean “wrong.”

Step 0 — Define “no stale responses”

Source type	Target ingest→searchable	Why
Breaking headline / alert	< 30 seconds	Desk trades on seconds
Regulatory notice (halt, rule change)	< 60 seconds	Compliance and risk
8-K / filing metadata + summary	1–3 minutes	Fast headline chunk first, full parse async
Earnings call transcript chunk	2–5 minutes (streaming)	Transcript arrives in partial segments
Full PDF filing body	5–15 minutes	Heavy OCR/tables; must not block headline path

Phrase that lands well: “Real-time RAG is not one pipeline—it is a fast lane for what traders need now and a slow lane for depth, both tied to the same document id and version.”

Step 1 — Clarifying questions

Question	Drives
Licensed feeds vs scraped news?	Connector design, dedup keys, legal retention
Material non-public information boundary?	What must never enter the shared index
Query scoped to ticker / sector / desk?	Partitioning and invalidation granularity
Do answers need point-in-time (“as of 9:31am”)?	Versioned retrieval, not only “latest”
Corrections when headline was wrong?	Tombstone + supersede events

Step 2 — The sixty-second answer

Event stream in, versioned documents out. Every item becomes an immutable MarketDocument event (publish, update, supersede, delete) on Kafka. A fast-path indexer embeds headline/summary chunks into a hot tier (per-ticker partitions) within seconds; a slow-path worker completes full filing parse and upserts the same doc_id with higher content_version.

Retrieval uses corpus_generation per partition plus published_at decay. Cache keys include ticker + max_event_seq so any new event for NVDA invalidates NVDA query caches automatically. The API returns as_of_timestamp and source timestamps so the UI can warn if slow-path body is still indexing. Stale answers are prevented by read-after-publish consistency on the hot tier, not by hoping TTL expires.

Step 3 — Architecture overview

flowchart TB
  subgraph feeds [Market feeds]
    EDGAR[Filings EDGAR]
    NEWS[News wire]
    EARN[Earnings stream]
    REG[Regulatory feed]
  end
  subgraph ingest [Ingestion]
    NORM[Normalize + entitlements]
    DEDUP[Dedup + entity link]
    BUS[Event bus - Kafka]
  end
  subgraph ledger [System of record]
    DOC[Document ledger Postgres]
    BLOB[Raw artifact store]
  end
  subgraph index [Indexing planes]
    HOT[Hot vector + lexical per ticker]
    WARM[Warm sharded ANN]
    EMB[Embed workers - priority queues]
  end
  subgraph serve [Serving]
    RET[Retrieval - filter by as_of]
    CACHE[Query cache - generation keyed]
    API[Desk assistant API]
  end
  EDGAR --> NORM
  NEWS --> NORM
  EARN --> NORM
  REG --> NORM
  NORM --> DEDUP --> BUS
  BUS --> DOC
  BUS --> BLOB
  BUS --> EMB
  EMB --> HOT
  EMB --> WARM
  DOC --> RET
  HOT --> RET
  WARM --> RET
  RET --> CACHE --> API
  BUS --> INV[Invalidation publisher]
  INV --> CACHE

Step 4 — Ingestion pipeline (by source)

Common envelope

MarketDocumentEvent {
  event_id, event_type: publish|patch|supersede|delete,
  doc_id, content_version,
  source: filing|news|earnings|regulatory,
  tickers[], sectors[],
  published_at, received_at,
  materiality: headline|summary|full_body,
  acl: desk_entitlements[],
  payload_ref  // S3 pointer or inline headline
}

Per source behavior

Source	Ingest pattern	Fast lane	Slow lane
News wire	Push webhook or streaming socket	Headline + lede paragraph chunk	Full article, related tickers
8-K / filings	Poll EDGAR + vendor push	Item headline, exhibit list metadata	PDF parse, tables, exhibits
Earnings call	Transcript stream (partial JSON)	“Call started” + guidance bullets as they ship	Speaker-labeled full transcript
Regulatory	Low volume, high priority	Entire notice inline (often short)	Cross-links to affected symbols

Dedup and entity linking

Canonical id: hash(source, vendor_story_id) for news; accession_number for filings.
Same story from two wires → one doc_id, merge tickers, bump content_version.
Entity linker maps “Nvidia” / “NVDA” / CIK to ticker partition keys for index routing.

Step 5 — Indexing strategy (incremental, not batch rebuild)

Two-tier indexes

Tier	Contents	Update mode	Latency to searchable
Hot	Last 24–72h per ticker, headlines + alerts	Upsert/delete in place; small HNSW per ticker	Seconds
Warm	Months of chunks, sharded ANN	Incremental upsert; periodic compaction	Minutes for heavy PDFs

Upsert contract

Write event to ledger (source of truth).
Enqueue embed job with priority: regulatory > headline > full body.
On embed complete, atomic upsert vector + lexical row with doc_id, chunk_id, content_version, published_at.
Publish IndexCommitted{partition, max_seq} to invalidation bus.

Supersession and deletes

Supersede: mark old chunks status=inactive in metadata; ANN delete or filter at query time—never return inactive.
Delete / retraction: hard delete from hot tier; tombstone in ledger for audit.
Correction headline: new event with higher content_version; retrieval ranks latest version only.

Step 6 — Cache invalidation (where stale answers actually come from)

Even a fresh index fails if the API cache serves yesterday’s retrieval set.

Cache key design

cache_key = hash(
  normalized_query,
  ticker_filter,
  time_window,
  desk_id,
  hot_partition_generation,  // bumps on any commit to that partition
  warm_corpus_generation       // bumps less often
)

Invalidation triggers

Trigger	Scope	Action
`IndexCommitted` for ticker T	All queries mentioning T	Bump `hot_partition_generation[T]`; purge matching keys
Regulatory global notice	Market-wide	Bump global generation; flush L1 cache
Embed model upgrade	Entire warm tier	New `warm_corpus_generation`; blue/green index
User session “watchlist”	Tickers A,B,C	Subscribe invalidation fanout for those partitions

Patterns that work

Generation counters in Redis per ticker—not scanning millions of cache keys.
Stale-while-revalidate only for non-trading FAQ; disable for market Q&A mode.
Single-flight on cache miss after invalidation to avoid thundering herd on NVDA earnings.

Step 7 — Retrieval: always “latest” without lying

Detect tickers and time intent in query (“today”, “just now”, “since open”).
Search hot tier first with published_at > now - window.
Merge warm results with recency boost; cap stale chunks older than window unless user asks for history.
Attach published_at per citation; UI shows “2m ago via Reuters”.
If hot tier empty but event known in ledger (indexing), respond: “Story indexing—here is headline from ledger” (optional) or abstain—never silent stale.

Step 8 — Generation guardrails for trading

Mandatory citations with source tier (headline vs full filing).
Numbers from filing chunks only, not model memory—link to structured line items when available.
Disclaimer in UI: not investment advice; data may be incomplete during ingest.
Rate limit repeated identical ticker queries to reduce cache churn abuse.

Step 9 — Consistency model (be honest in the design)

Guarantee	What users get
After `IndexCommitted` on hot tier	Read-your-writes for that ticker partition
During slow-path parse	Headline-accurate, body may deepen on refresh
Cross-ticker queries	Eventually consistent within seconds per partition
Global regulatory	Target all partitions notified within 60s

Step 10 — Failure points and mitigations

Failure	Stale symptom	Mitigation
Embed backlog	Headlines in ledger but not searchable	Priority queue; shed full-body jobs; scale GPU
Cache generation not bumped	Fast wrong answer	Index commit must publish invalidation; integration test
Duplicate doc_ids	Contradictory chunks	Dedup at ingest; supersede on merge
Clock skew on published_at	Wrong recency order	UTC from source; NTP; use received_at as tiebreak
Hot tier eviction too aggressive	Miss “this morning” story	TTL by time + ticker activity, not fixed count
Partial transcript	Answer from incomplete call	Mark `completeness=partial` in metadata; UI badge
Wrong ticker link	Irrelevant NVDA news on AMD query	Entity linker confidence threshold; human feedback

Step 11 — Observability

Freshness lag: now - published_at for top-1 retrieved chunk (p50/p95).
Pipeline lag: received_at → IndexCommitted per source type.
Stale cache hits detected when generation in key < current partition generation (should be zero).
Alerts: lag > SLA, invalidation publish failures, hot tier size anomalies.

Step 12 — How to walk through this in a design session

3 min — freshness SLA table by source; fast vs slow lane.
5 min — event bus + document ledger as source of truth.
7 min — hot/warm indexing and upsert/supersede semantics.
8 min — cache invalidation via generation counters (ticker-scoped).
5 min — retrieval recency + “indexing in progress” UX.
Close — failures where cache outruns index.

Step 13 — Goals → knobs

Goal	Knob
Fresher headlines	More hot-tier replicas; higher embed priority
Deeper filing answers	Invest in slow-path OCR; accept minutes delay
Lower cost	Smaller hot window; aggressive warm compaction
Safer compliance	Stricter MNPI filters at ingest; no shared cache across desks

The one line to remember

Live trading-desk RAG is an event-driven indexing product: version every document, update hot partitions in seconds, invalidate caches by ticker generation not luck, and tell the user when depth is still landing—never cache a stale retrieval set on top of a fresh market.