RAG with a knowledge base that changes in real time
You are building an AI assistant for a stock trading desk. 8-K filings, headlines, earnings transcripts, and regulatory notices arrive continuously—often many per minute. A trader asking “What just happened on NVDA?” must not get yesterday’s index. This guide designs ingestion, indexing, and cache invalidation so retrieval always reflects the latest truth.
Scenario
Design a RAG system where the knowledge base changes in real time.
You’re building an AI assistant for a stock trading desk. Market filings, news, earnings calls, and regulatory notices come in every minute. Your retrieval index must always reflect the latest information with no stale responses.
How do you design the ingestion, indexing, and cache invalidation pipeline?
What you should be able to do after reading:
- Separate freshness SLA by source type (headline vs full 10-K parse).
- Use an event-sourced document ledger plus incremental index updates, not nightly rebuilds.
- Invalidate caches by ticker + corpus generation, not TTL alone.
- Handle corrections, duplicates, and supersession so “latest” does not mean “wrong.”
Step 0 — Define “no stale responses”
| Source type | Target ingest→searchable | Why |
|---|---|---|
| Breaking headline / alert | < 30 seconds | Desk trades on seconds |
| Regulatory notice (halt, rule change) | < 60 seconds | Compliance and risk |
| 8-K / filing metadata + summary | 1–3 minutes | Fast headline chunk first, full parse async |
| Earnings call transcript chunk | 2–5 minutes (streaming) | Transcript arrives in partial segments |
| Full PDF filing body | 5–15 minutes | Heavy OCR/tables; must not block headline path |
Phrase that lands well: “Real-time RAG is not one pipeline—it is a fast lane for what traders need now and a slow lane for depth, both tied to the same document id and version.”
Step 1 — Clarifying questions
| Question | Drives |
|---|---|
| Licensed feeds vs scraped news? | Connector design, dedup keys, legal retention |
| Material non-public information boundary? | What must never enter the shared index |
| Query scoped to ticker / sector / desk? | Partitioning and invalidation granularity |
| Do answers need point-in-time (“as of 9:31am”)? | Versioned retrieval, not only “latest” |
| Corrections when headline was wrong? | Tombstone + supersede events |
Step 2 — The sixty-second answer
Event stream in, versioned documents out. Every item becomes an immutable
MarketDocument event (publish, update, supersede, delete) on Kafka.
A fast-path indexer embeds headline/summary chunks into a hot tier (per-ticker partitions) within seconds;
a slow-path worker completes full filing parse and upserts the same doc_id with higher content_version.
Retrieval uses corpus_generation per partition plus published_at decay. Cache keys include ticker + max_event_seq so any new event for NVDA invalidates NVDA query caches automatically. The API returns as_of_timestamp and source timestamps so the UI can warn if slow-path body is still indexing. Stale answers are prevented by read-after-publish consistency on the hot tier, not by hoping TTL expires.
Step 3 — Architecture overview
flowchart TB
subgraph feeds [Market feeds]
EDGAR[Filings EDGAR]
NEWS[News wire]
EARN[Earnings stream]
REG[Regulatory feed]
end
subgraph ingest [Ingestion]
NORM[Normalize + entitlements]
DEDUP[Dedup + entity link]
BUS[Event bus - Kafka]
end
subgraph ledger [System of record]
DOC[Document ledger Postgres]
BLOB[Raw artifact store]
end
subgraph index [Indexing planes]
HOT[Hot vector + lexical per ticker]
WARM[Warm sharded ANN]
EMB[Embed workers - priority queues]
end
subgraph serve [Serving]
RET[Retrieval - filter by as_of]
CACHE[Query cache - generation keyed]
API[Desk assistant API]
end
EDGAR --> NORM
NEWS --> NORM
EARN --> NORM
REG --> NORM
NORM --> DEDUP --> BUS
BUS --> DOC
BUS --> BLOB
BUS --> EMB
EMB --> HOT
EMB --> WARM
DOC --> RET
HOT --> RET
WARM --> RET
RET --> CACHE --> API
BUS --> INV[Invalidation publisher]
INV --> CACHE
Step 4 — Ingestion pipeline (by source)
Common envelope
MarketDocumentEvent {
event_id, event_type: publish|patch|supersede|delete,
doc_id, content_version,
source: filing|news|earnings|regulatory,
tickers[], sectors[],
published_at, received_at,
materiality: headline|summary|full_body,
acl: desk_entitlements[],
payload_ref // S3 pointer or inline headline
}
Per source behavior
| Source | Ingest pattern | Fast lane | Slow lane |
|---|---|---|---|
| News wire | Push webhook or streaming socket | Headline + lede paragraph chunk | Full article, related tickers |
| 8-K / filings | Poll EDGAR + vendor push | Item headline, exhibit list metadata | PDF parse, tables, exhibits |
| Earnings call | Transcript stream (partial JSON) | “Call started” + guidance bullets as they ship | Speaker-labeled full transcript |
| Regulatory | Low volume, high priority | Entire notice inline (often short) | Cross-links to affected symbols |
Dedup and entity linking
- Canonical id:
hash(source, vendor_story_id)for news;accession_numberfor filings. - Same story from two wires → one
doc_id, merge tickers, bumpcontent_version. - Entity linker maps “Nvidia” / “NVDA” / CIK to ticker partition keys for index routing.
Step 5 — Indexing strategy (incremental, not batch rebuild)
Two-tier indexes
| Tier | Contents | Update mode | Latency to searchable |
|---|---|---|---|
| Hot | Last 24–72h per ticker, headlines + alerts | Upsert/delete in place; small HNSW per ticker | Seconds |
| Warm | Months of chunks, sharded ANN | Incremental upsert; periodic compaction | Minutes for heavy PDFs |
Upsert contract
- Write event to ledger (source of truth).
- Enqueue embed job with priority: regulatory > headline > full body.
- On embed complete, atomic upsert vector + lexical row with
doc_id,chunk_id,content_version,published_at. - Publish
IndexCommitted{partition, max_seq}to invalidation bus.
Supersession and deletes
- Supersede: mark old chunks
status=inactivein metadata; ANN delete or filter at query time—never return inactive. - Delete / retraction: hard delete from hot tier; tombstone in ledger for audit.
- Correction headline: new event with higher
content_version; retrieval ranks latest version only.
Step 6 — Cache invalidation (where stale answers actually come from)
Even a fresh index fails if the API cache serves yesterday’s retrieval set.
Cache key design
cache_key = hash( normalized_query, ticker_filter, time_window, desk_id, hot_partition_generation, // bumps on any commit to that partition warm_corpus_generation // bumps less often )
Invalidation triggers
| Trigger | Scope | Action |
|---|---|---|
IndexCommitted for ticker T | All queries mentioning T | Bump hot_partition_generation[T]; purge matching keys |
| Regulatory global notice | Market-wide | Bump global generation; flush L1 cache |
| Embed model upgrade | Entire warm tier | New warm_corpus_generation; blue/green index |
| User session “watchlist” | Tickers A,B,C | Subscribe invalidation fanout for those partitions |
Patterns that work
- Generation counters in Redis per ticker—not scanning millions of cache keys.
- Stale-while-revalidate only for non-trading FAQ; disable for market Q&A mode.
- Single-flight on cache miss after invalidation to avoid thundering herd on NVDA earnings.
Step 7 — Retrieval: always “latest” without lying
- Detect tickers and time intent in query (“today”, “just now”, “since open”).
- Search hot tier first with
published_at > now - window. - Merge warm results with recency boost; cap stale chunks older than window unless user asks for history.
- Attach
published_atper citation; UI shows “2m ago via Reuters”. - If hot tier empty but event known in ledger (
indexing), respond: “Story indexing—here is headline from ledger” (optional) or abstain—never silent stale.
Step 8 — Generation guardrails for trading
- Mandatory citations with source tier (headline vs full filing).
- Numbers from filing chunks only, not model memory—link to structured line items when available.
- Disclaimer in UI: not investment advice; data may be incomplete during ingest.
- Rate limit repeated identical ticker queries to reduce cache churn abuse.
Step 9 — Consistency model (be honest in the design)
| Guarantee | What users get |
|---|---|
After IndexCommitted on hot tier | Read-your-writes for that ticker partition |
| During slow-path parse | Headline-accurate, body may deepen on refresh |
| Cross-ticker queries | Eventually consistent within seconds per partition |
| Global regulatory | Target all partitions notified within 60s |
Step 10 — Failure points and mitigations
| Failure | Stale symptom | Mitigation |
|---|---|---|
| Embed backlog | Headlines in ledger but not searchable | Priority queue; shed full-body jobs; scale GPU |
| Cache generation not bumped | Fast wrong answer | Index commit must publish invalidation; integration test |
| Duplicate doc_ids | Contradictory chunks | Dedup at ingest; supersede on merge |
| Clock skew on published_at | Wrong recency order | UTC from source; NTP; use received_at as tiebreak |
| Hot tier eviction too aggressive | Miss “this morning” story | TTL by time + ticker activity, not fixed count |
| Partial transcript | Answer from incomplete call | Mark completeness=partial in metadata; UI badge |
| Wrong ticker link | Irrelevant NVDA news on AMD query | Entity linker confidence threshold; human feedback |
Step 11 — Observability
- Freshness lag:
now - published_atfor top-1 retrieved chunk (p50/p95). - Pipeline lag: received_at → IndexCommitted per source type.
- Stale cache hits detected when generation in key < current partition generation (should be zero).
- Alerts: lag > SLA, invalidation publish failures, hot tier size anomalies.
Step 12 — How to walk through this in a design session
- 3 min — freshness SLA table by source; fast vs slow lane.
- 5 min — event bus + document ledger as source of truth.
- 7 min — hot/warm indexing and upsert/supersede semantics.
- 8 min — cache invalidation via generation counters (ticker-scoped).
- 5 min — retrieval recency + “indexing in progress” UX.
- Close — failures where cache outruns index.
Step 13 — Goals → knobs
| Goal | Knob |
|---|---|
| Fresher headlines | More hot-tier replicas; higher embed priority |
| Deeper filing answers | Invest in slow-path OCR; accept minutes delay |
| Lower cost | Smaller hot window; aggressive warm compaction |
| Safer compliance | Stricter MNPI filters at ingest; no shared cache across desks |
The one line to remember
Live trading-desk RAG is an event-driven indexing product: version every document, update hot partitions in seconds, invalidate caches by ticker generation not luck, and tell the user when depth is still landing—never cache a stale retrieval set on top of a fresh market.