RAG for financial PDFs — which search method would you pick?
You are asked to design retrieval for financial PDFs where a single missed number in search can cost millions. This guide is written from an architect’s chair: what I would ship, why plain vector search is not enough, and how each layer works in plain language.
Design prompt
You’re building RAG for financial PDFs. Missing a single number from search can cost millions.
What is the best RAG search method you can use here?
What you should be able to do after reading:
- State a clear answer: hybrid retrieval plus a structured numeric index, not embedding search alone.
- Explain why PDF tables and footnotes break naive chunking.
- Draw the ingest → index → retrieve → rerank → generate → verify path.
- Name eval metrics that actually measure “did we find the right figure?”
Step 0 — How to work the problem
- Restate the failure mode. Wrong prose is bad; wrong revenue, EPS, basis points, or effective date is catastrophic.
- Separate “search” from “answer.” Search must maximize recall of the right evidence; the LLM only summarizes what retrieval returns.
- Assume PDFs are hostile input. Multi-column layout, tables split across pages, headers repeated, numbers in images.
- Pick retrieval for recall, then spend latency on reranking and verification—not on hoping one embedding model “understands finance.”
Step 1 — The short answer
Best practical choice: a hybrid retriever (lexical BM25 + dense vectors) fused with learned reranking, sitting on top of table-aware PDF parsing and a structured numeric/fact index (figures, units, periods, page anchors). Add a mandatory citation + number-consistency check before the user sees the answer.
If you only remember one sentence for a whiteboard: “Vectors find the neighborhood; lexical and structured lookup find the exact million dollars.”
Step 2 — Why this problem is harder than “normal” RAG
Financial PDFs punish generic RAG pipelines in predictable ways:
- Numbers do not embed well. `$4.2B` and `$4.20B` can look “similar” in vector space while meaning different things; `2024` vs `2023` often cluster together.
- Tables are relationships, not paragraphs. Splitting a table every 512 tokens destroys row/column meaning. The model never sees that “Net income” aligns with “Q3 FY24.”
- Context is in headers and footnotes. “In millions, except per share” changes every number on the page. Miss the footnote → wrong magnitude by 1,000×.
- Queries are precise. Users ask “What was adjusted EBITDA in Q2?” not “tell me about performance.” Semantic search alone optimizes for theme, not for the one cell you need.
- The cost of omission is one-sided. In consumer Q&A, a missed chunk means a vague answer. Here it means a confident wrong figure—worse than “I don’t know.”
Step 3 — What I would not rely on alone
| Approach | Why it fails finance PDFs |
|---|---|
| Vector-only (single embedding index) | Weak exact match on digits, tickers, dates; table cells lose structure |
| Fixed-size text chunks (e.g. 500 tokens) | Splits tables and sentences; drops footnotes attached to the wrong chunk |
| HyDE / query expansion only | Can hallucinate extra numbers into the query; hurts precision |
| One reranker, no lexical path | Still misses rare strings (“Series B warrant”, ISIN, GAAP line items) |
| LLM answers without citations | No audit trail; compliance and debugging impossible |
Step 4 — Target architecture (four retrieval paths, one fuse)
Think of retrieval as four specialists that vote, then a judge (reranker) picks the final evidence pack:
```mermaid
flowchart TB
    Q[User query] --> ROUTE[Query router]
    ROUTE --> LEX[BM25 / lexical index]
    ROUTE --> VEC[Dense vector index]
    ROUTE --> NUM[Structured numeric index]
    ROUTE --> TBL[Table row index]
    PDF[PDF ingest] --> PARSE[Layout + table parser]
    PARSE --> LEX
    PARSE --> VEC
    PARSE --> NUM
    PARSE --> TBL
    LEX --> FUSE[RRF or weighted fusion]
    VEC --> FUSE
    NUM --> FUSE
    TBL --> FUSE
    FUSE --> RERANK[Cross-encoder rerank]
    RERANK --> PACK[Context pack + citations]
    PACK --> LLM[LLM generate]
    PACK --> VERIFY[Number verify gate]
    LLM --> VERIFY
    VERIFY --> OUT[Answer to user]
```
Path A — Lexical search (BM25 / OpenSearch / Elasticsearch)
What it does: Matches exact tokens—EBITDA, Q2 2024, 14.3%, company names, accounting phrases.
Why you need it: When the user quotes a label or period, lexical search is still the highest-recall way to land on the right paragraph or table row.
How I implement it: Index each logical block (paragraph, table row, footnote) with rich metadata: doc_id, page, section_title, fiscal_period, statement_type (income, balance, cash flow).
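A minimal sketch of the lexical path, assuming an in-memory index rather than OpenSearch or Elasticsearch. The `Block` and `BM25Index` names, the tokenizer, and the sample figures are all illustrative assumptions; the scoring is plain Okapi BM25 with a metadata filter applied before scoring.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Block:
    """One logical block (paragraph, table row, footnote) plus its metadata."""
    block_id: str
    text: str
    meta: dict = field(default_factory=dict)


def tokenize(text: str) -> list[str]:
    # Numbers keep decimals and a trailing '%'; words keep trailing digits ("q2").
    return re.findall(r"\d+(?:[.,]\d+)*%?|[a-z][a-z0-9]*", text.lower())


class BM25Index:
    def __init__(self, blocks, k1=1.5, b=0.75):
        self.blocks, self.k1, self.b = blocks, k1, b
        self.tfs = [Counter(tokenize(blk.text)) for blk in blocks]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / max(len(blocks), 1)
        # Document frequency: number of blocks containing each token.
        self.df = Counter(tok for tf in self.tfs for tok in tf)

    def search(self, query, k=50, filters=None):
        n = len(self.blocks)
        query_tokens = tokenize(query)
        scored = []
        for blk, tf, length in zip(self.blocks, self.tfs, self.lengths):
            if filters and any(blk.meta.get(key) != val for key, val in filters.items()):
                continue
            score = 0.0
            for tok in query_tokens:
                if tok not in tf:
                    continue
                idf = math.log(1 + (n - self.df[tok] + 0.5) / (self.df[tok] + 0.5))
                norm = tf[tok] + self.k1 * (1 - self.b + self.b * length / self.avg_len)
                score += idf * tf[tok] * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, blk.block_id))
        scored.sort(key=lambda pair: -pair[0])
        return [block_id for _, block_id in scored[:k]]


# Made-up blocks and figures, for illustration only.
blocks = [
    Block("b1", "Adjusted EBITDA was $412 million in Q2 2024.",
          {"doc_id": "acme-10q", "page": 14, "fiscal_period": "Q2 2024"}),
    Block("b2", "Adjusted EBITDA was $388 million in Q2 2023.",
          {"doc_id": "acme-10q", "page": 14, "fiscal_period": "Q2 2023"}),
    Block("b3", "Management expects margin expansion next year.",
          {"doc_id": "acme-10q", "page": 3, "fiscal_period": "Q2 2024"}),
]
index = BM25Index(blocks)
hits = index.search("adjusted EBITDA Q2 2024", k=5)
```

The exact token `2024` is what separates the two otherwise near-identical blocks, which is the behavior a lexical path is there to provide.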
Path B — Dense vectors
What it does: Finds passages that are semantically close when wording varies, such as “cost of revenue” vs “COGS” or “net loss” vs “negative earnings.”
Why you still need it: Users do not always use the same words as the PDF. Vectors recover paraphrases lexical search misses.
How I implement it: Embed at the same logical block granularity as BM25 (not random token windows). Store the same metadata on vector ids so both indexes align. Use a finance-tuned or general-purpose embedding model; the bigger win is chunking, not chasing the latest embedding name.
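To illustrate the "same ids, same metadata" point, here is a toy sketch of a vector lookup keyed by block id. The three-dimensional vectors, the ids, and the metadata values are invented for the example; a real system would get vectors from an embedding model and store them in a vector index, neither of which is shown here.

```python
import math

# Made-up 3-dimensional vectors; a real embedding model returns
# hundreds or thousands of dimensions per block.
VECTORS = {
    "b1": [0.9, 0.1, 0.0],   # "Cost of revenue rose year over year."
    "b2": [0.1, 0.9, 0.1],   # "The company reported a net loss."
    "b3": [0.0, 0.2, 0.9],   # "The board approved a share buyback."
}
# Same ids and metadata as the lexical index, so hits from both paths line up.
METADATA = {
    "b1": {"doc_id": "acme-10k", "page": 41, "statement_type": "income"},
    "b2": {"doc_id": "acme-10k", "page": 41, "statement_type": "income"},
    "b3": {"doc_id": "acme-10k", "page": 67, "statement_type": "cash flow"},
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec, k=50, filters=None):
    scored = []
    for block_id, vec in VECTORS.items():
        meta = METADATA[block_id]
        if filters and any(meta.get(key) != val for key, val in filters.items()):
            continue
        scored.append((cosine(query_vec, vec), block_id))
    scored.sort(key=lambda pair: -pair[0])
    return [block_id for _, block_id in scored[:k]]


# Pretend this is the embedding of the query "COGS"; it sits near b1.
query_vec = [0.85, 0.15, 0.05]
hits = vector_search(query_vec, k=2)
```

Because the function returns block ids rather than text, its output can be merged with lexical hits without any id mapping.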
Path C — Structured numeric / fact index
What it does: Treats numbers as first-class records: value, unit, currency, scale (thousands/millions), metric name, period, document, page, surrounding label text.
Why it matters: Query “revenue Q3 2024” becomes a lookup + filter problem, not similarity search in 1,536 dimensions.
How I implement it:
- During ingest, run layout-aware extraction (tables + nearby captions). Normalize numbers: detect “in millions” from footnotes and multiply before storage.
- Store in a document store or OLAP-friendly table: `(metric, period, value, unit, doc_id, page, row_text)`.
- At query time, if the parser detects a number, date, quarter, or metric token, query this index first and pin those hits into the context pack (guaranteed inclusion).
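The ingest and lookup steps above can be sketched roughly as follows. This assumes the scale note arrives as a plain footnote string and that facts live in an in-memory dict; `Fact`, `FactIndex`, `scale_for`, and `parse_cell` are hypothetical names, and the figures are made up. `Decimal` is used so stored values are exact.

```python
import re
from dataclasses import dataclass
from decimal import Decimal

SCALES = {
    "thousands": Decimal(1_000),
    "millions": Decimal(1_000_000),
    "billions": Decimal(1_000_000_000),
}


@dataclass(frozen=True)
class Fact:
    metric: str
    period: str
    value: Decimal   # fully scaled, e.g. 4200000000 for "4,200" in millions
    unit: str
    doc_id: str
    page: int
    row_text: str


def scale_for(metric: str, footnote: str) -> Decimal:
    note = footnote.lower()
    # "In millions, except per share": per-share metrics are not scaled.
    if "except per share" in note and "per share" in metric.lower():
        return Decimal(1)
    match = re.search(r"in (thousands|millions|billions)", note)
    return SCALES[match.group(1)] if match else Decimal(1)


def parse_cell(cell: str, scale: Decimal) -> Decimal:
    text = cell.strip()
    # Accounting convention: parentheses mean a negative value.
    negative = text.startswith("(") and text.endswith(")")
    value = Decimal(re.sub(r"[^\d.]", "", text)) * scale
    return -value if negative else value


class FactIndex:
    def __init__(self):
        self._facts = {}

    def add(self, fact: Fact) -> None:
        self._facts.setdefault((fact.metric.lower(), fact.period), []).append(fact)

    def lookup(self, metric: str, period: str) -> list[Fact]:
        return self._facts.get((metric.lower(), period), [])


footnote = "(In millions, except per share data)"
index = FactIndex()
index.add(Fact("Revenue", "Q3 2024",
               parse_cell("4,200", scale_for("Revenue", footnote)),
               "USD", "acme-10q", 12, "Revenue | 4,200 | 3,950"))
index.add(Fact("Earnings per share", "Q3 2024",
               parse_cell("1.37", scale_for("Earnings per share", footnote)),
               "USD", "acme-10q", 12, "Earnings per share | 1.37 | 1.21"))
hits = index.lookup("revenue", "Q3 2024")
```

A production version would also need to handle blank or dash cells and map metric synonyms to one canonical name; both are left out here.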
Path D — Table row index
Tables deserve their own treatment. Serialize each row as: header row + row label + cell values + page + table caption.
Index that string in both BM25 and vectors, but also keep the raw row JSON for the LLM so columns stay aligned.
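One way the row serialization could look, assuming the parser already yields a caption, a header row, and cell strings. The `serialize_row` function and the sample table are illustrative, not a fixed schema.

```python
import json


def serialize_row(caption, headers, row, page):
    """Return (index_text, raw) for one table row.

    headers[0] is the label-column header (often the scale note);
    row[0] is the row label. Duplicate column headers would collide
    in the dict and need disambiguation in real data.
    """
    label, cells = row[0], row[1:]
    by_column = dict(zip(headers[1:], cells))
    pairs = " | ".join(f"{col}: {val}" for col, val in by_column.items())
    index_text = f"{caption} | {headers[0]} | {label} | {pairs} | page {page}"
    raw = {
        "caption": caption,
        "page": page,
        "label_header": headers[0],
        "row_label": label,
        "cells": by_column,
    }
    return index_text, raw


# Made-up table row, for illustration only.
text, raw = serialize_row(
    "Consolidated Statements of Operations",
    ["(in millions)", "Q3 FY24", "Q3 FY23"],
    ["Net income", "512", "467"],
    page=42,
)
raw_json = json.dumps(raw)  # what the LLM receives alongside the citation
```

The flat string goes to BM25 and the embedder; the JSON keeps each value attached to its column header so the model cannot misalign periods.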
Step 5 — Ingestion: PDF parsing is part of search quality
Search cannot fix garbage ingest. Minimum bar I would enforce:
- Layout model (not plain text dump): detect reading order, columns, tables, headers.
- Table extraction to structured rows + CSV-like backup per table.
- Footnote linking: attach scale notes (“except per share”) to every numeric block on that page.
- Provenance always on: every chunk stores `page`, `bbox`, `doc_version`, `hash` for citations.
- Human QA sample: 1–2% of pages manually checked; track numeric extraction error rate.
Operational reality: OCR + layout mistakes are your top source of “missing millions.” Budget engineering time here before buying a fancier vector database.
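As a rough sketch of the provenance record and the QA error-rate tracking from the list above: the `Provenance` fields mirror the ones named there, while the bounding-box convention, the whitespace normalization before hashing, and the function names are assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    doc_id: str
    doc_version: str
    page: int
    bbox: tuple          # (x0, y0, x1, y1); coordinate convention is assumed
    content_hash: str


def make_provenance(doc_id, doc_version, page, bbox, text):
    # Hash whitespace-normalized text so harmless reflow does not change the hash.
    normalized = " ".join(text.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return Provenance(doc_id, doc_version, page, tuple(bbox), digest)


def extraction_error_rate(checked_pages):
    """checked_pages: list of (numbers_checked, numbers_wrong) from manual QA."""
    checked = sum(count for count, _ in checked_pages)
    wrong = sum(errors for _, errors in checked_pages)
    return wrong / checked if checked else 0.0


# Made-up values, for illustration only.
p1 = make_provenance("acme-10k", "v3", 42, (72.0, 510.5, 540.0, 528.0),
                     "Net income   512   467")
p2 = make_provenance("acme-10k", "v3", 42, (72.0, 510.5, 540.0, 528.0),
                     "Net income 512 467")
rate = extraction_error_rate([(40, 1), (60, 1)])
```

Storing the hash per chunk makes it cheap to detect which chunks changed after a parser upgrade, which is where a regression set earns its keep.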
Step 6 — Fusion and reranking (where precision is won)
Fusion
Run BM25 and vector search in parallel (k≈50 each). Merge with reciprocal rank fusion (RRF) or weighted scores. Always inject structured numeric hits into the merged list even if they rank low semantically—recall first for figures.
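A minimal sketch of reciprocal rank fusion with pinned numeric hits. The constant `k=60` is the value commonly used for RRF; placing pinned ids at the front is my own simplification so that truncation cannot drop them, and the reranker still decides the final order.

```python
def rrf_fuse(rankings, pinned=(), k=60, top_n=30):
    """Fuse ranked lists of block ids with reciprocal rank fusion.

    rankings: iterable of ranked id lists, best first.
    pinned:   ids that must survive truncation (e.g. structured numeric hits).
    """
    scores = {}
    for ranking in rankings:
        for rank, block_id in enumerate(ranking, start=1):
            scores[block_id] = scores.get(block_id, 0.0) + 1.0 / (k + rank)

    pinned_ids = list(dict.fromkeys(pinned))   # de-duplicate, keep order
    pinned_set = set(pinned_ids)
    # Sort by fused score; break ties by id so the output is deterministic.
    ordered = sorted(scores, key=lambda block_id: (-scores[block_id], block_id))
    rest = [block_id for block_id in ordered if block_id not in pinned_set]
    return (pinned_ids + rest)[:top_n]


lexical = ["b1", "b2", "b3"]
vector = ["b3", "b1", "b4"]
fused = rrf_fuse([lexical, vector], pinned=["f9"])
```

Here `b1` outranks `b3` because ranks 1 and 2 sum to more than ranks 3 and 1, and `f9` is kept even though neither text index returned it.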
Reranking
Send the top ~30 fused candidates through a cross-encoder reranker (query + passage pairs). Keep top 8–12 for the LLM context. Reranking is relatively expensive; that is fine—wrong numbers are more expensive.
Optional upgrade: late-interaction models (ColBERT-style) when you need token-level overlap on long table rows. I treat that as v2 after hybrid + cross-encoder is stable.
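The shape of the rerank stage (take a pool, score query and passage pairs, keep the best few) can be sketched like this. The `overlap_score` function is only a placeholder: it counts shared tokens and is not a cross-encoder. In practice `score_fn` would wrap a real model.

```python
def overlap_score(query: str, passage: str) -> float:
    """Placeholder scorer: fraction of query tokens present in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(query, candidates, score_fn=overlap_score, pool=30, keep=12):
    """candidates: dict of block_id -> passage text, in fused order."""
    pool_ids = list(candidates)[:pool]
    scored = [(score_fn(query, candidates[block_id]), block_id) for block_id in pool_ids]
    # sort() is stable, so ties keep their fused order.
    scored.sort(key=lambda pair: -pair[0])
    return [block_id for _, block_id in scored[:keep]]


# Made-up passages, for illustration only.
candidates = {
    "b3": "Management discussed strategy",
    "b1": "Adjusted EBITDA Q2 FY2024 was 412",
    "b2": "Adjusted EBITDA Q2 FY2023 was 388",
}
top = rerank("adjusted EBITDA Q2 FY2024", candidates, keep=2)
```

Keeping the scorer pluggable lets the pool size and keep size be tuned independently of which model does the scoring.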
Step 7 — Query routing (simple rules beat one-size-fits-all)
| Query signal | Retrieval bias |
|---|---|
| Contains digits, %, $, Q1–Q4, FY | Structured numeric index + table rows first |
| Exact GAAP line item (“deferred revenue”) | High BM25 weight + table index |
| Vague strategy question | Higher vector weight on narrative sections |
| Compare two periods | Retrieve both periods explicitly; two-pass retrieval if needed |
| Multi-doc portfolio | Per-doc caps so one 10-K does not crowd out others |
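A rule-based router along the lines of the table might look like the sketch below. The regexes, the `LINE_ITEMS` set, and the specific weights are illustrative assumptions; per-document caps are not covered.

```python
import re

# Digits also cover "Q2" and "FY2024"; a bare "FY" is matched separately.
NUMERIC_SIGNAL = re.compile(r"[\d%$]|\bFY\b", re.IGNORECASE)
COMPARE_SIGNAL = re.compile(r"\b(?:compare|versus|vs|yoy)\b", re.IGNORECASE)
# Hypothetical list; a real one would come from a taxonomy of line items.
LINE_ITEMS = {"deferred revenue", "adjusted ebitda", "net income"}


def route(query: str) -> dict:
    """Return per-path weights plus a two-pass flag for period comparisons."""
    lowered = query.lower()
    plan = {"lexical": 1.0, "vector": 1.0, "numeric": 0.0, "tables": 0.0,
            "two_pass": False}
    if NUMERIC_SIGNAL.search(query):
        plan.update(numeric=1.0, tables=1.0)
    if any(item in lowered for item in LINE_ITEMS):
        plan.update(lexical=2.0, tables=1.0)
    if COMPARE_SIGNAL.search(query):
        plan["two_pass"] = True
    if plan["numeric"] == 0.0 and plan["lexical"] == 1.0:
        plan["vector"] = 2.0   # no precise signal: treat as a narrative question
    return plan


lookup_plan = route("What was adjusted EBITDA in Q2 FY2024?")
vague_plan = route("How does management describe its strategy?")
compare_plan = route("Compare deferred revenue 2023 vs 2024")
```

Rules like these are easy to audit and to regression-test, which matters more here than squeezing out the last point of routing accuracy.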
Step 8 — Generation: citations and a number gate
The LLM is not trusted to invent figures. Contract I enforce:
- Every numeric claim must cite `[doc, page, snippet]` from the context pack.
- Prompt: “If the figure is not in context, say not found; do not estimate.”
- Post-generation verifier (rules + small model): extract numbers from the draft answer; check each appears in cited passages (allow unit normalization); flag mismatches → block or rewrite.
- For high-risk tenants, human approval on answers containing material numbers.
```
# Simplified retrieval request (conceptual)
POST /retrieve
{
  "query": "adjusted EBITDA Q2 FY2024",
  "filters": { "doc_type": "10-K", "issuer": "ACME" },
  "paths": ["lexical", "vector", "numeric", "tables"],
  "fusion": "rrf",
  "rerank_top_k": 12,
  "pin_numeric_hits": true
}
```
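The rules half of the post-generation verifier from this step could be sketched as follows. It assumes numbers can be pulled out with a regex and normalized for scale words such as "million" or "M"; the regex and the function names are illustrative, and the "small model" half is not shown.

```python
import re
from decimal import Decimal

NUM = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(billion|million|thousand|bn|mm|[bmk])?\b",
    re.IGNORECASE,
)
# Power of ten implied by each scale word or suffix.
SCALE = {"k": 3, "thousand": 3, "m": 6, "mm": 6, "million": 6,
         "b": 9, "bn": 9, "billion": 9}


def extract_numbers(text: str) -> set:
    """Return every number in the text as a scale-normalized Decimal."""
    values = set()
    for digits, suffix in NUM.findall(text):
        value = Decimal(digits.replace(",", ""))
        if suffix:
            value = value * Decimal(10) ** SCALE[suffix.lower()]
        values.add(value)
    return values


def unsupported_numbers(answer: str, cited_passages: list) -> list:
    """Numbers in the draft answer that appear in none of the cited passages."""
    supported = set()
    for passage in cited_passages:
        supported |= extract_numbers(passage)
    return sorted(extract_numbers(answer) - supported)


# Made-up passage and answers, for illustration only.
cited = ["Adjusted EBITDA Q2 FY2024: $412M (page 14)"]
ok = unsupported_numbers("Adjusted EBITDA was $412 million in Q2 FY2024.", cited)
bad = unsupported_numbers("Adjusted EBITDA was $421 million in Q2 FY2024.", cited)
```

An empty list lets the answer through; anything else blocks or triggers a rewrite. This version is deliberately loose: any number in a cited passage, including a page number, counts as support, and table-level scale notes are not applied, so a stricter gate would compare against structured facts instead.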
Step 9 — Napkin math (why hybrid is worth the ops cost)
Illustrative scale for a bank-internal corpus:
- 50k PDFs, ~80 pages average → 4M pages.
- ~6 logical blocks per page → ~24M index records (lexical + vector ids point to same blocks).
- Peak QPS 20, p95 retrieval budget 800ms including rerank on cached warm indexes.
- Structured numeric store: ~40M fact rows after table extraction, which fits a column store partitioned by `doc_id`.
Hybrid adds two indexes and a reranker GPU pool—but avoids the class of errors that have unbounded financial downside.
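The arithmetic above, written out so the inputs are easy to change. The vector-storage line is an extra estimate that assumes 1,536-dimensional float32 embeddings with no compression; the rerank line reuses the pool of about 30 candidates from Step 6.

```python
pdfs = 50_000
pages_per_pdf = 80
blocks_per_page = 6

pages = pdfs * pages_per_pdf                 # 4,000,000 pages
blocks = pages * blocks_per_page             # 24,000,000 index records

fact_rows = 40_000_000
facts_per_page = fact_rows / pages           # 10 numeric facts per page on average

# Assumption: 1,536-dim float32 vectors, one per block, uncompressed.
embedding_dims = 1_536
bytes_per_float = 4
vector_bytes = blocks * embedding_dims * bytes_per_float
vector_gb = vector_bytes / 1e9               # about 147 GB of raw vectors

# Reranker load at peak: 20 queries per second, 30 candidate pairs each.
peak_qps = 20
rerank_pool = 30
rerank_pairs_per_sec = peak_qps * rerank_pool
```

At this scale the raw vectors fit on a small cluster, and the reranker has to sustain a few hundred pair scorings per second at peak.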
Step 10 — How I would evaluate “did search find the number?”
Generic RAG benchmarks (nDCG on MS MARCO) lie to you here. Track finance-specific sets:
| Metric | What it measures |
|---|---|
| Numeric recall@k | Given (doc, metric, period), is the correct value in top-k chunks? |
| Table row recall@k | Full row retrieved with headers intact |
| Citation accuracy | Does cited span actually contain the claimed number? |
| Answer number F1 | F1 of figures extracted from the answer against gold values, with each figure also checked against the retrieved context |
| Abstention rate | Correct “not in documents” when evidence missing |
Build 200–500 human-labeled questions from real filings where analysts already know the gold page and cell. Regression-test every ingest parser change on this set.
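Two of the metrics are simple enough to sketch directly. The input shapes (gold block id plus ranked ids, and evidence/abstention flags) are assumptions about how the labeled set is stored.

```python
def numeric_recall_at_k(examples, k):
    """examples: list of (gold_block_id, ranked_block_ids) pairs.

    A hit means the block holding the gold value is in the top k.
    """
    if not examples:
        return 0.0
    hits = sum(1 for gold, ranked in examples if gold in ranked[:k])
    return hits / len(examples)


def abstention_rate(outcomes):
    """outcomes: list of (evidence_exists, system_abstained) booleans.

    Scores only the questions whose evidence is missing from the corpus.
    """
    missing = [abstained for exists, abstained in outcomes if not exists]
    return sum(missing) / len(missing) if missing else 0.0


# Made-up evaluation rows, for illustration only.
examples = [
    ("b1", ["b1", "b2"]),
    ("b7", ["b3", "b7", "b9"]),
    ("b5", ["b2", "b4", "b6"]),
]
recall_at_1 = numeric_recall_at_k(examples, k=1)
recall_at_2 = numeric_recall_at_k(examples, k=2)
abstain = abstention_rate([(True, False), (False, True), (False, False)])
```

Running these per parser version gives a direct before/after comparison whenever ingest changes.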
Step 11 — Failure modes and mitigations
- Wrong scale (thousands vs millions) — footnote detector + normalization in structured index.
- Right number, wrong year — period metadata required on every chunk; filter by fiscal_period in query router.
- Scanned PDF / image tables — OCR quality gate; route low-confidence pages to human review queue.
- Retrieval returns narrative, misses table — pin table rows when query has metric tokens; boost table index weight.
- LLM swaps two similar figures — verifier compares digit strings to cited spans before publish.
Step 12 — Goals → knobs
| Goal | Knob |
|---|---|
| Never miss the right figure | Hybrid fusion + pin structured hits + high recall@k before rerank |
| Precision in the context window | Cross-encoder rerank; fewer, richer chunks |
| Auditable answers | Page-level citations; logged retrieval snapshot per query |
| Lower latency | Warm lexical cache; smaller rerank pool; async pre-fetch for known dashboards |
| Lower cost | Cheaper reranker; cache frequent queries; skip vector path on pure lookup queries |
Step 13 — Close the loop
On a whiteboard: draw four retrieval paths into fusion → rerank → cite → verify. Say why vector-only loses on “Q3 revenue $4.2B.”
Out loud: walk one query where BM25 finds the line item, vectors find the paraphrase, structured index locks the cell value.
In production: monitor numeric recall@k weekly; treat parser regressions as Sev-1 for finance tenants.
The one line to remember
For financial PDF RAG, the best search is hybrid retrieval plus structured numbers—lexical and tables for precision, vectors for paraphrase, reranking for the final context, and a verification gate so the model never free-types a figure that search did not support.