RAG for financial PDFs — which search method would you pick?

You are asked to design retrieval for financial PDFs where a single missed number in search can cost millions. This guide is written from an architect’s chair: what I would ship, why plain vector search is not enough, and how each layer works in plain language.

Design prompt

You’re building RAG for financial PDFs. Missing a single number from search can cost millions.

What is the best RAG search method you can use here?

What you should be able to do after reading:

State a clear answer: hybrid retrieval plus a structured numeric index, not embedding search alone.
Explain why PDF tables and footnotes break naive chunking.
Draw the ingest → index → retrieve → rerank → generate → verify path.
Name eval metrics that actually measure “did we find the right figure?”

Step 0 — How to work the problem

Restate the failure mode. Wrong prose is bad; wrong revenue, EPS, basis points, or effective date is catastrophic.
Separate “search” from “answer.” Search must maximize recall of the right evidence; the LLM only summarizes what retrieval returns.
Assume PDFs are hostile input. Multi-column layout, tables split across pages, headers repeated, numbers in images.
Pick retrieval for recall, then spend latency on reranking and verification—not on hoping one embedding model “understands finance.”

Step 1 — The short answer

Best practical choice: a hybrid retriever (lexical BM25 + dense vectors) fused with learned reranking, sitting on top of table-aware PDF parsing and a structured numeric/fact index (figures, units, periods, page anchors). Add a mandatory citation + number-consistency check before the user sees the answer.

If you only remember one sentence for a whiteboard: “Vectors find the neighborhood; lexical and structured lookup find the exact million dollars.”

Step 2 — Why this problem is harder than “normal” RAG

Financial PDFs punish generic RAG pipelines in predictable ways:

Numbers do not embed well. $4.2B and $4.2M can look “similar” in vector space while differing by a factor of 1,000; 2024 and 2023 often cluster together.
Tables are relationships, not paragraphs. Splitting a table every 512 tokens destroys row/column meaning. The model never sees that “Net income” aligns with “Q3 FY24.”
Context is in headers and footnotes. “In millions, except per share” changes every number on the page. Miss the footnote → wrong magnitude by 1,000× or more.
Queries are precise. Users ask “What was adjusted EBITDA in Q2?” not “tell me about performance.” Semantic search alone optimizes for theme, not for the one cell you need.
The cost of omission is one-sided. In consumer Q&A, a missed chunk means a vague answer. Here it means a confident wrong figure—worse than “I don’t know.”

Step 3 — What I would not rely on alone

Approach	Why it fails finance PDFs
Vector-only (single embedding index)	Weak exact match on digits, tickers, dates; table cells lose structure
Fixed-size text chunks (e.g. 500 tokens)	Splits tables and sentences; separates footnotes from the numbers they govern
HyDE / query expansion only	Can hallucinate extra numbers into the query; hurts precision
One reranker, no lexical path	Still misses rare strings (“Series B warrant”, ISIN, GAAP line items)
LLM answers without citations	No audit trail; compliance and debugging impossible

Step 4 — Target architecture (four retrieval paths, one fuse)

Think of retrieval as four specialists that vote, then a judge (reranker) picks the final evidence pack:

flowchart TB
  Q[User query] --> ROUTE[Query router]
  ROUTE --> LEX[BM25 / lexical index]
  ROUTE --> VEC[Dense vector index]
  ROUTE --> NUM[Structured numeric index]
  ROUTE --> TBL[Table row index]
  PDF[PDF ingest] --> PARSE[Layout + table parser]
  PARSE --> LEX
  PARSE --> VEC
  PARSE --> NUM
  PARSE --> TBL
  LEX --> FUSE[RRF or weighted fusion]
  VEC --> FUSE
  NUM --> FUSE
  TBL --> FUSE
  FUSE --> RERANK[Cross-encoder rerank]
  RERANK --> PACK[Context pack + citations]
  PACK --> LLM[LLM generate]
  PACK --> VERIFY[Number verify gate]
  LLM --> VERIFY
  VERIFY --> OUT[Answer to user]

Path A — Lexical search (BM25 / OpenSearch / Elasticsearch)

What it does: Matches exact tokens such as EBITDA, Q2 2024, 14.3%, company names, and accounting phrases.

Why you need it: When the user quotes a label or period, lexical search is still the highest-recall way to land on the right paragraph or table row.

How I implement it: Index each logical block (paragraph, table row, footnote) with rich metadata: doc_id, page, section_title, fiscal_period, statement_type (income, balance, cash flow).
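
To make the block-level indexing concrete, here is a minimal sketch of BM25 over logical blocks with a metadata filter. It is a toy in-memory version, not how OpenSearch or Elasticsearch implement it; the `BM25Index` class, the `tokenize` rule, and the sample blocks are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Keep tokens such as "14.3%" and "q2" whole so exact figures stay matchable.
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)?%?", text.lower())

class BM25Index:
    """Toy in-memory BM25 over logical blocks (hypothetical helper)."""

    def __init__(self, blocks, k1=1.5, b=0.75):
        self.blocks = blocks
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(blk["text"])) for blk in blocks]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / len(blocks)
        # Document frequency: in how many blocks each term appears.
        self.df = Counter(term for tf in self.tfs for term in tf)

    def search(self, query, k=50, **filters):
        n = len(self.blocks)
        scored = []
        for i, (blk, tf) in enumerate(zip(self.blocks, self.tfs)):
            # Metadata filters (e.g. fiscal_period) are applied before scoring.
            if any(blk.get(key) != val for key, val in filters.items()):
                continue
            score = 0.0
            for term in tokenize(query):
                if term not in tf:
                    continue
                idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
                norm = tf[term] + self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
                score += idf * tf[term] * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, blk["block_id"]))
        return [block_id for _, block_id in sorted(scored, reverse=True)[:k]]

# Illustrative blocks; the text and metadata values are made up.
blocks = [
    {"block_id": "b1", "page": 12, "fiscal_period": "Q2 2024", "statement_type": "income",
     "text": "Adjusted EBITDA margin was 14.3% in Q2 2024"},
    {"block_id": "b2", "page": 13, "fiscal_period": "Q2 2023", "statement_type": "income",
     "text": "Adjusted EBITDA margin was 12.9% in Q2 2023"},
    {"block_id": "b3", "page": 40, "fiscal_period": "Q2 2024", "statement_type": "narrative",
     "text": "Management expects continued investment in growth"},
]
index = BM25Index(blocks)
print(index.search("EBITDA 14.3%"))
print(index.search("EBITDA", fiscal_period="Q2 2023"))
```

The exact token `14.3%` is rare, so it carries a high IDF weight and pulls the right block to the top; an embedding has no equivalent lever.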

Path B — Dense vectors

What it does: Finds passages that are semantically close when wording varies, such as “cost of revenue” vs “COGS” or “net loss” vs “negative earnings.”

Why you still need it: Users do not always use the same words as the PDF. Vectors recover paraphrases lexical search misses.

How I implement it: Embed at the same logical block granularity as BM25 (not random token windows). Store the same metadata on vector ids so both indexes align. Use a finance-tuned or general-purpose embedding model; the bigger win is chunking, not chasing the latest embedding name.
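
A minimal sketch of the alignment idea: the vector index is keyed by the same block ids and carries the same metadata as the lexical index. The three-dimensional vectors are hand-made placeholders, not real embeddings; in practice they would come from whichever embedding model you choose. `VectorIndex` and the block ids are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class VectorIndex:
    """Brute-force cosine search keyed by the same block ids as the lexical index."""

    def __init__(self):
        self.vectors = {}   # block_id -> embedding
        self.metadata = {}  # block_id -> same metadata the lexical index stores

    def add(self, block_id, vector, metadata):
        self.vectors[block_id] = vector
        self.metadata[block_id] = metadata

    def search(self, query_vector, k=50, **filters):
        scored = [
            (cosine(query_vector, vec), block_id)
            for block_id, vec in self.vectors.items()
            if all(self.metadata[block_id].get(key) == val for key, val in filters.items())
        ]
        return [block_id for _, block_id in sorted(scored, reverse=True)[:k]]

vec_index = VectorIndex()
# Placeholder vectors: imagine b7 is a "cost of revenue" block and b8 a "net loss" block.
vec_index.add("b7", [0.9, 0.1, 0.0], {"page": 21, "fiscal_period": "FY2024"})
vec_index.add("b8", [0.0, 0.2, 0.9], {"page": 22, "fiscal_period": "FY2024"})
cogs_query = [0.8, 0.2, 0.1]  # placeholder for the embedding of "COGS"
print(vec_index.search(cogs_query, k=1))
```

Because both indexes return block ids rather than text, fusion downstream can merge them without any string matching.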

Path C — Structured numeric / fact index

What it does: Treats numbers as first-class records: value, unit, currency, scale (thousands/millions), metric name, period, document, page, surrounding label text.

Why it matters: Query “revenue Q3 2024” becomes a lookup + filter problem, not similarity search in 1,536 dimensions.

How I implement it:

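During ingest, run layout-aware extraction (tables + nearby captions). Normalize numbers: detect scale notes such as “in millions” from headers and footnotes and multiply before storage, keeping the printed value alongside the scaled one so the verifier can still match the digits on the page. Per-share figures covered by an “except per share” note stay unscaled.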
Store in a document store or OLAP-friendly table: (metric, period, value, unit, doc_id, page, row_text).
At query time, if the parser detects a number, date, quarter, or metric token, query this index first and pin those hits into the context pack (guaranteed inclusion).
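
The ingest and lookup steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Fact` record, `make_fact`, and `lookup` are hypothetical names, the unit label `USD/share` is my own convention, and negative values printed in parentheses are not handled.

```python
from dataclasses import dataclass
from decimal import Decimal

SCALES = {
    "thousands": Decimal(1_000),
    "millions": Decimal(1_000_000),
    "billions": Decimal(1_000_000_000),
}

@dataclass(frozen=True)
class Fact:
    metric: str
    period: str
    printed: str     # digits exactly as they appear on the page
    value: Decimal   # printed value multiplied by the page scale
    unit: str
    doc_id: str
    page: int
    row_text: str

def make_fact(metric, period, printed, scale_note, unit, doc_id, page, row_text):
    raw = Decimal(printed.replace(",", "").replace("$", ""))
    note = scale_note.lower()
    scale = next((mult for word, mult in SCALES.items() if word in note), Decimal(1))
    # "except per share": per-share figures are not scaled by the page note.
    if unit == "USD/share" and "except per share" in note:
        scale = Decimal(1)
    return Fact(metric.lower(), period, printed, raw * scale, unit, doc_id, page, row_text)

def lookup(facts, metric, period):
    # Exact lookup + filter: no similarity search involved.
    return [f for f in facts if f.metric == metric.lower() and f.period == period]

note = "In millions, except per share"
facts = [
    make_fact("Revenue", "Q3 2024", "4,200", note, "USD", "acme-10q", 5, "Revenue 4,200 3,900"),
    make_fact("Diluted EPS", "Q3 2024", "1.25", note, "USD/share", "acme-10q", 5, "Diluted EPS 1.25 1.10"),
]
for fact in lookup(facts, "revenue", "Q3 2024"):
    print(fact.metric, fact.period, fact.value, fact.doc_id, fact.page)
```

Using `Decimal` rather than floats avoids rounding drift when values are scaled and later compared.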

Path D — Table row index

Tables deserve their own treatment. Serialize each row as: header row + row label + cell values + page + table caption. Index that string in both BM25 and vectors, but also keep the raw row JSON for the LLM so columns stay aligned.
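
One way to serialize a row, as a rough sketch: a flat string for the BM25 and vector indexes, plus raw JSON that keeps each cell tied to its column header. The `serialize_row` function, its field names, and the sample table are assumptions made for illustration.

```python
import json

def serialize_row(caption, headers, row_label, cells, page):
    # Pair each cell with its column header so alignment survives flattening.
    pairs = "; ".join(f"{header}: {cell}" for header, cell in zip(headers, cells))
    index_text = f"{caption} | {row_label} | {pairs} | page {page}"
    raw = {
        "caption": caption,
        "row_label": row_label,
        "page": page,
        "cells": dict(zip(headers, cells)),
    }
    return {"index_text": index_text, "raw_json": json.dumps(raw)}

row = serialize_row(
    caption="Consolidated Statements of Operations (in millions)",
    headers=["Q3 FY24", "Q3 FY23"],
    row_label="Net income",
    cells=["512", "468"],
    page=7,
)
print(row["index_text"])
print(row["raw_json"])
```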

Step 5 — Ingestion: PDF parsing is part of search quality

Search cannot fix garbage ingest. Minimum bar I would enforce:

Layout model (not plain text dump): detect reading order, columns, tables, headers.
Table extraction to structured rows + CSV-like backup per table.
Footnote linking: attach scale notes (“in millions, except per share”) to every numeric block they govern on that page.
Provenance always on: every chunk stores page, bbox, doc_version, hash for citations.
Human QA sample: 1–2% of pages manually checked; track numeric extraction error rate.
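
A small sketch of the last two items: a provenance record attached to every chunk, and the error rate computed from the human QA sample. `Provenance`, `make_provenance`, and `extraction_error_rate` are hypothetical names; SHA-256 is my assumption for the hash, since any stable content hash would do.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    doc_id: str
    doc_version: str
    page: int
    bbox: tuple        # (x0, y0, x1, y1) in page coordinates
    content_hash: str  # changes whenever the extracted text changes

def make_provenance(doc_id, doc_version, page, bbox, text):
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return Provenance(doc_id, doc_version, page, tuple(bbox), digest)

def extraction_error_rate(checked):
    # checked: (extracted_value, human_verified_value) pairs from the QA sample.
    if not checked:
        return 0.0
    wrong = sum(1 for extracted, verified in checked if extracted != verified)
    return wrong / len(checked)

prov = make_provenance("acme-10q", "v2", 7, (72, 310, 540, 328), "Net income 512 468")
qa_sample = [("4,200", "4,200"), ("512", "572")]  # second pair simulates an OCR digit swap
print(prov.page, prov.content_hash[:12])
print(extraction_error_rate(qa_sample))
```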

Operational reality: OCR + layout mistakes are your top source of “missing millions.” Budget engineering time here before buying a fancier vector database.

Step 6 — Fusion and reranking (where precision is won)

Fusion

Run BM25 and vector search in parallel (k≈50 each). Merge with reciprocal rank fusion (RRF) or weighted scores. Always inject structured numeric hits into the merged list even if they rank low semantically—recall first for figures.
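
Reciprocal rank fusion with pinned numeric hits can be sketched in a few lines. The constant `k=60` is a commonly used RRF default. Placing pinned ids at the front of the list is my assumption about how to guarantee inclusion, and `rrf_fuse` is a hypothetical name.

```python
def rrf_fuse(ranked_lists, pinned=(), k=60, top_n=30):
    """Fuse ranked lists of block ids; pinned ids are always kept."""
    scores = {}
    for ranking in ranked_lists:
        for rank, block_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it returns.
            scores[block_id] = scores.get(block_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=lambda block_id: (-scores[block_id], block_id))
    pinned = list(dict.fromkeys(pinned))  # de-duplicate, keep order
    rest = [block_id for block_id in fused if block_id not in pinned]
    # Pinned hits go first so truncation to top_n can never drop them
    # (as long as there are fewer than top_n pinned ids).
    return (pinned + rest)[:top_n]

bm25_hits = ["b1", "b2", "b3"]
vector_hits = ["b2", "b4", "b1"]
numeric_hits = ["n9"]  # from the structured numeric index
print(rrf_fuse([bm25_hits, vector_hits]))
print(rrf_fuse([bm25_hits, vector_hits], pinned=numeric_hits, top_n=3))
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be put on a common scale. The reranker reorders the list afterwards, so the front position of pinned hits only protects them from truncation.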

Reranking

Send the top ~30 fused candidates through a cross-encoder reranker (query + passage pairs). Keep top 8–12 for the LLM context. Reranking is relatively expensive; that is fine—wrong numbers are more expensive.
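
The rerank step has a simple shape: score every (query, passage) pair, sort, keep the best few. In this sketch the scorer is injected as `score_fn`. The `overlap_score` function is a deliberately crude placeholder so the example runs without a model; a real deployment would pass in a cross-encoder's scoring call instead. Both names are hypothetical.

```python
def rerank(query, candidates, score_fn, keep=12):
    """Score each (query, passage) pair and keep the best `keep` candidates."""
    scored = [(score_fn(query, cand["text"]), cand) for cand in candidates]
    # sort() is stable, so ties keep their fused order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored[:keep]]

def overlap_score(query, passage):
    # Placeholder scorer: fraction of query words present in the passage.
    # This is NOT a cross-encoder; it only stands in for one.
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0

candidates = [
    {"id": "a", "text": "Management commentary on growth"},
    {"id": "b", "text": "Adjusted EBITDA Q2 FY2024 was 310"},
]
top = rerank("adjusted EBITDA Q2", candidates, overlap_score, keep=1)
print([cand["id"] for cand in top])
```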

Optional upgrade: late-interaction models (ColBERT-style) when you need token-level overlap on long table rows. I treat that as v2 after hybrid + cross-encoder is stable.

Step 7 — Query routing (simple rules beat one-size-fits-all)

Query signal	Retrieval bias
Contains digits, %, $, Q1–Q4, FY	Structured numeric index + table rows first
Exact GAAP line item (“deferred revenue”)	High BM25 weight + table index
Vague strategy question	Higher vector weight on narrative sections
Compare two periods	Retrieve both periods explicitly; two-pass retrieval if needed
Multi-doc portfolio	Per-doc caps so one 10-K does not crowd out others
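
The first four rows of the table can be expressed as a small rule-based router. This is a sketch under stated assumptions: the regexes, the weight values, and the `LINE_ITEMS` list are illustrative placeholders to tune on your own eval set, and per-document caps are left out.

```python
import re

NUMERIC_SIGNAL = re.compile(r"[\d%$]|\bFY\b", re.IGNORECASE)  # digits also cover Q1-Q4
COMPARE_SIGNAL = re.compile(r"\b(?:vs\.?|versus|compared?|change from)\b", re.IGNORECASE)
LINE_ITEMS = ("deferred revenue", "net income", "cost of revenue")  # hypothetical list

def route(query):
    weights = {"lexical": 1.0, "vector": 1.0, "numeric": 0.0, "tables": 0.5}
    plan = {"weights": weights, "pin_numeric_hits": False, "two_pass": False}
    has_numeric = bool(NUMERIC_SIGNAL.search(query))
    has_line_item = any(item in query.lower() for item in LINE_ITEMS)
    if has_numeric:
        # Digits, %, $, quarter or FY tokens: numeric index and table rows first.
        weights["numeric"] = 2.0
        weights["tables"] = 2.0
        plan["pin_numeric_hits"] = True
    if has_line_item:
        # Exact accounting label: lean on BM25 and the table index.
        weights["lexical"] = 2.0
        weights["tables"] = max(weights["tables"], 1.5)
    if not (has_numeric or has_line_item):
        # Vague question: favor vectors over narrative sections.
        weights["vector"] = 2.0
    if COMPARE_SIGNAL.search(query):
        # Period comparison: retrieve each period in its own pass.
        plan["two_pass"] = True
    return plan

print(route("adjusted EBITDA Q2 FY2024"))
print(route("What is management's strategy?"))
```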

Step 8 — Generation: citations and a number gate

The LLM is not trusted to invent figures. Contract I enforce:

Every numeric claim must cite [doc, page, snippet] from the context pack.
Prompt: “If the figure is not in context, say not found—do not estimate.”
Post-generation verifier (rules + small model): extract numbers from the draft answer; check each appears in cited passages (allow unit normalization); flag mismatches → block or rewrite.
For high-risk tenants, human approval on answers containing material numbers.
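
The rule-based half of the verifier can be sketched as below. It strips bracketed citations, extracts every number from the draft, and checks each against the numbers in the cited passages. Comparing as `Decimal` means `4,200` and `4200.0` count as the same figure. The function names are hypothetical, and unit or scale conversion (for example "4.2 billion" vs "4,200" in millions) is deliberately left out.

```python
import re
from decimal import Decimal

NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")
CITATION = re.compile(r"\[[^\]]*\]")  # e.g. "[acme-10q, p. 14]"

def extract_numbers(text):
    # Commas are thousands separators here; strip them before parsing.
    return {Decimal(match.replace(",", "")) for match in NUMBER.findall(text)}

def verify_numbers(draft, cited_passages):
    supported = set()
    for passage in cited_passages:
        supported |= extract_numbers(passage)
    # Page numbers inside citations are not numeric claims, so drop them first.
    claimed = extract_numbers(CITATION.sub("", draft))
    unsupported = sorted(claimed - supported)
    return {"ok": not unsupported, "unsupported": unsupported}

passages = ["Adjusted EBITDA 310 295 Q2 FY2024 Q2 FY2023 (in millions)"]
good = verify_numbers("Adjusted EBITDA was 310 in Q2 FY2024 [acme-10q, p. 14].", passages)
bad = verify_numbers("Adjusted EBITDA was 301 in Q2 FY2024.", passages)
print(good)
print(bad)
```

The second draft swaps two digits (301 for 310), which is exactly the failure this gate exists to block.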

# Simplified retrieval request (conceptual)
POST /retrieve
{
  "query": "adjusted EBITDA Q2 FY2024",
  "filters": { "doc_type": "10-Q", "issuer": "ACME" },
  "paths": ["lexical", "vector", "numeric", "tables"],
  "fusion": "rrf",
  "rerank_top_k": 12,
  "pin_numeric_hits": true
}

Step 9 — Napkin math (why hybrid is worth the ops cost)

Illustrative scale for a bank-internal corpus:

50k PDFs, ~80 pages average → 4M pages.
~6 logical blocks per page → ~24M index records (lexical + vector ids point to same blocks).
Peak QPS 20, p95 retrieval budget 800ms including rerank on cached warm indexes.
Structured numeric store: ~40M fact rows after table extraction—fits a column store with partition by doc_id.

Hybrid adds two indexes and a reranker GPU pool—but avoids the class of errors that have unbounded financial downside.
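
The arithmetic above, written out so the inputs are easy to change. Two lines go beyond the figures in the text and are assumptions: the raw vector storage estimate assumes 1,536-dimensional float32 embeddings with no compression, and the rerank load multiplies peak QPS by the ~30 candidates sent to the reranker.

```python
pdfs = 50_000
pages_per_pdf = 80
blocks_per_page = 6
fact_rows = 40_000_000
peak_qps = 20
rerank_candidates = 30

embedding_dims = 1_536  # assumption: matches the dimension mentioned for Path C
bytes_per_float = 4     # assumption: float32, uncompressed

pages = pdfs * pages_per_pdf              # total pages in the corpus
blocks = pages * blocks_per_page          # index records shared by lexical + vector
facts_per_page = fact_rows / pages        # average fact rows per page
vector_gb = blocks * embedding_dims * bytes_per_float / 1e9
rerank_pairs_per_sec = peak_qps * rerank_candidates

print(f"pages: {pages:,}")
print(f"blocks: {blocks:,}")
print(f"facts per page: {facts_per_page:.1f}")
print(f"raw vector storage: {vector_gb:.1f} GB")
print(f"rerank pairs/sec at peak: {rerank_pairs_per_sec}")
```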

Step 10 — How I would evaluate “did search find the number?”

Generic RAG benchmarks (nDCG on MS MARCO) lie to you here. Track finance-specific sets:

Metric	What it measures
Numeric recall@k	Given (doc, metric, period), is the correct value in top-k chunks?
Table row recall@k	Full row retrieved with headers intact
Citation accuracy	Does cited span actually contain the claimed number?
Answer number F1	F1 of figures extracted from the answer against the gold figures, with each figure also checked against the retrieved context
Abstention rate	Correct “not in documents” when evidence missing

Build 200–500 human-labeled questions from real filings where analysts already know the gold page and cell. Regression-test every ingest parser change on this set.
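
Numeric recall@k over such a labeled set can be computed roughly like this. It is a sketch, assuming each example records the gold document, page, and value as printed, and that `retrieve` is whatever function returns your ranked chunks. The field names and the canned retriever are hypothetical.

```python
def numeric_recall_at_k(examples, retrieve, k=10):
    """Share of questions whose gold value appears on the gold page within the top k chunks."""
    if not examples:
        return 0.0
    hits = 0
    for ex in examples:
        top_k = retrieve(ex["question"])[:k]
        found = any(
            chunk["doc_id"] == ex["doc_id"]
            and chunk["page"] == ex["gold_page"]
            # Whole-token match so "310" does not count as a hit inside "3100".
            and ex["gold_value"] in chunk["text"].split()
            for chunk in top_k
        )
        hits += found
    return hits / len(examples)

# Canned retriever standing in for the real pipeline.
canned = {
    "q1": [{"doc_id": "d1", "page": 5, "text": "Revenue 4,200 3,900"}],
    "q2": [{"doc_id": "d1", "page": 9, "text": "Net income 3100 2900"}],
}

def fake_retrieve(question):
    return canned[question]

examples = [
    {"question": "q1", "doc_id": "d1", "gold_page": 5, "gold_value": "4,200"},
    {"question": "q2", "doc_id": "d1", "gold_page": 9, "gold_value": "310"},
]
print(numeric_recall_at_k(examples, fake_retrieve, k=10))
```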

Step 11 — Failure modes and mitigations

Wrong scale (thousands vs millions) — footnote detector + normalization in structured index.
Right number, wrong year — period metadata required on every chunk; filter by fiscal_period in query router.
Scanned PDF / image tables — OCR quality gate; route low-confidence pages to human review queue.
Retrieval returns narrative, misses table — pin table rows when query has metric tokens; boost table index weight.
LLM swaps two similar figures — verifier compares digit strings to cited spans before publish.

Step 12 — Goals → knobs

Goal	Knob
Never miss the right figure	Hybrid fusion + pin structured hits + high recall@k before rerank
Precision in the context window	Cross-encoder rerank; fewer, richer chunks
Auditable answers	Page-level citations; logged retrieval snapshot per query
Lower latency	Warm lexical cache; smaller rerank pool; async pre-fetch for known dashboards
Lower cost	Cheaper reranker; cache frequent queries; skip vector path on pure lookup queries

Step 13 — Close the loop

On a whiteboard: draw four retrieval paths into fusion → rerank → cite → verify. Say why vector-only loses on “Q3 revenue $4.2B.”

Out loud: walk one query where BM25 finds the line item, vectors find the paraphrase, structured index locks the cell value.

In production: monitor numeric recall@k weekly; treat parser regressions as Sev-1 for finance tenants.

The one line to remember

For financial PDF RAG, the best search is hybrid retrieval plus structured numbers—lexical and tables for precision, vectors for paraphrase, reranking for the final context, and a verification gate so the model never free-types a figure that search did not support.