sharpbyte.dev

RAG for financial PDFs — which search method would you pick?

You are asked to design retrieval for financial PDFs where a single missed number in search can cost millions. This guide is written from an architect’s chair: what I would ship, why plain vector search is not enough, and how each layer works in plain language.

Design prompt

You’re building RAG for financial PDFs. Missing a single number from search can cost millions.

What is the best RAG search method you can use here?

What you should be able to do after reading:

Step 0 — How to work the problem

  1. Restate the failure mode. Wrong prose is bad; wrong revenue, EPS, basis points, or effective date is catastrophic.
  2. Separate “search” from “answer.” Search must maximize recall of the right evidence; the LLM only summarizes what retrieval returns.
  3. Assume PDFs are hostile input. Multi-column layout, tables split across pages, headers repeated, numbers in images.
  4. Pick retrieval for recall, then spend latency on reranking and verification—not on hoping one embedding model “understands finance.”

Step 1 — The short answer

Best practical choice: a hybrid retriever (lexical BM25 + dense vectors) fused with learned reranking, sitting on top of table-aware PDF parsing and a structured numeric/fact index (figures, units, periods, page anchors). Add a mandatory citation + number-consistency check before the user sees the answer.

If you only remember one sentence for a whiteboard: “Vectors find the neighborhood; lexical and structured lookup find the exact million dollars.”

Step 2 — Why this problem is harder than “normal” RAG

Financial PDFs punish generic RAG pipelines in predictable ways:

Step 3 — What I would not rely on alone

| Approach | Why it fails finance PDFs |
| --- | --- |
| Vector-only (single embedding index) | Weak exact match on digits, tickers, dates; table cells lose structure |
| Fixed-size text chunks (e.g. 500 tokens) | Splits tables and sentences; footnotes get dropped or attached to the wrong chunk |
| HyDE / query expansion only | Can hallucinate extra numbers into the query; hurts precision |
| One reranker, no lexical path | Still misses rare strings (“Series B warrant”, ISIN, GAAP line items) |
| LLM answers without citations | No audit trail; compliance and debugging impossible |

Step 4 — Target architecture (four retrieval paths, one fuse)

Think of retrieval as four specialists that vote, then a judge (reranker) picks the final evidence pack:

flowchart TB
  Q[User query] --> ROUTE[Query router]
  ROUTE --> LEX[BM25 / lexical index]
  ROUTE --> VEC[Dense vector index]
  ROUTE --> NUM[Structured numeric index]
  ROUTE --> TBL[Table row index]
  PDF[PDF ingest] --> PARSE[Layout + table parser]
  PARSE --> LEX
  PARSE --> VEC
  PARSE --> NUM
  PARSE --> TBL
  LEX --> FUSE[RRF or weighted fusion]
  VEC --> FUSE
  NUM --> FUSE
  TBL --> FUSE
  FUSE --> RERANK[Cross-encoder rerank]
  RERANK --> PACK[Context pack + citations]
  PACK --> LLM[LLM generate]
  PACK --> VERIFY[Number verify gate]
  LLM --> VERIFY
  VERIFY --> OUT[Answer to user]
    

Path A — Lexical search (BM25 / OpenSearch / Elasticsearch)

What it does: Matches exact tokens such as EBITDA, Q2 2024, 14.3%, company names, and accounting phrases.

Why you need it: When the user quotes a label or period, lexical search is still the highest-recall way to land on the right paragraph or table row.

How I implement it: Index each logical block (paragraph, table row, footnote) with rich metadata: doc_id, page, section_title, fiscal_period, statement_type (income, balance, cash flow).
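As a rough illustration of block-level lexical indexing, here is a minimal in-memory BM25 sketch. It is not OpenSearch or Elasticsearch; the `BM25Index` class, the `tokenize` helper, and the sample blocks are all hypothetical, and the tokenizer is an assumption chosen so that a figure like `14.3%` survives as one token.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Keep decimals and a trailing % together so "14.3%" stays one token.
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)?%?", text.lower())


class BM25Index:
    """Tiny in-memory BM25 over logical blocks (illustrative only)."""

    def __init__(self, blocks: list[dict], k1: float = 1.5, b: float = 0.75):
        self.blocks = blocks
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(blk["text"])) for blk in blocks]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / max(len(blocks), 1)
        # Document frequency: in how many blocks each term appears.
        self.df = Counter(term for tf in self.tfs for term in tf)

    def search(self, query: str, k: int = 50) -> list[dict]:
        n = len(self.blocks)
        terms = tokenize(query)
        scored = []
        for i, tf in enumerate(self.tfs):
            score = 0.0
            for term in terms:
                if term not in tf:
                    continue
                idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
                norm = tf[term] + self.k1 * (
                    1 - self.b + self.b * self.lengths[i] / self.avg_len
                )
                score += idf * tf[term] * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, self.blocks[i]))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [blk for _, blk in scored[:k]]


# Made-up blocks carrying the metadata fields named above.
blocks = [
    {"doc_id": "acme-10k", "page": 41, "statement_type": "income",
     "text": "Adjusted EBITDA margin was 14.3% in Q2 2024."},
    {"doc_id": "acme-10k", "page": 12, "statement_type": "narrative",
     "text": "Management expects margin expansion next year."},
]
index = BM25Index(blocks)
hits = index.search("EBITDA 14.3%")
```

Because every hit is the original block dict, the page and statement type come back with the match and can feed filters or citations later.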

Path B — Dense vectors

What it does: Finds passages that are semantically close when wording varies, such as “cost of revenue” vs “COGS” or “net loss” vs “negative earnings.”

Why you still need it: Users do not always use the same words as the PDF. Vectors recover paraphrases that lexical search misses.

How I implement it: Embed at the same logical block granularity as BM25 (not random token windows). Store the same metadata on vector ids so both indexes align. Use a finance-tuned or general-purpose embedding model; the bigger win is chunking, not chasing the latest embedding name.
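To make the "same ids, same metadata" point concrete, here is a small sketch of a vector index keyed by block id. The `VectorIndex` class is hypothetical, and the three-dimensional vectors are hand-made stand-ins for real embeddings; an actual system would call an embedding model and a vector store instead.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorIndex:
    """Brute-force cosine search keyed by the same block ids as the lexical index."""

    def __init__(self) -> None:
        self.rows: dict[str, tuple[list[float], dict]] = {}

    def add(self, block_id: str, vector: list[float], metadata: dict) -> None:
        self.rows[block_id] = (vector, metadata)

    def search(self, query_vec: list[float], k: int = 50) -> list[tuple[float, str, dict]]:
        scored = [
            (cosine(query_vec, vec), block_id, meta)
            for block_id, (vec, meta) in self.rows.items()
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


# Toy vectors: pretend the first axis means "cost of goods", the second "strategy".
vec_index = VectorIndex()
vec_index.add("acme-10k:p41:b3", [0.9, 0.1, 0.0], {"page": 41, "statement_type": "income"})
vec_index.add("acme-10k:p12:b1", [0.1, 0.9, 0.2], {"page": 12, "statement_type": "narrative"})

# A query about "COGS" would embed near the first axis in this toy setup.
top = vec_index.search([0.8, 0.2, 0.0], k=1)
```

Sharing block ids between the two indexes is what lets the fusion step treat a lexical hit and a vector hit on the same block as one candidate.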

Path C — Structured numeric / fact index

What it does: Treats numbers as first-class records: value, unit, currency, scale (thousands/millions), metric name, period, document, page, surrounding label text.

Why it matters: The query “revenue Q3 2024” becomes a lookup + filter problem, not similarity search in 1,536 dimensions.
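One possible shape for such a record, sketched with a dataclass and an exact-match lookup. The `NumericFact` fields mirror the list above; the `SCALE` table, the `lookup` helper, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal

# Multipliers for the reporting scale stated in the filing (assumed labels).
SCALE = {"units": 1, "thousands": 1_000, "millions": 1_000_000, "billions": 1_000_000_000}


@dataclass(frozen=True)
class NumericFact:
    metric: str      # normalized metric name, e.g. "revenue"
    period: str      # e.g. "Q3 2024"
    value: Decimal   # the number exactly as printed
    scale: str       # key into SCALE
    currency: str
    doc_id: str
    page: int
    label: str       # surrounding label text from the PDF

    @property
    def absolute(self) -> Decimal:
        # Printed value multiplied out to plain currency units.
        return self.value * SCALE[self.scale]


def lookup(facts, metric, period, doc_id=None):
    """Exact filter on metric and period, optionally restricted to one document."""
    return [
        f for f in facts
        if f.metric == metric and f.period == period
        and (doc_id is None or f.doc_id == doc_id)
    ]


# Made-up facts for illustration.
facts = [
    NumericFact("revenue", "Q3 2024", Decimal("4200"), "millions", "USD",
                "acme-10q", 41, "Total revenue"),
    NumericFact("revenue", "Q3 2023", Decimal("3900"), "millions", "USD",
                "acme-10q", 41, "Total revenue"),
]
match = lookup(facts, "revenue", "Q3 2024")
```

Using `Decimal` rather than `float` keeps the stored figure identical to what the filing printed, which matters when the value is later compared against an answer.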

How I implement it:

Path D — Table row index

Tables deserve their own treatment. Serialize each row as: header row + row label + cell values + page + table caption. Index that string in both BM25 and vectors, but also keep the raw row JSON for the LLM so columns stay aligned.
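The row serialization described above could look roughly like this. The function name `serialize_row`, the separator characters, and the sample table are assumptions; the point is only that the indexed string carries headers and caption while the raw JSON keeps columns aligned.

```python
import json


def serialize_row(caption: str, headers: list[str], row_label: str,
                  cells: list[str], page: int) -> dict:
    # Pair each cell with its column header so the string is self-describing.
    pairs = "; ".join(f"{header}: {cell}" for header, cell in zip(headers, cells))
    index_text = f"{caption} | {row_label} | {pairs} | page {page}"
    raw = {
        "caption": caption,
        "headers": headers,
        "row_label": row_label,
        "cells": cells,
        "page": page,
    }
    # index_text goes to BM25 and the embedder; raw_json goes to the LLM context.
    return {"index_text": index_text, "raw_json": json.dumps(raw)}


# Made-up table row for illustration.
row = serialize_row(
    caption="Consolidated Statements of Operations",
    headers=["Q3 2023", "Q3 2024"],
    row_label="Total revenue",
    cells=["3,900", "4,200"],
    page=41,
)
```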

Step 5 — Ingestion: PDF parsing is part of search quality

Search cannot fix garbage ingest. Minimum bar I would enforce:

  1. Layout model (not plain text dump): detect reading order, columns, tables, headers.
  2. Table extraction to structured rows + CSV-like backup per table.
  3. Footnote linking: attach scale notes (“except per share”) to every numeric block on that page.
  4. Provenance always on: every chunk stores page, bbox, doc_version, hash for citations.
  5. Human QA sample: 1–2% of pages manually checked; track numeric extraction error rate.
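Items 3 and 4 can be sketched as a chunk record that always carries provenance, plus a pass that copies page-level scale notes onto numeric blocks. The `Chunk` class, the `attach_scale_notes` helper, the "contains a digit" test for a numeric block, and the sample data are all assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    doc_id: str
    doc_version: str
    page: int
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 on the page
    scale_notes: list[str] = field(default_factory=list)
    content_hash: str = field(init=False, default="")

    def __post_init__(self) -> None:
        # Hash ties a citation to an exact text + version, so stale chunks are detectable.
        payload = f"{self.doc_id}|{self.doc_version}|{self.page}|{self.text}"
        self.content_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()


def attach_scale_notes(chunks: list[Chunk], notes_by_page: dict[int, list[str]]) -> list[Chunk]:
    """Copy each page's scale notes onto every chunk on that page that contains a digit."""
    for chunk in chunks:
        if any(ch.isdigit() for ch in chunk.text):
            chunk.scale_notes = list(notes_by_page.get(chunk.page, []))
    return chunks


# Made-up chunks for illustration.
chunks = [
    Chunk("Total revenue 4,200", "acme-10k", "v1", 41, (72.0, 500.0, 540.0, 520.0)),
    Chunk("See accompanying notes.", "acme-10k", "v1", 41, (72.0, 90.0, 540.0, 110.0)),
]
attach_scale_notes(chunks, {41: ["in millions, except per share"]})
```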

Operational reality: OCR + layout mistakes are your top source of “missing millions.” Budget engineering time here before buying a fancier vector database.

Step 6 — Fusion and reranking (where precision is won)

Fusion

Run BM25 and vector search in parallel (k≈50 each). Merge with reciprocal rank fusion (RRF) or weighted scores. Always inject structured numeric hits into the merged list even if they rank low semantically—recall first for figures.
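A minimal sketch of reciprocal rank fusion with pinned numeric hits. The constant `k=60` is the value commonly used for RRF; the `fuse` function, the `top_n` cut-off, and the block ids are assumptions. Each ranking contributes `1 / (k + rank)` per block, and pinned ids are guaranteed a slot in the output even when no ranked list returned them.

```python
def fuse(ranked_lists: list[list[str]], pinned: list[str] = (),
         k: int = 60, top_n: int = 30) -> list[str]:
    """Reciprocal rank fusion over block ids, with pinned ids always kept."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, block_id in enumerate(ranking, start=1):
            scores[block_id] = scores.get(block_id, 0.0) + 1.0 / (k + rank)

    fused = sorted(scores, key=scores.get, reverse=True)
    pinned_ids = list(dict.fromkeys(pinned))  # de-duplicate, keep order

    # Reserve room for pinned ids, then fill the rest by fused score.
    room = max(top_n - len(pinned_ids), 0)
    others = [bid for bid in fused if bid not in pinned_ids][:room]
    selected = set(others) | set(pinned_ids)

    ordered = [bid for bid in fused if bid in selected]
    # Pinned ids that no ranked list returned are appended at the end.
    ordered += [bid for bid in pinned_ids if bid not in scores]
    return ordered


# Made-up rankings: ids are block ids shared across indexes.
bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "a", "d"]
numeric_hits = ["n1"]
candidates = fuse([bm25_hits, dense_hits], pinned=numeric_hits, top_n=3)
```

Here `a` and `b` score highest because both lists rank them near the top, and `n1` takes the third slot despite appearing in neither list. Order among pinned items matters little at this stage since the reranker re-scores everything it receives.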

Reranking

Send the top ~30 fused candidates through a cross-encoder reranker (query + passage pairs). Keep top 8–12 for the LLM context. Reranking is relatively expensive; that is fine—wrong numbers are more expensive.
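The pool-then-trim flow can be sketched independently of any particular model. In the code below, `score_pair` is a placeholder for a real cross-encoder call; `toy_overlap_score` is a deliberately crude stand-in so the example runs without a model, and the candidate passages are made up.

```python
from typing import Callable


def rerank(query: str, candidates: list[dict],
           score_pair: Callable[[str, str], float],
           pool: int = 30, keep: int = 12) -> list[dict]:
    """Score the top `pool` fused candidates against the query, keep the best `keep`."""
    scored = [(score_pair(query, cand["text"]), cand) for cand in candidates[:pool]]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored[:keep]]


def toy_overlap_score(query: str, passage: str) -> float:
    # Stand-in for a cross-encoder: fraction of query words found in the passage.
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)


# Made-up fused candidates, in fused order.
fused_candidates = [
    {"id": "a", "text": "management discussion of strategy"},
    {"id": "b", "text": "adjusted ebitda q2 fy2024 was 310"},
]
context = rerank("adjusted EBITDA Q2 FY2024", fused_candidates,
                 toy_overlap_score, pool=30, keep=1)
```

Passing the scorer in as a function keeps the pool size and cut-off testable without loading a model.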

Optional upgrade: late-interaction models (ColBERT-style) when you need token-level overlap on long table rows. I treat that as v2 after hybrid + cross-encoder is stable.

Step 7 — Query routing (simple rules beat one-size-fits-all)

| Query signal | Retrieval bias |
| --- | --- |
| Contains digits, %, $, Q1–Q4, FY | Structured numeric index + table rows first |
| Exact GAAP line item (“deferred revenue”) | High BM25 weight + table index |
| Vague strategy question | Higher vector weight on narrative sections |
| Compare two periods | Retrieve both periods explicitly; two-pass retrieval if needed |
| Multi-doc portfolio | Per-doc caps so one 10-K does not crowd out others |
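The first four rows of the routing table translate into a few regex rules. This is a sketch only: the weight values, the `GAAP_LINE_ITEMS` set, and the comparison keywords are assumptions, and a production router would need a far larger vocabulary.

```python
import re

# Digits, %, $, or a standalone "FY" suggest a figure lookup (Q1..Q4 contain a digit).
NUMERIC_SIGNAL = re.compile(r"[\d%$]|\bFY\b")
COMPARE_SIGNAL = re.compile(r"\b(compare|versus|vs|yoy)\b")
# Hypothetical, deliberately tiny list of exact line items.
GAAP_LINE_ITEMS = {"deferred revenue", "cost of revenue", "net income"}


def route(query: str) -> dict:
    """Return per-path weights plus a flag for two-pass period retrieval."""
    plan = {"lexical": 1.0, "vector": 1.0, "numeric": 0.0, "tables": 0.5, "two_pass": False}
    lowered = query.lower()

    has_numeric = bool(NUMERIC_SIGNAL.search(query))
    has_line_item = any(item in lowered for item in GAAP_LINE_ITEMS)

    if has_numeric:
        plan.update(numeric=2.0, tables=2.0)
    if has_line_item:
        plan.update(lexical=2.0, tables=max(plan["tables"], 1.5))
    if COMPARE_SIGNAL.search(lowered):
        plan["two_pass"] = True
    if not has_numeric and not has_line_item:
        # No figure or line-item signal: treat as a narrative question.
        plan["vector"] = 2.0
    return plan


lookup_plan = route("revenue Q3 2024")
vague_plan = route("What is management's long-term strategy?")
compare_plan = route("compare deferred revenue FY2023 vs FY2024")
```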

Step 8 — Generation: citations and a number gate

The LLM is not trusted to invent figures. Contract I enforce:

# Simplified retrieval request (conceptual)
POST /retrieve
{
  "query": "adjusted EBITDA Q2 FY2024",
  "filters": { "doc_type": "10-K", "issuer": "ACME" },
  "paths": ["lexical", "vector", "numeric", "tables"],
  "fusion": "rrf",
  "rerank_top_k": 12,
  "pin_numeric_hits": true
}
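For the number gate itself, one simple form is to reject any answer containing a figure that does not appear in the retrieved context. The sketch below is an assumption about how such a check might work, not a full verifier: it compares printed digits only, so it does not reconcile scale or unit conversions (for example "$4.2B" against "4,200" in millions), and `verify_numbers` and the sample strings are hypothetical.

```python
import re
from decimal import Decimal

# A digit run with optional thousands separators and decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> set[Decimal]:
    # Strip thousands separators so "4,200" and "4200" compare equal.
    return {Decimal(m.group().replace(",", "")) for m in NUMBER.finditer(text)}


def verify_numbers(answer: str, context_passages: list[str]) -> tuple[bool, list[Decimal]]:
    """Return (ok, unsupported): ok is False if the answer has figures absent from context."""
    supported: set[Decimal] = set()
    for passage in context_passages:
        supported |= extract_numbers(passage)
    unsupported = sorted(extract_numbers(answer) - supported)
    return (not unsupported, unsupported)


# Made-up context and answers for illustration.
context_passages = ["Total revenue | Q3 2024: 4,200 | in millions | page 41"]
ok, missing = verify_numbers("Revenue was $4,200 million in Q3 2024.", context_passages)
bad_ok, bad_missing = verify_numbers("Revenue was $4,300 million in Q3 2024.", context_passages)
```

When the gate fails, the unsupported figures can be logged alongside the retrieval snapshot so the miss is debuggable.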

Step 9 — Napkin math (why hybrid is worth the ops cost)

Illustrative scale for a bank-internal corpus:

Hybrid adds two indexes and a reranker GPU pool—but avoids the class of errors that have unbounded financial downside.

Step 10 — How I would evaluate “did search find the number?”

Generic RAG benchmarks (nDCG on MS MARCO) lie to you here. Track finance-specific sets:

| Metric | What it measures |
| --- | --- |
| Numeric recall@k | Given (doc, metric, period), is the correct value in top-k chunks? |
| Table row recall@k | Full row retrieved with headers intact |
| Citation accuracy | Does cited span actually contain the claimed number? |
| Answer number F1 | Extracted figures in answer vs gold vs context |
| Abstention rate | Correct “not in documents” when evidence missing |

Build 200–500 human-labeled questions from real filings where analysts already know the gold page and cell. Regression-test every ingest parser change on this set.
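Numeric recall@k over such a labeled set can be computed with a few lines. In this sketch the gold-record fields (`question`, `gold_value`, `gold_page`), the substring match on the printed value, and the `fake_retrieve` stub are all assumptions; a real harness would call the actual retriever and might match on cell coordinates instead.

```python
from typing import Callable


def numeric_recall_at_k(gold: list[dict],
                        retrieve: Callable[[str], list[dict]],
                        k: int = 10) -> float:
    """Fraction of questions whose gold value appears on the gold page within top-k chunks."""
    if not gold:
        return 0.0
    hits = 0
    for item in gold:
        top_chunks = retrieve(item["question"])[:k]
        if any(
            chunk["page"] == item["gold_page"] and item["gold_value"] in chunk["text"]
            for chunk in top_chunks
        ):
            hits += 1
    return hits / len(gold)


# Made-up gold set and a stub retriever that always returns the same chunk.
gold_set = [
    {"question": "revenue Q3 2024", "gold_value": "4,200", "gold_page": 41},
    {"question": "diluted EPS FY2024", "gold_value": "1.23", "gold_page": 43},
]


def fake_retrieve(question: str) -> list[dict]:
    return [{"page": 41, "text": "Total revenue 4,200"}]


recall = numeric_recall_at_k(gold_set, fake_retrieve, k=10)
```

Running this on every parser change gives a single number to compare before and after the change.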

Step 11 — Failure modes and mitigations

Step 12 — Goals → knobs

| Goal | Knob |
| --- | --- |
| Never miss the right figure | Hybrid fusion + pin structured hits + high recall@k before rerank |
| Precision in the context window | Cross-encoder rerank; fewer, richer chunks |
| Auditable answers | Page-level citations; logged retrieval snapshot per query |
| Lower latency | Warm lexical cache; smaller rerank pool; async pre-fetch for known dashboards |
| Lower cost | Cheaper reranker; cache frequent queries; skip vector path on pure lookup queries |

Step 13 — Close the loop

On a whiteboard: draw four retrieval paths into fusion → rerank → cite → verify. Say why vector-only loses on “Q3 revenue $4.2B.”

Out loud: walk one query where BM25 finds the line item, vectors find the paraphrase, structured index locks the cell value.

In production: monitor numeric recall@k weekly; treat parser regressions as Sev-1 for finance tenants.

The one line to remember

For financial PDF RAG, the best search is hybrid retrieval plus structured numbers—lexical and tables for precision, vectors for paraphrase, reranking for the final context, and a verification gate so the model never free-types a figure that search did not support.