Hybrid BM25 + Semantic retrieval pipeline over financial filings, with citation-aware generation and production observability.
| Resource | URL |
|---|---|
| Gradio UI (live demo) | https://huggingface.co/spaces/Spectraa28/financial-rag-api |

| Metric | Value |
|---|---|
| RAGAs Faithfulness | 0.82 |
| Context Recall | 0.0 → 1.0 (post-Docling fix) |
| Cold Start Latency | 46s → 3s (persistent ChromaDB) |
| Total Chunks | 453+ |
| Retrieval Alpha (BM25:Dense) | 0.3 : 0.7 |

Evaluated on 30 manually curated financial QA pairs across the Apple and Microsoft 10-K filings. The cold-start fix pre-embeds the documents and persists ChromaDB to disk, so the ingestion pipeline is skipped on subsequent startups.
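The skip-on-restart behaviour can be sketched roughly as follows. This is a minimal, store-agnostic illustration rather than the project's actual code: `ensure_ingested`, `count_chunks`, and `run_ingestion` are hypothetical names, and with a persistent ChromaDB collection the count would presumably come from `collection.count()`.

```python
from typing import Callable


def ensure_ingested(count_chunks: Callable[[], int],
                    run_ingestion: Callable[[], None]) -> bool:
    """Run the expensive ingestion step only when the persisted store is empty.

    Returns True if ingestion ran, False if it was skipped.
    Both callables are hypothetical hooks supplied by the caller.
    """
    if count_chunks() > 0:
        # Chunks were embedded and persisted on an earlier run; reuse them.
        return False
    run_ingestion()
    return True
```

On the first startup the store is empty, so parsing and embedding run once. On later startups the count is non-zero and the function returns immediately, which is the kind of check that would account for the drop in cold-start latency.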
User Query
│
▼
Query Expansion (vocabulary bridging for financial terms)
│
▼
Hybrid Retrieval
├── BM25 (sparse, keyword precision) weight: 0.3
└── MiniLM Dense Search (semantic) weight: 0.7
│
▼
Score Fusion → Top-3 Chunks with Citations
│
▼
Citation-Aware Prompt Engineering
│
▼
Gemini 2.5 Flash Generation (temperature=0.0)
│
▼
Answer + Source Citations + Latency Telemetry
Financial PDFs are one of the hardest RAG targets — tables with misaligned headers, XBRL-encoded values, multi-year comparative data, and dense numerical prose. This project was built specifically to handle these challenges:
- IBM Docling for layout-aware parsing — preserves table structure and section hierarchy
- Query expansion to bridge vocabulary gap between natural language and financial terminology
- BM25 weighted at 0.3 to preserve exact keyword matching for ticker symbols, line items, and financial metrics
- Citation path tracking through document heading hierarchy for auditability
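
As an illustration of the query expansion idea, here is a minimal dictionary-based sketch. The synonym map and the `expand_query` function are assumptions made for this example, not the contents of `retrieval.py`. The point is only that filings often use different wording than users do (for example "net sales" rather than "revenue").

```python
import re
from typing import Dict, List

# Illustrative synonym map (an assumption, not the project's real vocabulary).
FINANCIAL_SYNONYMS: Dict[str, List[str]] = {
    "revenue": ["net sales"],
    "profit": ["net income"],
    "r&d": ["research and development"],
    "cash position": ["cash and cash equivalents"],
}


def expand_query(query: str,
                 synonyms: Dict[str, List[str]] = FINANCIAL_SYNONYMS) -> str:
    """Append filing-style terms for any everyday terms found in the query."""
    lowered = query.lower()
    extra: List[str] = []
    for term, alternatives in synonyms.items():
        # Whole-word match so "profit" does not fire on "profitability".
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            for alt in alternatives:
                if alt not in lowered and alt not in extra:
                    extra.append(alt)
    if not extra:
        return query
    return query + " " + " ".join(extra)
```

The expanded string can be passed to both the BM25 and the dense retriever. The original wording is kept intact, so exact-match terms in the user's question still count.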
| Component | Technology |
|---|---|
| Document Parsing | IBM Docling + HybridChunker |
| Embeddings | all-MiniLM-L6-v2 (sentence-transformers) |
| Vector Store | ChromaDB (persistent, cosine similarity) |
| Sparse Retrieval | BM25Okapi (rank-bm25) |
| LLM | Google Gemini 2.5 Flash |
| API Framework | FastAPI + Uvicorn |
| UI | Gradio |
| Observability | MLflow + Prometheus |
| Deployment | HuggingFace Spaces |
financial-rag-api/
├── app.py → Gradio UI entry point
├── ingestion.py → Docling parsing, chunking, ChromaDB storage
├── retrieval.py → Hybrid search, query expansion, citation formatting
├── pipeline.py → Generation, MLflow tracking, Prometheus metrics
├── main.py → FastAPI app with /query, /health, /metrics endpoints
├── chroma_db/ → Pre-embedded persistent vector store
└── requirements.txt
# Clone the repo
git clone https://github.com/Spectraa28/financial-rag-api
cd financial-rag-api
# Install dependencies
pip install -r requirements.txt
# Set environment variables
cp .env.example .env
# Add your GEMINI_API_KEY to .env
# Run Gradio UI
python app.py
# Or run FastAPI server
uvicorn main:app --host 0.0.0.0 --port 8000

App runs on http://localhost:7860 (Gradio) or http://localhost:8000 (FastAPI)
Simple retrieval:
What was Apple's total revenue in FY2023?
What is Apple's cash position as of FY2023?
What are Apple's primary business segments?
Complex analysis:
Compare Apple and Microsoft's operating margins in FY2023
What risks did Apple identify in their 2023 annual report?
How did Apple's R&D expenses trend in FY2023?
Cross-company (Phase 1 complete):
Which company had higher net income in FY2023 — Apple or Microsoft?
Compare the debt structures of Apple and Microsoft in FY2023
The biggest bottleneck wasn't retrieval quality — it was ingestion quality. My first implementation used manual BeautifulSoup table parsing. Tables had column misalignment, garbage values from XBRL spacer cells, and broken row boundaries. Context Recall was 0.0 because the embedded text was structurally corrupted before it ever reached the vector store.
Switching to IBM Docling fixed this entirely — it understands document layout, preserves table structure, and tracks heading hierarchy for citation paths. Context Recall jumped from 0.0 to 1.0.
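
To make the citation path idea concrete, here is a small sketch of turning a chunk's heading trail into a citation string. The function name and its inputs (`source`, `headings`, `page`) are hypothetical. Docling's actual chunk metadata schema is not reproduced here, so treat the field names as assumptions.

```python
from typing import Optional, Sequence


def format_citation(source: str, headings: Sequence[str],
                    page: Optional[int] = None) -> str:
    """Build a readable citation from a chunk's heading hierarchy."""
    # Join non-empty headings from outermost to innermost.
    trail = " > ".join(h.strip() for h in headings if h and h.strip())
    citation = f"[{source}] {trail}" if trail else f"[{source}]"
    if page is not None:
        citation += f" (p. {page})"
    return citation
```

Attaching a string like this to each retrieved chunk lets the generation prompt ask the model to cite the section it drew from. That keeps answers auditable against the filing.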
The hybrid search alpha of 0.7 dense / 0.3 sparse came from iterative testing. Pure semantic search missed exact financial terms like specific line items and ticker symbols. Pure BM25 missed semantically equivalent phrasings. 0.7/0.3 gave the best balance across both simple lookup and complex analytical queries.
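
A weighted fusion of this kind can be sketched as below. The min-max normalisation step is an assumption on my part: the text does not say how the BM25 and cosine scores are put on a common scale. `fuse_scores` is a hypothetical name, and `alpha` is the dense weight, matching the 0.7/0.3 split.

```python
from typing import List, Sequence, Tuple


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(bm25_scores: Sequence[float],
                dense_scores: Sequence[float],
                alpha: float = 0.7,
                top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (chunk_index, fused_score) for the top_k chunks.

    fused = alpha * dense + (1 - alpha) * bm25, both min-max normalised.
    """
    if len(bm25_scores) != len(dense_scores):
        raise ValueError("score lists must cover the same chunks")
    sparse = min_max(bm25_scores)
    dense = min_max(dense_scores)
    fused = [alpha * d + (1.0 - alpha) * s for d, s in zip(dense, sparse)]
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return [(i, fused[i]) for i in ranked[:top_k]]
```

Normalising first matters because raw BM25 scores are unbounded while cosine similarities are not. Without a shared scale, the 0.3 weight on BM25 could still dominate the sum.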
Evaluated using RAGAs on 30 manually curated QA pairs:
| Metric | Score |
|---|---|
| Faithfulness | 0.82 |
| Context Recall | 1.0 |
| Context Precision | 0.27 |
| Answer Relevancy | in progress |

Context Precision of 0.27 reflects a known limitation: fixed top-k retrieval pulls in loosely related chunks alongside the relevant ones. The natural fix is a cross-encoder reranker, which is planned for Phase 2.
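
The planned second pass might look roughly like this: retrieve a wider candidate set, score each (query, chunk) pair jointly, and keep only the best few. This is a sketch under stated assumptions, not the Phase 2 design. `rerank` and `token_overlap` are hypothetical names, and the toy overlap scorer only stands in for a real cross-encoder model (for instance one loaded through the `CrossEncoder` class in sentence-transformers).

```python
from typing import Callable, List, Sequence, Tuple


def rerank(query: str,
           chunks: Sequence[str],
           score_pair: Callable[[str, str], float],
           keep: int = 3) -> List[Tuple[str, float]]:
    """Score every (query, chunk) pair and keep the highest-scoring chunks."""
    scored = [(chunk, score_pair(query, chunk)) for chunk in chunks]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:keep]


def token_overlap(query: str, chunk: str) -> float:
    """Toy stand-in scorer: fraction of query tokens present in the chunk.

    A real cross-encoder would replace this function entirely.
    """
    q_tokens = set(query.lower().split())
    c_tokens = set(chunk.lower().split())
    return len(q_tokens & c_tokens) / len(q_tokens) if q_tokens else 0.0
```

Because the scorer sees the query and chunk together, loosely related chunks that slipped through first-stage retrieval tend to fall to the bottom and get dropped. That is the mechanism expected to raise context precision.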
Phase 1 (complete):
- IBM Docling ingestion with layout-aware chunking
- Hybrid BM25 + dense retrieval (alpha=0.7 dense weight, 0.3 BM25)
- Query expansion for financial vocabulary
- Citation-aware prompt engineering
- MLflow + Prometheus observability
- Apple and Microsoft 10-K FY2023 coverage
- Gradio UI + FastAPI deployment

Phase 2 (planned):
- PDF upload endpoint: ingest any 10-K dynamically
- Query decomposition agent — LLM breaks complex cross-company questions into per-company sub-queries
- Cross-encoder reranker — second pass on top-k results to fix context precision
- Cross-company synthesizer — unified answer with citations across multiple filings
- Company registry — stateful tracking of loaded documents
- Async ingestion with status polling
Built by Sonu Verma
Part of a 126-day self-directed ML Engineering program. Building in public.