🌐 The application is deployed and live
Note: The initial load of the web app may take 1-2 minutes. Once loaded, refresh the page to ensure all features work correctly.
Tip: For the best experience, refer to the Usage Guide section below to learn how to navigate and use the web app effectively.
This project is a full-stack Retrieval-Augmented Generation (RAG) application that enables users to upload and query documents, retrieve semantically relevant context, and generate grounded answers using Large Language Models. It is built with FastAPI, PostgreSQL + pgvector, FAISS, Google Gemini Embeddings, Cohere Reranker, and LLMs served by Groq and Gemini.
The application combines Semantic Search, Vector Databases, Cross-Encoder Reranking, LLM-Based Answer Generation, and a Modern React Frontend to deliver accurate and explainable answers from custom document collections.
- Supports ingestion of PDF, Markdown, and Plain Text documents
- Each document type has a dedicated cleaning pipeline for high-quality preprocessing
- PDF Cleaning: Removes headers/footers, website artifacts, OCR-style word fragmentation, image placeholders, and normalizes whitespace
- Markdown Cleaning: Removes navigation sections and Mermaid diagrams, converts markdown links to clean text, preserves document hierarchy, and repairs broken tables
- Text Cleaning: Removes citation markers, repairs paragraph flow, normalizes whitespace, and handles formatting artifacts
- Uses Markdown header-aware splitting combined with recursive character chunking
- Configurable chunk size and overlap with section preservation
- Deterministic chunk IDs with rich metadata stored per chunk including source file, file type, page number, section name, chunk index, and parent document ID
- Uses Google Gemini Embeddings (gemini-embedding-001) for vector embeddings
- Supports document and query embeddings with batch processing
- Includes free-tier rate-limit protection with automatic throttling
- Persistent knowledge base backed by PostgreSQL with the pgvector extension
- Stores chunk text, embeddings, and source metadata
- Supports similarity search, metadata filtering, and persistent storage
- Two-stage retrieval: pgvector similarity search followed by Cohere Cross-Encoder Reranking (rerank-v3.5)
- Reranking improves precision, reduces irrelevant chunks, and raises overall answer quality
- Integrated Llama 3.3 70B Versatile via Groq for fast, grounded answer generation
- The model is restricted to the retrieved context and cites its sources, which reduces hallucinations and keeps responses explainable
- Separate temperature settings for conversational replies vs. retrieval-grounded answers
- Every user query is classified as CONVERSATIONAL or RETRIEVAL before any processing
- Conversational queries (greetings, capability questions, small talk) are handled directly by the LLM — no retrieval triggered
- Retrieval queries are routed through the full RAG pipeline
- Uses Gemini Flash (gemini-2.0-flash-lite) as a lightweight, low-latency classifier, preserving Groq token quota for generation
- Implements sliding window memory storing the last 10 messages (user + assistant) per session
- Memory is scoped per session ID — each browser tab maintains isolated conversation history
- History is injected into both conversational and retrieval responses, enabling natural follow-up questions
- Supports coreference resolution — vague queries like "what does it mean?" or "tell me more" are rewritten into self-contained search queries before retrieval
- Users can upload documents at runtime; these are chunked, embedded, and indexed in FAISS without modifying the global database
- Provides temporary workspaces with fast retrieval and session isolation
- Built with React, TypeScript, Vite, and Tailwind CSS
- Features a chat interface, source chunk viewer, session uploads, responsive design, and real-time API integration
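
The sliding-window memory described in the bullets above (last 10 messages, scoped per session ID) can be sketched with a bounded deque. This is a minimal illustration, not the project's `memory.py`: the `SessionMemory` class and its method names are hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SIZE = 10  # last 10 messages (user + assistant), as described above


class SessionMemory:
    """Hypothetical in-memory registry: one bounded history per session ID."""

    def __init__(self, window_size: int = WINDOW_SIZE) -> None:
        self._histories = defaultdict(lambda: deque(maxlen=window_size))

    def add(self, session_id: str, role: str, content: str) -> None:
        # deque(maxlen=...) drops the oldest message once the window is full.
        self._histories[session_id].append({"role": role, "content": content})

    def history(self, session_id: str) -> list[dict]:
        # .get() avoids creating an empty entry for an unknown session.
        return list(self._histories.get(session_id, ()))


# Two sessions (for example, two browser tabs) keep separate histories.
memory = SessionMemory()
for i in range(12):
    memory.add("tab-a", "user", f"message {i}")
memory.add("tab-b", "user", "hello")
```

After twelve writes, `tab-a` holds only its ten most recent messages, and `tab-b` is unaffected.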
- Multi-Format Ingestion: Upload and query PDF, Markdown, and Text documents seamlessly
- Intelligent Preprocessing: Dedicated cleaning pipelines per document type for high-quality chunking
- Semantic Search: Dense vector retrieval using Google Gemini Embeddings and pgvector
- Cross-Encoder Reranking: Uses Cohere rerank-v3.5 to improve retrieval precision
- Grounded LLM Answers: Llama 3.3 70B via Groq answers only from retrieved context, reducing hallucinations
- Session-Based Chat: Upload documents at runtime, indexed in FAISS without touching the global database
- Intent-Aware Routing: Classifies every query as conversational or retrieval — greetings and small talk never trigger unnecessary vector search
- Conversation Memory: Sliding window memory per session enables natural multi-turn conversations and follow-up questions
- Query Rewriting: Vague coreference queries are automatically rewritten into precise search queries using conversation history
- Source Transparency: Every answer includes source chunks so users can verify the retrieved context
- Modern Frontend: Responsive chat UI built with React, TypeScript, and Tailwind CSS
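
The two-stage retrieval idea listed above (a broad similarity search, then a more precise rerank of the shortlist) can be sketched in plain Python. This is an illustration under assumptions: the project performs stage one in pgvector and stage two with Cohere, whereas here both are replaced by local stand-ins, and `two_stage_retrieve`, `word_overlap`, and the toy vectors are hypothetical.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_retrieve(
    query_vec: list[float],
    query_text: str,
    chunks: list[dict],
    rerank_score: Callable[[str, str], float],
    candidates: int = 20,
    top_n: int = 5,
) -> list[dict]:
    # Stage 1: cheap vector similarity builds a shortlist.
    shortlist = sorted(
        chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True
    )[:candidates]
    # Stage 2: a slower, more precise scorer reorders only the shortlist.
    return sorted(
        shortlist, key=lambda c: rerank_score(query_text, c["text"]), reverse=True
    )[:top_n]


def word_overlap(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: counts shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


chunks = [
    {"text": "pgvector stores embeddings in PostgreSQL", "embedding": [1.0, 0.0]},
    {"text": "FAISS keeps an in-memory index", "embedding": [0.9, 0.1]},
    {"text": "Tailwind CSS styles the frontend", "embedding": [0.0, 1.0]},
]
results = two_stage_retrieve(
    [0.8, 0.2], "in-memory index", chunks, word_overlap, candidates=2, top_n=1
)
```

The design point is that the expensive scorer only ever sees `candidates` chunks, so its cost stays fixed as the collection grows.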
             ┌──────────────────┐
             │  React Frontend  │
             └────────┬─────────┘
                      │
                      ▼
             ┌──────────────────┐
             │ FastAPI Backend  │
             └────────┬─────────┘
                      │
             ┌────────▼─────────┐
             │   Query Router   │ ← Gemini Flash (intent classification)
             └────────┬─────────┘
                      │
       ┌──────────────┴──────────────┐
       │                             │
       ▼                             ▼
CONVERSATIONAL                   RETRIEVAL
       │                             │
       │                    ┌────────▼────────┐
       │                    │ Query Rewriter  │ ← resolves coreferences
       │                    └────────┬────────┘
       │                             │
       │          ┌──────────────────┼──────────────────┐
       │          │                                     │
       │          ▼                                     ▼
       │  ┌───────────────┐                    ┌─────────────────┐
       │  │  Global RAG   │                    │   Session RAG   │
       │  │   pgvector    │                    │      FAISS      │
       │  │     Neon      │                    │ In-Memory Index │
       │  └───────┬───────┘                    └────────┬────────┘
       │          │                                     │
       │          ▼                                     ▼
       │  Similarity Search                     Similarity Search
       │          │                                     │
       │          ▼                                     │
       │   Cohere Reranker                              │
       │          │                                     │
       └──────────┴────────────┬────────────────────────┘
                               │
                               ▼
                     ┌─────────────────────┐
                     │ Groq LLM Generator  │ ← history + context injected
                     │ (+ Memory/History)  │
                     └─────────┬───────────┘
                               │
                               ▼
                         Final Answer
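
The routing step at the top of the diagram can be sketched as a small dispatcher. This is a hedged illustration rather than the project's `query_router.py`: `Intent`, `parse_intent`, `answer`, and the stub handlers are hypothetical names, and falling back to RETRIEVAL on an unrecognized label is an assumption, not documented behavior.

```python
from enum import Enum
from typing import Callable


class Intent(str, Enum):
    CONVERSATIONAL = "CONVERSATIONAL"
    RETRIEVAL = "RETRIEVAL"


def parse_intent(raw_label: str) -> Intent:
    """Normalize the classifier's text output into an Intent.

    Assumption: anything unrecognized falls back to RETRIEVAL, so an
    unclear label still produces a document-grounded answer.
    """
    if raw_label.strip().upper() == Intent.CONVERSATIONAL.value:
        return Intent.CONVERSATIONAL
    return Intent.RETRIEVAL


def answer(
    query: str,
    classify: Callable[[str], str],
    chat: Callable[[str], str],
    rag: Callable[[str], str],
) -> str:
    """Classify first, then call exactly one of the two handlers."""
    if parse_intent(classify(query)) is Intent.CONVERSATIONAL:
        return chat(query)  # no retrieval triggered
    return rag(query)  # full RAG pipeline


# Stub classifier and handlers so the sketch runs without any API keys.
def fake_classifier(query: str) -> str:
    return "conversational" if query.lower().startswith("hello") else "retrieval"


greeting = answer("Hello there", fake_classifier, lambda q: "chat", lambda q: "rag")
lookup = answer("What is NemoClaw?", fake_classifier, lambda q: "chat", lambda q: "rag")
```

Passing the classifier and handlers in as callables keeps the routing logic testable without calling Gemini or Groq.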
- Python
- PyMuPDF4LLM + LangChain Text Splitters (Document processing)
- Google Gemini Embeddings (gemini-embedding-001)
- PostgreSQL + pgvector (Neon) (Persistent vector database)
- FAISS (In-memory vector index for session-based retrieval)
- Cohere Rerank (rerank-v3.5 for cross-encoder reranking)
- Gemini Flash (gemini-2.0-flash-lite) (Intent classification / query routing)
- Sliding Window Memory (Per-session conversation history via in-memory registry)
- Groq API (Accessing Llama 3.3 70B Versatile)
- FastAPI (Backend API framework with Pydantic & Psycopg)
- React + TypeScript + Vite (Modern frontend)
- Tailwind CSS (Frontend styling)
fullstack-rag-application
│
├── documents/ # Source documents used for ingestion
│ ├── markdown/ # Markdown files (NemoClaw documentation)
│ ├── pdfs/ # PDF files (Apple product tech specs)
│ └── text/ # Plain text files (Space exploration articles)
│
├── frontend/ # React + TypeScript frontend application
│ └── src/
│ ├── components/ # Reusable UI components (chat, input, source chunks)
│ ├── types/ # TypeScript type definitions
│ ├── api.ts # API calls to the FastAPI backend
│ ├── App.tsx # Root application component
│ └── main.tsx # Application entry point
│
├── notebooks/ # Jupyter notebooks for experiments and pipeline testing
│
├── src/ # Core backend source code
│ ├── core/ # Main RAG pipeline modules
│ │ ├── chunker.py # Document chunking logic
│ │ ├── embedding.py # Gemini embedding generation
│ │ ├── reranker.py # Cohere cross-encoder reranking
│ │ ├── retriever.py # Vector similarity retrieval
│ │ ├── vector_store.py # pgvector and FAISS store management
│ │ ├── query_router.py # Intent classifier (CONVERSATIONAL vs RETRIEVAL)
│ │ ├── memory.py # Sliding window conversation memory + session registry
│ │ └── llama_generator.py # Groq LLM answer generation
│ ├── loaders/ # Document loaders for each file type (pdf, md, txt)
│ ├── preprocess/ # Document cleaners for each file type (pdf, md, txt)
│ ├── models/ # Data model schemas
│ └── utils/ # Config and path utility helpers
│
├── app.py # FastAPI application and route definitions
├── ingest.py # Document ingestion pipeline (load → clean → chunk → embed → store)
├── session_pipeline.py # Session-based FAISS pipeline for runtime document uploads
├── golden_QA.md # Golden Q&A dataset for evaluation and benchmarking
├── requirements.txt # Python dependencies
└── pyproject.toml # Project metadata and build configuration
    git clone https://github.com/yourusername/fullstack-rag-application.git
    cd fullstack-rag-application

    conda create -p env python=3.11 -y
    conda activate ./env

or

    python -m venv env
    source env/bin/activate    # on Windows: env\Scripts\activate

    pip install -r requirements.txt

Create a .env file in the root directory and add:

    GEMINI_API_KEY=xxxxxxxxxxxx
    GROQ_API_KEY=xxxxxxxxxxxx
    COHERE_API_KEY=xxxxxxxxxxxx
    DATABASE_URL=postgresql://user:password@localhost:5432/rag_db

    python ingest.py

    uvicorn app:app --reload

Backend available at http://localhost:8000, Swagger docs at http://localhost:8000/docs

    cd frontend
    npm install
    npm run dev

Frontend available at http://localhost:5173
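
Before running the ingestion or the server, it can help to confirm that the four settings listed in the .env file are actually present. The check below is a small sketch, not part of the repository: `missing_settings` is a hypothetical helper, and it only inspects the process environment, so it assumes the .env values have already been loaded by whatever mechanism the project uses.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the .env template above.
REQUIRED_VARS = ("GEMINI_API_KEY", "GROQ_API_KEY", "COHERE_API_KEY", "DATABASE_URL")


def missing_settings(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
partial = missing_settings({"GEMINI_API_KEY": "xxxxxxxxxxxx", "GROQ_API_KEY": ""})
```

Calling `missing_settings()` with no argument checks the real environment; an empty result means all four values are set.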
- Global Knowledge Base Chat: Ask questions about any pre-ingested documents
- "What is NemoClaw?"
- "Summarize the key points from the research paper."
- Session Document Upload: Upload your own PDF, Markdown, or Text file and chat with it
- "Summarize the uploaded document."
- "What are the main conclusions?"
- Source Verification: Every answer displays the retrieved source chunks so you can verify context
- API Access:
  - Global chat: POST /api/chat/global with x-session-id header
  - Session chat: POST /api/chat/session with x-session-id header
  - Upload: POST /api/upload with x-session-id header
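
The chat endpoints listed above can be called from Python with the standard library alone. The sketch below only builds the request; the helper name `build_chat_request` is hypothetical, the base URL assumes the local backend from the setup steps, and sending the body as JSON with a `question` field follows the request examples in this README. The response shape is not documented here, so the sketch does not parse it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: backend running locally


def build_chat_request(
    question: str, session_id: str, scope: str = "global"
) -> urllib.request.Request:
    """Build a POST request for /api/chat/global or /api/chat/session."""
    if scope not in ("global", "session"):
        raise ValueError("scope must be 'global' or 'session'")
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/chat/{scope}",
        data=body,
        headers={"Content-Type": "application/json", "x-session-id": session_id},
        method="POST",
    )


request = build_chat_request("What is NemoClaw?", "session_123")

# To send it against a running backend:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Reusing the same `session_id` across calls is what ties follow-up questions to the same conversation history.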
📂 Full Q&A pairs with expected answers are available in golden_QA.md.
A Golden Q&A dataset is included to benchmark and validate the RAG pipeline's retrieval and answer quality across all three supported document types — Markdown, Text, and PDF.
The dataset covers 30 primary evaluation pairs and 5 documented failed cases, making it suitable for both pass/fail testing and failure analysis.
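
A pass/fail run over such a dataset might look like the sketch below. It is an illustration only: the pair format (`question` plus `expected_keywords`), the keyword-containment check, and the stub answers are all assumptions, and they do not reflect the actual layout of golden_QA.md or how the project scores answers.

```python
from typing import Callable


def evaluate(pairs: list[dict], answer_fn: Callable[[str], str]) -> list[dict]:
    """Mark a pair as passed when every expected keyword appears in the answer."""
    results = []
    for pair in pairs:
        answer = answer_fn(pair["question"]).lower()
        passed = all(kw.lower() in answer for kw in pair["expected_keywords"])
        results.append({"question": pair["question"], "passed": passed})
    return results


def pass_rate(results: list[dict]) -> float:
    """Fraction of passed pairs, or 0.0 for an empty run."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0


# Toy pairs and a stub pipeline, for illustration only.
pairs = [
    {"question": "Which index serves session uploads?", "expected_keywords": ["FAISS"]},
    {"question": "Which model reranks results?", "expected_keywords": ["rerank-v3.5"]},
]
stub_answers = {
    pairs[0]["question"]: "Session uploads are indexed in FAISS.",
    pairs[1]["question"]: "I don't know.",
}
results = evaluate(pairs, lambda q: stub_answers[q])
failures = [r["question"] for r in results if not r["passed"]]
```

Keyword containment is a crude proxy for answer quality; it is used here only because it needs no external services. The `failures` list is the natural starting point for failure analysis.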
Documents (PDF, Markdown, TXT)
│
▼
Document Loader
│
▼
Preprocessing & Cleaning
│
▼
Metadata-Aware Chunking
│
▼
Gemini Embeddings (gemini-embedding-001)
│
▼
pgvector Store (Neon)
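
The chunking step in this pipeline can be illustrated with a simplified sketch of overlapping chunks that carry deterministic IDs and metadata. It is not the project's `chunker.py`: the project describes header-aware and recursive character splitting, while this sketch uses a plain fixed-size character window, and the ID scheme (a truncated SHA-256 of source file, index, and text) is an assumption chosen for illustration.

```python
import hashlib


def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size character windows where neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once a window has reached the end, so no chunk is a pure repeat.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def chunk_id(source_file: str, chunk_index: int, text: str) -> str:
    """Same inputs always give the same ID, so re-ingestion is repeatable."""
    digest = hashlib.sha256(f"{source_file}:{chunk_index}:{text}".encode("utf-8"))
    return digest.hexdigest()[:16]


def build_chunks(
    text: str, source_file: str, file_type: str, chunk_size: int = 500, overlap: int = 50
) -> list[dict]:
    parent_id = hashlib.sha256(source_file.encode("utf-8")).hexdigest()[:16]
    return [
        {
            "chunk_id": chunk_id(source_file, index, piece),
            "text": piece,
            "source_file": source_file,
            "file_type": file_type,
            "chunk_index": index,
            "parent_document_id": parent_id,
        }
        for index, piece in enumerate(split_with_overlap(text, chunk_size, overlap))
    ]


chunks = build_chunks("abcdefghij", "notes.txt", "txt", chunk_size=4, overlap=2)
```

Because the ID depends only on the source file, position, and text, ingesting the same document twice yields the same IDs, which makes it possible to detect or overwrite duplicates in the store.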
User Query
    │
    ▼
Intent Classifier (Gemini Flash)
    │
    ├──── CONVERSATIONAL ────────────────────────────┐
    │                                                │
    └──── RETRIEVAL                                  │
              │                                      │
              ▼                                      │
      Coreference Check                              │
      (needs rewrite?)                               │
         │          │                                │
        YES        NO                                │
         │          │                                │
         ▼          │                                │
       Query        │                                │
     Rewriter       │                                │
         │          │                                │
         └────┬─────┘                                │
              │                                      │
              ▼                                      │
      Similarity Search                              │
     (pgvector / FAISS)                              │
              │                                      │
              ▼                                      │
      Cohere Reranking                               │
              │                                      │
              └──────────────────┬───────────────────┘
                                 │
                                 ▼
                      Groq LLM: Llama 3.3 70B
                    (context + memory injected)
                                 │
                                 ▼
                         Grounded Response
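
The "needs rewrite?" decision in the flow above can be approximated with a simple heuristic. This sketch is an assumption-heavy stand-in: the project may well make this decision with an LLM, and the word list, `needs_rewrite`, `prepare_search_query`, and the stub rewriter are all hypothetical.

```python
import re
from typing import Callable

# Words that often signal a reference to earlier turns (illustrative list).
VAGUE_TERMS = {"it", "this", "that", "they", "them", "these", "those", "more"}


def needs_rewrite(query: str) -> bool:
    """Flag queries that contain a vague referring word."""
    words = re.findall(r"[a-z']+", query.lower())
    return any(word in VAGUE_TERMS for word in words)


def prepare_search_query(
    query: str,
    history: list[str],
    rewrite: Callable[[str, list[str]], str],
) -> str:
    """Rewrite only when there is history to resolve the reference against."""
    if history and needs_rewrite(query):
        return rewrite(query, history)
    return query


# Stub rewriter: a real one would ask an LLM for a self-contained query.
def stub_rewrite(query: str, history: list[str]) -> str:
    return f"{query} (about: {history[-1]})"


rewritten = prepare_search_query("tell me more", ["pgvector"], stub_rewrite)
unchanged = prepare_search_query("What is NemoClaw?", ["pgvector"], stub_rewrite)
```

A word list like this produces false positives (for example, "that" used as a conjunction), which is one reason a model-based check may be preferable in practice.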
POST /api/chat/global

Headers:
x-session-id: session_123

Request:
{
  "question": "What is NemoClaw?"
}

POST /api/upload

Headers:
x-session-id: session_123

POST /api/chat/session

Headers:
x-session-id: session_123

Request:
{
  "question": "Summarize the uploaded document"
}

- Hybrid search (BM25 + Dense Retrieval)
- HNSW indexing
- Persistent conversation memory across sessions (database-backed)
- Multi-modal retrieval
- Citation highlighting in the UI
- Docker and Kubernetes deployment
- LangGraph agent workflows
- Evaluation framework integration (RAGAS)
💡 Have an idea? Feel free to open an issue or submit a pull request!
This project is licensed under the MIT License. See the LICENSE file for details.

