🚀 Full Stack RAG Application

End-to-End RAG Application: "Intelligent Document Question-Answering System"

Production-Ready Retrieval-Augmented Generation with FastAPI

🚀 Live Application

🌐 The application is deployed and live

Note

The initial load of the web app may take 1-2 minutes. Once loaded, refresh the page to ensure all features work correctly.

Tip

For the best experience, please refer to the Usage Guide section below to learn how to navigate and use the web app effectively.

📌 Overview

This project is a full-stack Retrieval-Augmented Generation (RAG) application that enables users to upload and query documents, retrieve semantically relevant context, and generate grounded answers using Large Language Models. Built with FastAPI, PostgreSQL + pgvector, FAISS, Google Gemini Embeddings, Cohere Reranker, and Groq & Gemini LLMs.

The application combines Semantic Search, Vector Databases, Cross-Encoder Reranking, LLM-Based Answer Generation, and a Modern React Frontend to deliver accurate and explainable answers from custom document collections.

🎯 Project Overview

1. Multi-Format Document Support

Supports ingestion of PDF, Markdown, and Plain Text documents
Each document type has a dedicated cleaning pipeline for high-quality preprocessing

2. Intelligent Document Preprocessing

PDF Cleaning: Removes headers/footers, website artifacts, OCR-style word fragmentation, image placeholders, and normalizes whitespace
Markdown Cleaning: Removes navigation sections and Mermaid diagrams, converts markdown links to clean text, preserves document hierarchy, and repairs broken tables
Text Cleaning: Removes citation markers, repairs paragraph flow, normalizes whitespace, and handles formatting artifacts

3. Metadata-Aware Chunking

Implemented markdown header-aware splitting with recursive character chunking
Configurable chunk size and overlap with section preservation
Deterministic chunk IDs with rich metadata stored per chunk including source file, file type, page number, section name, chunk index, and parent document ID

4. Embedding Generation

Leveraged Google Gemini Embeddings (gemini-embedding-001) for vector embeddings
Supports document and query embeddings with batch processing
Includes free-tier rate-limit protection with automatic throttling

5. Vector Storage with PostgreSQL + pgvector

Persistent knowledge base backed by PostgreSQL with the pgvector extension
Stores chunk text, embeddings, and source metadata
Supports similarity search, metadata filtering, and persistent storage

6. Advanced Retrieval Pipeline

Two-stage retrieval: pgvector similarity search followed by Cohere Cross-Encoder Reranking (rerank-v3.5)
Reranking improves precision, reduces irrelevant chunks, and raises overall answer quality

7. LLM Answer Generation

Integrated Llama 3.3 70B Versatile via Groq for fast, grounded answer generation
The model only answers from retrieved context, cites sources, reduces hallucinations, and returns explainable responses
Separate temperature settings for conversational replies vs. retrieval-grounded answers

8. Intent-Aware Query Routing

Every user query is classified as CONVERSATIONAL or RETRIEVAL before any processing
Conversational queries (greetings, capability questions, small talk) are handled directly by the LLM — no retrieval triggered
Retrieval queries are routed through the full RAG pipeline
Uses Gemini Flash (gemini-2.0-flash-lite) as a lightweight, low-latency classifier — preserving Groq token quota for generation

9. Conversation Memory

Implements sliding window memory storing the last 10 messages (user + assistant) per session
Memory is scoped per session ID — each browser tab maintains isolated conversation history
History is injected into both conversational and retrieval responses, enabling natural follow-up questions
Supports coreference resolution — vague queries like "what does it mean?" or "tell me more" are rewritten into self-contained search queries before retrieval

10. Session-Based Document Chat

Users can upload documents at runtime; these are chunked, embedded, and indexed in FAISS without modifying the global database
Provides temporary workspaces with fast retrieval and session isolation

11. Modern React Frontend

Built with React, TypeScript, Vite, and Tailwind CSS
Features a chat interface, source chunk viewer, session uploads, responsive design, and real-time API integration

🚀 Features

Multi-Format Ingestion: Upload and query PDF, Markdown, and Text documents seamlessly
Intelligent Preprocessing: Dedicated cleaning pipelines per document type for high-quality chunking
Semantic Search: Dense vector retrieval using Google Gemini Embeddings and pgvector
Cross-Encoder Reranking: Uses Cohere rerank-v3.5 to improve retrieval precision
Grounded LLM Answers: Llama 3.3 70B via Groq answers only from retrieved context, reducing hallucinations
Session-Based Chat: Upload documents at runtime, indexed in FAISS without touching the global database
Intent-Aware Routing: Classifies every query as conversational or retrieval — greetings and small talk never trigger unnecessary vector search
Conversation Memory: Sliding window memory per session enables natural multi-turn conversations and follow-up questions
Query Rewriting: Vague coreference queries are automatically rewritten into precise search queries using conversation history
Source Transparency: Every answer includes source chunks so users can verify the retrieved context
Modern Frontend: Responsive chat UI built with React, TypeScript, and Tailwind CSS

🏗️ System Architecture

                    ┌──────────────────┐
                    │ React Frontend   │
                    └────────┬─────────┘
                             │
                             ▼
                    ┌──────────────────┐
                    │ FastAPI Backend  │
                    └────────┬─────────┘
                             │
                    ┌────────▼─────────┐
                    │  Query Router    │  ← Gemini Flash (intent classification)
                    └────────┬─────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
     CONVERSATIONAL                     RETRIEVAL
              │                             │
              │                    ┌────────▼────────┐
              │                    │  Query Rewriter │  ← resolves coreferences
              │                    └────────┬────────┘
              │                             │
              │          ┌──────────────────┼──────────────────┐
              │          │                                     │
              │          ▼                                     ▼
              │  ┌───────────────┐                 ┌─────────────────┐
              │  │  Global RAG   │                 │   Session RAG   │
              │  │   pgvector    │                 │     FAISS       │
              │  │     Neon      │                 │ In-Memory Index │
              │  └───────┬───────┘                 └────────┬────────┘
              │          │                                   │
              │          ▼                                   ▼
              │    Similarity Search               Similarity Search
              │          │                                   │
              │          ▼                                   │
              │   Cohere Reranker                            │
              │          │                                   │
              └──────────┴───────────────────────────────────┘
                         │
                         ▼
              ┌─────────────────────┐
              │  Groq LLM Generator │  ← history + context injected
              │  (+ Memory/History) │
              └─────────┬───────────┘
                        │
                        ▼
                   Final Answer

🏗️ Tech Stack

Python
PyMuPDF4LLM + LangChain Text Splitters (Document processing)
Google Gemini Embeddings (gemini-embedding-001)
PostgreSQL + pgvector (Neon) (Persistent vector database)
FAISS (In-memory vector index for session-based retrieval)
Cohere Rerank (rerank-v3.5 for cross-encoder reranking)
Gemini Flash (gemini-2.0-flash-lite) (Intent classification / query routing)
Sliding Window Memory (Per-session conversation history via in-memory registry)
Groq API (Accessing Llama 3.3 70B Versatile)
FastAPI (Backend API framework with Pydantic & Psycopg)
React + TypeScript + Vite (Modern frontend)
Tailwind CSS (Frontend styling)

📂 Project Structure

fullstack-rag-application
│
├── documents/                        # Source documents used for ingestion
│   ├── markdown/                     # Markdown files (NemoClaw documentation)
│   ├── pdfs/                         # PDF files (Apple product tech specs)
│   └── text/                         # Plain text files (Space exploration articles)
│
├── frontend/                         # React + TypeScript frontend application
│   └── src/
│       ├── components/               # Reusable UI components (chat, input, source chunks)
│       ├── types/                    # TypeScript type definitions
│       ├── api.ts                    # API calls to the FastAPI backend
│       ├── App.tsx                   # Root application component
│       └── main.tsx                  # Application entry point
│
├── notebooks/                        # Jupyter notebooks for experiments and pipeline testing
│
├── src/                              # Core backend source code
│   ├── core/                         # Main RAG pipeline modules
│   │   ├── chunker.py                # Document chunking logic
│   │   ├── embedding.py              # Gemini embedding generation
│   │   ├── reranker.py               # Cohere cross-encoder reranking
│   │   ├── retriever.py              # Vector similarity retrieval
│   │   └── vector_store.py           # pgvector and FAISS store management
│   │   ├── query_router.py           # Intent classifier (CONVERSATIONAL vs RETRIEVAL)
│   │   ├── memory.py                 # Sliding window conversation memory + session registry
│   │   ├── llama_generator.py        # Groq LLM answer generation
│   ├── loaders/                      # Document loaders for each file type (pdf, md, txt)
│   ├── preprocess/                   # Document cleaners for each file type (pdf, md, txt)
│   ├── models/                       # Data models schema
│   └── utils/                        # Config and path utility helpers
│
├── app.py                            # FastAPI application and route definitions
├── ingest.py                         # Document ingestion pipeline (load → clean → chunk → embed → store)
├── session_pipeline.py               # Session-based FAISS pipeline for runtime document uploads
├── golden_QA.md                      # Golden Q&A dataset for evaluation and benchmarking
├── requirements.txt                  # Python dependencies
└── pyproject.toml                    # Project metadata and build configuration

🚀 Installation & Setup

1️⃣ Clone the Repository

git clone https://github.com/yourusername/fullstack-rag-application.git
cd fullstack-rag-application

2️⃣ Create a Virtual Environment

conda create -p env python=3.11 -y
conda activate env

or

python -m venv env

3️⃣ Install Dependencies

pip install -r requirements.txt

4️⃣ Set Up Environment Variables

Create a .env file in the root directory and add:

GEMINI_API_KEY=xxxxxxxxxxxx
GROQ_API_KEY=xxxxxxxxxxxx
COHERE_API_KEY=xxxxxxxxxxxx
DATABASE_URL=postgresql://user:password@localhost:5432/rag_db

5️⃣ Ingest Documents

python ingest.py

6️⃣ Run the Backend

uvicorn app:app --reload

Backend available at http://localhost:8000, Swagger docs at http://localhost:8000/docs

7️⃣ Run the Frontend

cd frontend
npm install
npm run dev

Frontend available at http://localhost:5173

🌐 Usage Guide

👉 Access the web app

Global Knowledge Base Chat: Ask questions about any pre-ingested documents
- "What is NemoClaw?"
- "Summarize the key points from the research paper."
Session Document Upload: Upload your own PDF, Markdown, or Text file and chat with it
- "Summarize the uploaded document."
- "What are the main conclusions?"
Source Verification: Every answer displays the retrieved source chunks so you can verify context
API Access:
- Global chat: POST /api/chat/global with x-session-id header
- Session chat: POST /api/chat/session with x-session-id header
- Upload: POST /api/upload with x-session-id header

🧪 Evaluation Framework

📂 Full Q&A pairs with expected answers are available here → golden_QA.md

A Golden Q&A dataset is included to benchmark and validate the RAG pipeline's retrieval and answer quality across all three supported document types — Markdown, Text, and PDF.

The dataset covers 30 primary evaluation pairs and 5 documented failed cases, making it suitable for both pass/fail testing and analysis.

📊 Pipelines

Ingestion Pipeline

Documents (PDF, Markdown, TXT)
    │
    ▼
Document Loader
    │
    ▼
Preprocessing & Cleaning
    │
    ▼
Metadata-Aware Chunking
    │
    ▼
Gemini Embeddings (gemini-embedding-001)
    │
    ▼
pgvector Store (Neon)

Query Pipeline

User Query
    │
    ▼
Intent Classifier (Gemini Flash)
    │
    ├──── CONVERSATIONAL ────────────────────────────┐
    │                                                │
    └──── RETRIEVAL                                  │
              │                                      │
              ▼                                      │
    Coreference Check                                │
    (needs rewrite?)                                 │
       │          │                                  │
      YES         NO                                 │
       │          │                                  │
       ▼          │                                  │
    Query         │                                  │
    Rewriter      │                                  │
       │          │                                  │
       └────┬─────┘                                  │
            │                                        │
            ▼                                        │
    Similarity Search                                │
    (pgvector / FAISS)                               │
            │                                        │
            ▼                                        │
    Cohere Reranking                                 │
            │                                        │
            └──────────────┬─────────────────────────┘
                           │
                           ▼
              Groq LLM — Llama 3.3 70B
              (context + memory injected)
                           │
                           ▼
                   Grounded Response

📡 API Endpoints

Global Knowledge Base Chat

POST /api/chat/global

Headers:

x-session-id: session_123

Request:

{
  "question": "What is NemoClaw?"
}

Upload Documents

POST /api/upload

Headers:

x-session-id: session_123

Session Chat

POST /api/chat/session

Headers:

x-session-id: session_123

Request:

{
  "question": "Summarize the uploaded document"
}

📸 Screenshots

Screenshot of the web application:

Screenshot of the chat interface:

🎯 Future Improvements

Hybrid search (BM25 + Dense Retrieval)
HNSW indexing
Persistent conversation memory across sessions (database-backed)
Multi-modal retrieval
Citation highlighting in the UI
Docker and Kubernetes deployment
LangGraph agent workflows
Evaluation framework integration (RAGAS)

🤝 Contributing

💡 Have an idea? Feel free to contribute or open an issue and pull requests!

📄 License

This project is licensed under the MIT License – LICENSE

Name		Name	Last commit message	Last commit date
Latest commit History 133 Commits
documents		documents
frontend		frontend
notebooks		notebooks
readme_images		readme_images
src		src
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
app.py		app.py
golden_QA.md		golden_QA.md
ingest.py		ingest.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
session_pipeline.py		session_pipeline.py

Folders and files

Latest commit

History

Repository files navigation

🚀 Full Stack RAG Application

End-to-End RAG Application: "Intelligent Document Question-Answering System"

Production-Ready Retrieval-Augmented Generation with FastAPI

🚀 Live Application

📌 Overview

🎯 Project Overview

1. Multi-Format Document Support

2. Intelligent Document Preprocessing

3. Metadata-Aware Chunking

4. Embedding Generation

5. Vector Storage with PostgreSQL + pgvector

6. Advanced Retrieval Pipeline

7. LLM Answer Generation

8. Intent-Aware Query Routing

9. Conversation Memory

10. Session-Based Document Chat

11. Modern React Frontend

🚀 Features

🏗️ System Architecture

🏗️ Tech Stack

📂 Project Structure

🚀 Installation & Setup

1️⃣ Clone the Repository

2️⃣ Create a Virtual Environment

3️⃣ Install Dependencies

4️⃣ Set Up Environment Variables

5️⃣ Ingest Documents

6️⃣ Run the Backend

7️⃣ Run the Frontend

🌐 Usage Guide

🧪 Evaluation Framework

📊 Pipelines

Ingestion Pipeline

Query Pipeline

📡 API Endpoints

Global Knowledge Base Chat

Upload Documents

Session Chat

📸 Screenshots

Screenshot of the web application:

Screenshot of the chat interface:

🎯 Future Improvements

🤝 Contributing

📄 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages