An end-to-end repository question answering system that indexes a public GitHub codebase, retrieves grounded code evidence, and generates cited answers through a retrieval-augmented generation pipeline.
This project includes:
- a React frontend for repository submission and conversational querying
- a FastAPI backend for indexing and answering questions
- a hybrid retrieval pipeline with semantic search, BM25, and reranking
- an evaluation harness for measuring retrieval quality and answer grounding
This project is written to show the parts recruiters and engineering reviewers usually look for in a personal AI project:
- clear system design
- practical frontend, backend, and deployment integration
- retrieval and ranking logic beyond a single LLM prompt
- measurable evaluation instead of anecdotal demos
- thoughtful tradeoffs around cost, latency, and persistence
Code Compass brings those elements together in one end-to-end application with visible architecture, source citations, and an evaluation harness ready for benchmark results.
- A user pastes a GitHub repository URL into the UI.
- The backend clones the repository into a temporary local directory.
- Source files are filtered and chunked using tree-sitter and fallback text chunking.
- The system generates embeddings for chunks and stores them in a Chroma-backed vector layer.
- At query time, the system retrieves evidence with:
  - semantic vector search
  - lexical BM25 search
  - reciprocal rank fusion
  - cross-encoder reranking
- The top grounded chunks are passed to the LLM to generate a concise answer.
- The UI displays the answer with file-level citations and GitHub source links.
                 ┌──────────────────────┐
                 │       React UI       │
                 │  repo submit + chat  │
                 │  citations + status  │
                 └──────────┬───────────┘
                            │ HTTP / JSON
                            ▼
                 ┌──────────────────────┐
                 │    FastAPI Server    │
                 │   routes + session   │
                 │      validation      │
                 └──────────┬───────────┘
                            │
                            ▼
    ┌──────────────────────────────────────────────┐
    │              CodebaseRAGSystem               │
    │ indexing orchestration + query orchestration │
    └──────┬────────────────┬────────────────┬─────┘
           │                │                │
           │                │                │
           ▼                ▼                ▼
    ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
    │ RepoFetcher  │ │ CodeParser   │ │ Embeddings   │
    │ clone/filter │ │ tree-sitter  │ │ Bedrock/local│
    └──────┬───────┘ │ fallback     │ └──────┬───────┘
           │         └──────┬───────┘        │
           │                │                │
           └───────┬────────┴────────┬───────┘
                   ▼                 ▼
            ┌──────────────┐  ┌──────────────┐
            │ In-memory    │  │ Chroma       │
            │ repo/session │  │ vector store │
            │ state        │  └──────┬───────┘
            └──────────────┘         │
                                     ▼
                            ┌──────────────────┐
                            │ Hybrid Retrieval │
                            │ semantic + BM25  │
                            │ + reranking      │
                            └────────┬─────────┘
                                     ▼
                            ┌──────────────────┐
                            │ LLM Answerer     │
                            │ grounded answer  │
                            │ + citations      │
                            └──────────────────┘
- React 19
- Tailwind CSS
- Axios for API communication
Responsibilities:
- collect the GitHub repository URL
- poll indexing state
- send chat questions and prior conversation turns
- render markdown-like answers
- display cited files, symbols, and line ranges
Main entry points:
- FastAPI
- Pydantic
- in-memory session and repository state
Responsibilities:
- validate requests
- manage session-scoped repository state
- run indexing in the background
- execute retrieval and answer generation
- return grounded answers and source metadata
Main entry points:
- tree-sitter for code-aware chunking
- Amazon Bedrock or local embeddings for semantic retrieval depending on environment
- BM25 for lexical retrieval
- reciprocal rank fusion to combine retrieval channels
- a cross-encoder reranker for final source ordering
- Groq or Amazon Bedrock generation depending on environment configuration
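To make the lexical channel in the list above concrete, here is a minimal sketch of Okapi BM25 scoring over pre-tokenized chunks. It is illustrative only: the function name, the `k1` and `b` defaults, and the token lists are assumptions, and the project may rely on a library implementation instead.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    `documents` must be a non-empty list of token lists.
    """
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_counts = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_counts[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization penalizes chunks longer than average.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical tokenized chunks; an exact identifier match drives the score.
chunks = [
    ["def", "parse_config", "path"],
    ["class", "vector", "store"],
    ["parse_config", "parse_config", "loader", "helper"],
]
scores = bm25_scores(["parse_config"], chunks)
```

Chunks that never mention the identifier score zero, which is the behavior that lets BM25 catch exact symbols a semantic search might rank lower.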
Core modules:
- server/src/code_parser.py
- server/src/embeddings.py
- server/src/hybrid_search.py
- server/src/vector_store.py
- server/src/repo_fetcher.py
POST /api/repos/index
- Backend registers the repo against a session
- Background task clones the repo
- Files are filtered by extension, directory, and size
- Files are chunked into code-aware segments
- Embeddings are generated for each chunk
- Chunks are stored in the vector layer and in in-memory retrieval state
- Metadata and progress are exposed back to the UI
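The filtering step in the indexing flow above (by extension, directory, and size) might look roughly like this. The extension set, skipped directories, and size limit are hypothetical values chosen for illustration, not the project's actual settings.

```python
from pathlib import Path

# Hypothetical filter settings; the project's real lists may differ.
ALLOWED_EXTENSIONS = {".py", ".js", ".ts", ".tsx", ".md"}
SKIPPED_DIRECTORIES = {".git", "node_modules", "dist", "build", "__pycache__"}
MAX_FILE_BYTES = 200_000


def should_index(path: Path, root: Path) -> bool:
    """Return True if a file passes the directory, extension, and size checks."""
    relative = path.relative_to(root)
    # Reject files nested under any skipped directory.
    if any(part in SKIPPED_DIRECTORIES for part in relative.parts[:-1]):
        return False
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        return False
    return path.stat().st_size <= MAX_FILE_BYTES


def collect_indexable_files(root: Path) -> list[Path]:
    """Walk a cloned repository and return the files worth chunking."""
    return sorted(p for p in root.rglob("*") if p.is_file() and should_index(p, root))
```

Checking the directory and extension before calling `stat()` keeps the walk cheap on large dependency folders.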
POST /api/query
- The backend validates the session and repository status
- The question is expanded using lightweight intent heuristics
- Semantic search retrieves candidate chunks
- BM25 retrieves lexical matches
- Results are fused and reranked
- Final sources are selected and passed to the LLM
- The backend returns:
  - answer
  - confidence
  - sources
  - repository metadata
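The response fields listed above could be modeled roughly as follows. This is a sketch under assumptions: the field names inside `Source`, the `float` type for `confidence`, and the `github_url` helper are hypothetical, repository metadata is omitted for brevity, and plain dataclasses stand in for the backend's Pydantic models so the example runs without dependencies.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Source:
    # Hypothetical citation fields: file, symbol, and line range.
    file_path: str
    symbol: str
    start_line: int
    end_line: int

    def github_url(self, repo_url: str, ref: str = "main") -> str:
        """Build a GitHub link to the cited line range."""
        base = repo_url.rstrip("/").removesuffix(".git")
        return f"{base}/blob/{ref}/{self.file_path}#L{self.start_line}-L{self.end_line}"


@dataclass
class QueryResponse:
    answer: str
    confidence: float  # assumed numeric; the real API may use a label instead
    sources: list[Source] = field(default_factory=list)


source = Source("src/app.py", "create_app", 10, 24)
response = QueryResponse(
    answer="The app is built in create_app.", confidence=0.8, sources=[source]
)
payload = asdict(response)  # nested dict, ready for JSON serialization
link = source.github_url("https://github.com/example/project.git")
```

Keeping the line range on each source is what allows the UI to link a citation straight to the relevant lines on GitHub.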
- fast iteration speed
- strong request validation through Pydantic
- simple background task support
- clean fit for JSON APIs and model-driven backend code
- straightforward stateful UI for a single-page workflow
- easy integration with polling, chat state, and citation rendering
- strong ecosystem for incremental iteration
- better chunk boundaries than naive fixed-length splitting
- lets the system reason around functions, classes, and symbols
- improves retrieval quality for implementation-focused questions
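The idea of code-aware chunking with a text fallback can be illustrated with a small sketch. Note the substitution: it uses Python's built-in `ast` module instead of tree-sitter, so it only handles Python source, and the chunk dictionary layout and `window` size are assumptions.

```python
import ast


def chunk_python_source(source: str, window: int = 40) -> list[dict]:
    """Split source into chunks aligned to top-level functions and classes.

    Falls back to fixed-size line windows when the source cannot be parsed.
    Module-level statements outside functions and classes are skipped here.
    """
    lines = source.splitlines()
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Fallback: plain line windows with no symbol information.
        return [
            {
                "symbol": None,
                "start": i + 1,
                "end": min(i + window, len(lines)),
                "text": "\n".join(lines[i:i + window]),
            }
            for i in range(0, len(lines), window)
        ]
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "symbol": node.name,
                "start": node.lineno,
                "end": node.end_lineno,
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks


example = "import os\n\ndef a():\n    return 1\n\nclass B:\n    def m(self):\n        return 2\n"
symbols = [chunk["symbol"] for chunk in chunk_python_source(example)]
```

Each chunk carries a symbol name and line range, which is the metadata later used for citations.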
Pure semantic search misses exact symbols and file names. Pure lexical search misses semantic intent. This project combines both because code questions often need:
- exact identifiers
- nearby implementation detail
- cross-file semantic similarity
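One common way to combine the two channels is reciprocal rank fusion, sketched below. The function name, the `k=60` constant, and the chunk IDs are assumptions for illustration rather than the project's actual code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the semantic and BM25 channels.
semantic_hits = ["auth.py:login", "session.py:create", "utils.py:hash"]
bm25_hits = ["utils.py:hash", "auth.py:login", "routes.py:index"]
fused = reciprocal_rank_fusion([semantic_hits, bm25_hits])
```

Because the fusion uses ranks rather than raw scores, the semantic and BM25 channels do not need to be normalized onto the same scale. Chunks found by both channels rise above chunks found by only one.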
- simple vector abstraction
- one vector database path for both local and production runtime
- persistent local storage without a separate hosted vector service
- direct support for externally generated embeddings and metadata filters
- repository/session metadata is short-lived and cleared when the backend restarts
- no separate relational database is needed for the current product flow
- the API still exposes indexing status and session-scoped repositories while keeping deployment simpler
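A session-scoped, in-memory registry of the kind described above could be as small as the sketch below. The class names, fields, and status values are hypothetical; the point is only that state lives in process memory and is partitioned by session.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RepoRecord:
    repo_url: str
    status: str = "queued"  # hypothetical states: queued, indexing, ready, failed
    chunk_count: int = 0


@dataclass
class SessionRegistry:
    """Process-local registry; everything is lost when the process exits."""

    _repos: dict = field(default_factory=dict)

    def register(self, session_id: str, repo_id: str, repo_url: str) -> RepoRecord:
        record = RepoRecord(repo_url=repo_url)
        self._repos.setdefault(session_id, {})[repo_id] = record
        return record

    def get(self, session_id: str, repo_id: str) -> Optional[RepoRecord]:
        # A repository is only visible to the session that registered it.
        return self._repos.get(session_id, {}).get(repo_id)


registry = SessionRegistry()
record = registry.register("session-a", "repo-1", "https://github.com/example/project")
record.status = "ready"
```

The tradeoff is the one stated in the list: nothing survives a restart, so a repository has to be indexed again afterwards.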
Local development is configured for higher-quality experimentation:
- Claude Sonnet 4 on Amazon Bedrock for answer generation
- Cohere Embed v4 on Amazon Bedrock for semantic retrieval
This setup is useful for:
- higher-quality local experiments
- comparing retrieval and answer quality in a managed-model environment
Recommended local runtime:
LLM_PROVIDER=bedrock
EMBEDDING_PROVIDER=bedrock
AWS_REGION=us-east-1
BEDROCK_LLM_MODEL=anthropic.claude-sonnet-4-20250514-v1:0
BEDROCK_EMBEDDING_MODEL=cohere.embed-v4:0
BEDROCK_EMBEDDING_DIM=1536
CHROMA_PATH=./data/chroma
CHROMA_COLLECTION=repo_qa_chunks
The evaluation harness is now designed around Amazon Bedrock:
- Bedrock Claude Opus 4 for the RAGAS judge model
- app-configured embeddings during evaluation
Recommended eval runtime:
EVAL_MODEL=anthropic.claude-opus-4-20250514-v1:0
AWS_REGION=us-east-1
The production deployment target is:
- frontend on Vercel
- backend on Hugging Face Spaces
Production inference is configured differently from local development:
- Groq-hosted Llama for answer generation
- lightweight local sentence-transformer embeddings for semantic retrieval
- Chroma DB for vector storage
This production setup was chosen to fit Hugging Face Spaces free-tier constraints more comfortably while keeping the retrieval and answer pipeline intact. Chroma is used in production and local development so the vector storage behavior stays consistent across environments.
Recommended production runtime:
LLM_PROVIDER=groq
EMBEDDING_PROVIDER=local
LIGHTWEIGHT_LOCAL_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
CHROMA_PATH=./data/chroma
CHROMA_COLLECTION=repo_qa_chunks
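Since local and production runs differ only in environment variables, provider selection can be centralized in a small loader. The sketch below is an assumption about how that might be structured: `RuntimeConfig`, `load_runtime_config`, and the fallback defaults are hypothetical, while the variable names and provider values come from the settings above.

```python
import os
from dataclasses import dataclass

# Provider names taken from the runtime settings described above.
LLM_PROVIDERS = {"bedrock", "groq"}
EMBEDDING_PROVIDERS = {"bedrock", "local"}


@dataclass(frozen=True)
class RuntimeConfig:
    llm_provider: str
    embedding_provider: str
    chroma_path: str
    chroma_collection: str


def load_runtime_config(env=None) -> RuntimeConfig:
    """Read provider and Chroma settings from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    config = RuntimeConfig(
        # Fallback values are assumptions mirroring the production example.
        llm_provider=env.get("LLM_PROVIDER", "groq"),
        embedding_provider=env.get("EMBEDDING_PROVIDER", "local"),
        chroma_path=env.get("CHROMA_PATH", "./data/chroma"),
        chroma_collection=env.get("CHROMA_COLLECTION", "repo_qa_chunks"),
    )
    if config.llm_provider not in LLM_PROVIDERS:
        raise ValueError(f"Unsupported LLM_PROVIDER: {config.llm_provider}")
    if config.embedding_provider not in EMBEDDING_PROVIDERS:
        raise ValueError(f"Unsupported EMBEDDING_PROVIDER: {config.embedding_provider}")
    return config


production = load_runtime_config({"LLM_PROVIDER": "groq", "EMBEDDING_PROVIDER": "local"})
local_dev = load_runtime_config({"LLM_PROVIDER": "bedrock", "EMBEDDING_PROVIDER": "bedrock"})
```

Failing fast on an unknown provider name surfaces a misconfigured deployment at startup rather than at the first query.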
- Vercel hosts the React frontend
- Hugging Face Spaces hosts the FastAPI backend
- the backend is packaged and deployed as a Docker Space
- GitHub Actions syncs the backend code to the Space on pushes to main
The backend is deployed with Docker using:
The container:
- installs Python dependencies
- copies the backend application
- starts the FastAPI app with Uvicorn on port 7860
Continuous deployment is handled through:
The workflow:
- runs on pushes to main
- syncs the server/ directory to the Hugging Face Space
- triggers the Docker Space rebuild automatically
The project includes an end-to-end eval harness that calls the live API instead of mocking the retrieval pipeline.
Files:
The benchmark currently measures:
- retrieval hit rate
- top-1 hit rate
- mean reciprocal rank
- source recall
- duplicate source rate
- keyword-based answer checks
- grounded answer rate
- optional RAGAS judge metrics such as faithfulness and answer relevancy
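The first three retrieval metrics in the list above can be computed as in the sketch below. The case layout (`retrieved` as ranked file paths, `expected` as the set of relevant paths) is an assumption about the eval-set shape, and the sample cases are toy data, not benchmark results.

```python
def retrieval_metrics(cases):
    """Compute hit rate, top-1 hit rate, and mean reciprocal rank.

    Each case is a dict with 'retrieved' (ranked file paths) and
    'expected' (set of relevant file paths). `cases` must be non-empty.
    """
    hits = top1 = 0
    reciprocal_ranks = []
    for case in cases:
        # 1-based ranks of every retrieved path that is relevant.
        ranks = [
            rank
            for rank, path in enumerate(case["retrieved"], start=1)
            if path in case["expected"]
        ]
        first = ranks[0] if ranks else None
        hits += first is not None
        top1 += first == 1
        reciprocal_ranks.append(1.0 / first if first else 0.0)
    n = len(cases)
    return {
        "hit_rate": hits / n,
        "top1_hit_rate": top1 / n,
        "mrr": sum(reciprocal_ranks) / n,
    }


# Toy cases for illustration only.
toy_cases = [
    {"retrieved": ["a.ts", "b.ts", "c.ts"], "expected": {"a.ts"}},
    {"retrieved": ["x.ts", "y.ts", "z.ts"], "expected": {"z.ts"}},
    {"retrieved": ["p.ts", "q.ts"], "expected": {"r.ts"}},
    {"retrieved": ["m.ts", "n.ts"], "expected": {"n.ts"}},
]
metrics = retrieval_metrics(toy_cases)
```

Hit rate only asks whether any relevant source was retrieved, while top-1 and MRR reward ranking it early, so the three together separate recall problems from ordering problems.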
The current RAGAS judge configuration uses Bedrock Claude Opus 4 via EVAL_MODEL.
The project includes a measurable end-to-end evaluation workflow alongside the product itself. Metric values are intentionally left pending until the benchmark is rerun, so the README does not claim unverified results.
Current sample benchmark target:
- Documenso (https://github.com/documenso/documenso.git)
- 43 evaluation cases
- 10 categories
- 4 multi-turn conversation cases
- full-application coverage across architecture, docs, setup, API layers, document flows, signing, email, jobs, tests, and follow-up questions
| Metric | Result |
|---|---|
| Retrieval hit rate | To be added after rerun |
| Top-1 hit rate | To be added after rerun |
| Mean reciprocal rank | To be added after rerun |
| Source recall | To be added after rerun |
| Grounded answer rate | To be added after rerun |
| Keyword/checklist pass rate | To be added after rerun |
| Reference-support pass rate | To be added after rerun |
| Faithfulness (RAGAS, supporting) | To be added after rerun |
| Answer relevancy (RAGAS, supporting) | To be added after rerun |
| Context precision (RAGAS, supporting) | To be added after rerun |
What these metrics are meant to show:
- the system should retrieve at least one relevant source for most benchmark cases
- the first-ranked source is expected to be relevant in most cases, with an internal 80% top-1 target
- the benchmark includes architecture, API, setup, docs, tests, cross-file workflows, code-generation checklists, and conversation-style questions
- RAGAS is treated as a secondary judge signal; deterministic retrieval and grounded checklist metrics are the primary gates
Benchmark strengths:
- full-stack application benchmark rather than a library-only benchmark
- product-domain questions around documents, recipients, fields, signing, emails, jobs, and webhooks
- measurable end-to-end performance instead of anecdotal examples once the new run is complete
Benchmark-exposed weaknesses:
- the sample set is focused on one target project, so it should be broadened before being presented as general benchmark evidence
- Documenso is a large TypeScript monorepo, so context precision and directory-level source selection matter more than in the old library-focused eval
- some cross-file, specific-function, and test-heavy questions may remain harder than single-file API questions
- canonical implementation files may not always rank first on the hardest prompts
- full-stack architecture with a clear data flow
- code-aware retrieval rather than plain document retrieval
- practical hybrid search design
- session-aware repo isolation
- source-grounded answer generation
- explicit benchmark and evaluation workflow
- retrieval state is intentionally session-scoped and mostly in memory
- cloned repositories are temporary and deleted after indexing
- repository metadata is lightweight and kept in memory, separate from the Chroma vector state
- if the backend restarts, repositories must be re-indexed
- the benchmark currently covers one target repository and can be expanded across more repositories over time
cd server
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export LLM_PROVIDER=bedrock
export EMBEDDING_PROVIDER=bedrock
export AWS_REGION=us-east-1
export BEDROCK_LLM_MODEL=anthropic.claude-sonnet-4-20250514-v1:0
export BEDROCK_EMBEDDING_MODEL=cohere.embed-v4:0
export BEDROCK_EMBEDDING_DIM=1536
export EVAL_MODEL=anthropic.claude-opus-4-20250514-v1:0
python server_app.py
Backend runs on http://localhost:8000
cd ui
npm install
npm start
Frontend runs on http://localhost:3000
Create ui/.env:
REACT_APP_API_URL=http://localhost:8000
From the server directory:
CODEBASE_RAG_API_URL=http://localhost:8000 \
CODEBASE_RAG_SESSION_ID=<session-id> \
CODEBASE_RAG_REPO_ID=<repo-id> \
CODEBASE_RAG_EVAL_OUTPUT=evals/latest_eval_report.json \
python evals/run_eval.py
The output report includes:
- eval-set audit warnings
- headline metrics
- category breakdowns
- case-by-case detail
- a summary string suitable for project reporting
If you want to save the latest run as a JSON artifact:
CODEBASE_RAG_EVAL_OUTPUT=evals/latest_eval_report.json
server/
  server_app.py
  evals/
  src/
ui/
  src/
README.md

