CodeProbe

Semantic code search engine with hybrid retrieval, AST-aware chunking, and RAG — built and benchmarked against Zulip's ~500 K-line codebase.

Demo

Search results — each chunk card shows its file path, line range, RRF-fused score, and source/language/chunk-type badges. The hybrid pipeline surfaces both semantically similar and exact-identifier matches; the cross-encoder reranker orders them by relevance.

Ask with citations — the RAG answer is grounded in the retrieved chunks. Every claim links back to a specific file:line range so you can jump directly to the source.

Architecture

flowchart LR
    subgraph Ingest
        A[Git repo] --> B[File discovery\n& SHA tracking]
        B --> C[AST chunker\ntree-sitter]
        C --> D[Embedder\nall-MiniLM-L6-v2]
        D --> E[(Qdrant\nvector index)]
        C --> F[(BM25\nin-memory index)]
    end

    subgraph Query
        G[User query] --> H[Embed query]
        H --> I[Vector search\nQdrant]
        G --> J[BM25 keyword\nsearch]
        I & J --> K[Reciprocal\nRank Fusion]
        K --> L[Cross-encoder\nreranker]
        L --> M[Top-N chunks]
        M --> N[LLM\nClaude Haiku 4.5]
        N --> O[Answer +\ncitations]
    end

    E --> I
    F --> J

Quick start

# 1 — spin up Qdrant + API + frontend
docker compose up -d

# 2 — ingest a repo (runs in ~9.5 min for Zulip)
make ingest REPO=data/repos/zulip

# 3 — ask a question
make ask Q="How do I cache a function result keyed by its arguments?"

# or open the UI
open http://localhost:5173

The API is available at http://localhost:8000; interactive docs at /docs.

Benchmark results

Evaluated on a 40-question Zulip question set (eval/questions.jsonl).

Retrieval

| Metric | Score |
| --- | --- |
| Hit@5 | 85.0 % |
| Retrieval@5 | 72.1 % |
| Retrieval@10 | 80.8 % |
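
The README does not spell out how these metrics are defined, so the sketch below rests on an assumption: Hit@K counts a question as a success when at least one relevant chunk appears in the top K, and Retrieval@K is the fraction of a question's relevant chunks found in the top K (recall). The chunk ids and the two-question eval set are made up for illustration and are not taken from `eval/questions.jsonl`.

```python
from typing import List, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if any relevant chunk id appears in the top k, else 0.0."""
    return 1.0 if any(chunk_id in relevant for chunk_id in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return the fraction of relevant chunk ids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Hypothetical eval set: (ranked chunk ids, ground-truth relevant chunk ids).
runs: List[Tuple[List[str], Set[str]]] = [
    (["a.py:1-20", "b.py:5-30", "c.py:1-9"], {"b.py:5-30", "z.py:1-4"}),
    (["d.py:1-10", "e.py:2-8"], {"q.py:3-7"}),
]

hit5 = mean([hit_at_k(ranked, rel, 5) for ranked, rel in runs])
rec5 = mean([recall_at_k(ranked, rel, 5) for ranked, rel in runs])
print(f"Hit@5={hit5:.2f} Retrieval@5={rec5:.2f}")
```

Under these definitions Hit@K is always at least as high as Retrieval@K, which matches the ordering of the scores in the table.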

AST chunking vs. line-based chunking

| Chunker | Hit@5 | Retrieval@5 |
| --- | --- | --- |
| Line-based (baseline) | 67.5 % | 57.9 % |
| AST-aware (tree-sitter) | 85.0 % | 72.1 % |
| Delta | +17.5 pp | +14.2 pp |
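
To show why boundary-aware chunks help, here is a minimal sketch that contrasts the two strategies on a tiny Python file. It uses the standard-library `ast` module as a stand-in for tree-sitter, so it is not CodeProbe's chunker; the `Chunk` type, the function names, and the sample source are all hypothetical.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    kind: str
    name: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def ast_chunks(source: str) -> List[Chunk]:
    """Emit one chunk per top-level function or class.

    Other module-level statements (imports, constants) are skipped in this sketch.
    """
    lines = source.splitlines()
    chunks: List[Chunk] = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            # Start at the first decorator, if any, so it stays with its definition.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # available on Python 3.8+
            text = "\n".join(lines[start - 1:end])
            chunks.append(Chunk(kind, node.name, start, end, text))
    return chunks


def line_chunks(source: str, size: int) -> List[str]:
    """Baseline: fixed-size windows of `size` lines, ignoring syntax."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


SAMPLE = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Cache:
    def get(self, key):
        return None
'''

for chunk in ast_chunks(SAMPLE):
    print(chunk.kind, chunk.name, chunk.start_line, chunk.end_line)
```

With a four-line window, the baseline cuts `load` before its `return` statement, while the AST-based split keeps each definition whole.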

Generation quality (7-question sample, LLM-as-judge)

| Metric | Score |
| --- | --- |
| Faithfulness | 0.96 / 1.0 |
| Relevance | 0.96 / 1.0 |

Latency

| Operation | P50 |
| --- | --- |
| Search (embed + retrieve + rerank) | ~2.4 s |
| Full ask (search + Claude generate) | ~5–8 s |
| Ingest (6,392 chunks / 846 files) | ~9.5 min |

Tech stack & rationale

| Component | Choice | Why |
| --- | --- | --- |
| Embeddings | `all-MiniLM-L6-v2` (local) | Fast, no API cost, good semantic recall on code |
| Vector DB | Qdrant | HNSW index, payload filters, Docker-friendly |
| Keyword search | BM25 (`rank_bm25`) | Exact identifier matching that dense search misses |
| Fusion | Reciprocal Rank Fusion (k=60) | Needs no score calibration (one constant, k); robust across score scales |
| Reranker | `ms-marco-MiniLM-L-6-v2` cross-encoder | Re-scores top-50 candidates; +14 pp Hit@5 over RRF alone |
| AST chunking | tree-sitter (Python + TypeScript) | Function/class boundaries beat arbitrary line splits |
| LLM | Claude Haiku 4.5 (Anthropic) | Fast and cheap per token; can be swapped for Gemini or Ollama via `config.yaml` |
| API | FastAPI | Async, automatic OpenAPI docs, Pydantic validation |
| Frontend | React + Vite + Tailwind | Minimal; no framework lock-in |

Why hybrid retrieval?

Dense vector search excels at semantic similarity ("how does X work?") but struggles with exact identifier matches ("find send_message_backend"). BM25 is the inverse: strong recall for exact tokens, poor semantic generalisation.

Reciprocal Rank Fusion merges the two ranked lists without needing calibrated scores: each document's fused score is Σ 1/(k + rank_i) across retrievers. In practice this recovers candidates that either retriever misses alone, and the cross-encoder reranker then applies a richer relevance signal to the merged top-50. The result: +17.5 pp Hit@5 over a pure dense-only baseline.
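
The fusion formula above can be sketched in a few lines. This is an illustration and not CodeProbe's own code: the function name, the chunk ids, and the use of 1-based ranks are assumptions; only the formula and k=60 come from this README.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document ids with score = sum of 1 / (k + rank).

    Ranks are 1-based (an assumption). A document missing from a list
    contributes nothing for that retriever.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from the two retrievers.
dense = ["chunk_a", "chunk_b", "chunk_c"]
bm25 = ["chunk_d", "chunk_a", "chunk_b"]

fused = reciprocal_rank_fusion([dense, bm25])
for doc_id, score in fused:
    print(f"{doc_id}\t{score:.5f}")
```

Only ranks enter the sum, so cosine similarities and BM25 scores never need to be put on a common scale. Chunks that both retrievers return (`chunk_a`, `chunk_b`) rise above chunks that only one returns, even when that one ranks them first.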

Limitations & future work

Current limitations

Recall vs. ranking gap: Retrieval@10 (80.8 %) is the ceiling for reranking. Questions whose relevant chunks fall outside the candidate pool can never be answered correctly regardless of reranker quality; expanding the pool before reranking (beyond the current top-50, e.g. to top-100) would push this ceiling up.
CPU cross-encoder latency floor: The reranker runs on CPU; reranking 50 candidates takes ~1–2 s. A GPU or a lighter model (e.g. ms-marco-TinyBERT) would cut this significantly.
Language coverage: AST chunking is currently Python + TypeScript only. Go, Rust, Java, and C fall back to the line-based chunker, reducing chunk quality for polyglot repos.
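
The first limitation can be illustrated with a small sketch: a reranker only reorders the candidates it is given, so a relevant chunk outside the pool is lost whatever the scorer does. The `overlap_score` function is a toy stand-in for a cross-encoder, and all names and documents here are hypothetical.

```python
from typing import Callable, List, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
    pool_size: int,
    top_n: int,
) -> List[str]:
    """Re-score the first `pool_size` candidates and keep the best `top_n`."""
    pool = list(candidates[:pool_size])
    pool.sort(key=lambda doc: score(query, doc), reverse=True)
    return pool[:top_n]


def overlap_score(query: str, doc: str) -> float:
    """Toy relevance signal: share of query tokens that occur in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0


# Hypothetical fused list in which the relevant chunk sits at position 4.
fused = [
    "render markdown to html",
    "send email digest",
    "queue worker retry",
    "cache function result by arguments",
]
query = "cache function result"

small = rerank(query, fused, overlap_score, pool_size=3, top_n=2)
large = rerank(query, fused, overlap_score, pool_size=4, top_n=2)
print(small)
print(large)
```

With `pool_size=3` the relevant chunk never reaches the scorer; with `pool_size=4` it is promoted to the top. A wider pool raises the ceiling at the cost of more reranker calls, which is where the CPU latency limitation comes in.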

v2 roadmap

Query expansion: generate 3–5 hypothetical code snippets (HyDE) or sub-queries before retrieval to improve recall on abstract questions.
Code-aware embeddings: swap all-MiniLM-L6-v2 for CodeBERT or UniXcoder to better capture code structure rather than prose semantics.
tree-sitter multi-language: add Go, Rust, TypeScript (deeper), and Java grammars to extend AST-level chunking to the full long tail of polyglot projects.
Incremental re-indexing: currently a full re-ingest is required on repo changes; a file-level SHA diff + partial Qdrant upsert would bring this from minutes to seconds.
Eval set expansion: grow from 40 to 80–100 questions for tighter confidence intervals on Retrieval@K and faithfulness.
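
As a rough illustration of the incremental re-indexing item, the sketch below plans which files to re-embed by comparing content hashes. It is a sketch under stated assumptions, not the planned implementation: `plan_reindex`, `IndexPlan`, and the in-memory dictionaries are hypothetical, and the Qdrant upsert and delete calls are deliberately left out.

```python
import hashlib
from typing import Dict, NamedTuple, Set


class IndexPlan(NamedTuple):
    upsert: Set[str]  # new or changed files: re-chunk, re-embed, upsert
    delete: Set[str]  # files removed from the repo: drop their chunks


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(previous: Dict[str, str], current_files: Dict[str, str]) -> IndexPlan:
    """Compare stored digests (path -> sha256) with current contents (path -> text)."""
    current = {path: sha256_text(body) for path, body in current_files.items()}
    upsert = {path for path, digest in current.items() if previous.get(path) != digest}
    delete = set(previous) - set(current)
    return IndexPlan(upsert=upsert, delete=delete)


# Hypothetical state: digests saved at the last ingest vs. the working tree now.
previous = {
    "a.py": sha256_text("x = 1\n"),
    "b.py": sha256_text("y = 2\n"),
    "old.py": sha256_text("pass\n"),
}
current_files = {"a.py": "x = 1\n", "b.py": "y = 3\n", "new.py": "z = 0\n"}

plan = plan_reindex(previous, current_files)
print(sorted(plan.upsert), sorted(plan.delete))
```

Unchanged files (`a.py` here) are skipped entirely. A changed file would also need its stale chunks removed before the new ones are upserted, since chunk boundaries can shift when the file is edited.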
