hanyucrocks/codeprobe

CodeProbe

Semantic code search engine with hybrid retrieval, AST-aware chunking, and RAG — built and benchmarked against Zulip's ~500 K-line codebase.

Demo

Search results — each chunk card shows its file path, line range, RRF-fused score, and source/language/chunk-type badges. The hybrid pipeline surfaces both semantically similar and exact-identifier matches; the cross-encoder reranker orders them by relevance.

Search results view showing hybrid source tags and rerank scores

Ask with citations — the RAG answer is grounded in the retrieved chunks. Every claim links back to a specific file:line range so you can jump directly to the source.

Ask view showing grounded answer with file:line citations

Architecture

flowchart LR
    subgraph Ingest
        A[Git repo] --> B[File discovery\n& SHA tracking]
        B --> C[AST chunker\ntree-sitter]
        C --> D[Embedder\nall-MiniLM-L6-v2]
        D --> E[(Qdrant\nvector index)]
        C --> F[(BM25\nin-memory index)]
    end

    subgraph Query
        G[User query] --> H[Embed query]
        H --> I[Vector search\nQdrant]
        G --> J[BM25 keyword\nsearch]
        I & J --> K[Reciprocal\nRank Fusion]
        K --> L[Cross-encoder\nreranker]
        L --> M[Top-N chunks]
        M --> N[LLM\nClaude Haiku 4.5]
        N --> O[Answer +\ncitations]
    end

    E --> I
    F --> J

Quick start

# 1 — spin up Qdrant + API + frontend
docker compose up -d

# 2 — ingest a repo (runs in ~9.5 min for Zulip)
make ingest REPO=data/repos/zulip

# 3 — ask a question
make ask Q="How do I cache a function result keyed by its arguments?"

# or open the UI
open http://localhost:5173

The API is available at http://localhost:8000; interactive docs at /docs.

Benchmark results

Evaluated on a 40-question Zulip question set (eval/questions.jsonl).

Retrieval

| Metric | Score |
| --- | --- |
| Hit@5 | 85.0 % |
| Retrieval@5 | 72.1 % |
| Retrieval@10 | 80.8 % |
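
The README does not spell out how these metrics are computed. A minimal sketch, assuming Hit@K is the share of questions with at least one relevant chunk in the top K and Retrieval@K is the mean fraction of relevant chunks found in the top K. The function names, chunk IDs, and the `runs` data are made up for illustration and are not taken from `eval/questions.jsonl`:

```python
from typing import Iterable, Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if any relevant chunk appears in the top k results, else 0.0."""
    return 1.0 if any(chunk in relevant for chunk in ranked[:k]) else 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return the fraction of relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical two-question run: (ranked chunk IDs, set of relevant chunk IDs).
runs = [
    (["a.py:1-20", "b.py:5-30", "c.py:1-9"], {"b.py:5-30", "z.py:1-4"}),
    (["d.py:1-8", "e.py:2-6"], {"q.py:1-3"}),
]

hit5 = mean(hit_at_k(ranked, rel, 5) for ranked, rel in runs)
recall5 = mean(recall_at_k(ranked, rel, 5) for ranked, rel in runs)
```

Under these definitions Retrieval@K can never exceed Hit@K, which is consistent with the scores reported above.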

AST chunking vs. line-based chunking

| Chunker | Hit@5 | R@5 |
| --- | --- | --- |
| Line-based (baseline) | 67.5 % | 57.9 % |
| AST-aware (tree-sitter) | 85.0 % | 72.1 % |
| Delta | +17.5 pp | +14.2 pp |
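
To make the difference between the two chunkers concrete, here is a rough sketch. CodeProbe uses tree-sitter; this example swaps in Python's standard-library `ast` module so it runs without extra dependencies, and it only handles top-level Python definitions. `Chunk`, `ast_chunks`, and `line_chunks` are hypothetical names, not the project's API:

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    kind: str        # "function" or "class"
    name: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str


def ast_chunks(source: str) -> List[Chunk]:
    """Emit one chunk per top-level function or class definition."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # imports, assignments, etc. are skipped in this sketch
        # Start at the first decorator, if any, so the chunk keeps it.
        start = min([node.lineno] + [d.lineno for d in node.decorator_list])
        end = node.end_lineno  # available on Python 3.8+
        text = "\n".join(lines[start - 1:end])
        chunks.append(Chunk(kind, node.name, start, end, text))
    return chunks


def line_chunks(source: str, size: int) -> List[str]:
    """Baseline: fixed-size line windows that ignore syntax."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


SAMPLE = '''import functools

@functools.lru_cache(maxsize=None)
def square(x):
    return x * x

class Greeter:
    def hello(self):
        return "hi"
'''

by_ast = ast_chunks(SAMPLE)
by_line = line_chunks(SAMPLE, 4)
```

With a window of four lines, the first line-based chunk ends on `def square(x):` and separates the signature from its body, while the AST-based chunks each hold one complete definition.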

Generation quality (7-question sample, LLM-as-judge)

| Metric | Score |
| --- | --- |
| Faithfulness | 0.96 / 1.0 |
| Relevance | 0.96 / 1.0 |

Latency

| Operation | P50 |
| --- | --- |
| Search (embed + retrieve + rerank) | ~2.4 s |
| Full ask (search + Claude generate) | ~5–8 s |
| Ingest (6,392 chunks / 846 files) | ~9.5 min |

Tech stack & rationale

| Component | Choice | Why |
| --- | --- | --- |
| Embeddings | all-MiniLM-L6-v2 (local) | Fast, no API cost, good semantic recall on code |
| Vector DB | Qdrant | HNSW index, payload filters, Docker-friendly |
| Keyword search | BM25 (rank_bm25) | Exact identifier matching that dense search misses |
| Fusion | Reciprocal Rank Fusion (k=60) | Needs no score calibration; robust across score scales |
| Reranker | ms-marco-MiniLM-L-6-v2 cross-encoder | Re-scores top-50 candidates; +14 pp Hit@5 over RRF alone |
| AST chunking | tree-sitter (Python + TypeScript) | Function/class boundaries beat arbitrary line splits |
| LLM | Claude Haiku 4.5 (Anthropic) | Fast and cheap per token; can be swapped for Gemini or Ollama via config.yaml |
| API | FastAPI | Async, automatic OpenAPI docs, Pydantic validation |
| Frontend | React + Vite + Tailwind | Minimal; no framework lock-in |

Why hybrid retrieval?

Dense vector search excels at semantic similarity ("how does X work?") but struggles with exact identifier matches ("find send_message_backend"). BM25 is the inverse: strong recall on exact tokens, poor semantic generalisation.
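
The keyword side of that trade-off can be illustrated with a small pure-Python Okapi BM25 scorer. This is a sketch for illustration only, not the `rank_bm25` implementation the project uses; the tokenizer, the `k1` and `b` defaults, and the three sample documents are assumptions:

```python
import math
import re
from collections import Counter
from typing import List

# Keep identifiers such as send_message_backend as single tokens.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(text: str) -> List[str]:
    return [t.lower() for t in TOKEN.findall(text)]


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


DOCS = [
    "def send_message_backend(request, user): ...",
    "def send_email(user): ...",
    "class MessageCache: pass",
]

exact = bm25_scores("send_message_backend", DOCS)
paraphrase = bm25_scores("how are messages sent", DOCS)
```

The exact identifier scores only against the first document, while the paraphrased query shares no token with any document and scores zero everywhere. That gap is what the dense retriever is there to cover.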

Reciprocal Rank Fusion merges the two ranked lists without needing calibrated scores: each document's fused score is Σ 1/(k + rank_i) across retrievers, where rank_i is the document's 1-based position in retriever i's list and k = 60. In practice this recovers candidates that either retriever misses alone, and the cross-encoder reranker then applies a richer relevance signal to the merged top-50. The result: +17.5 pp Hit@5 over a dense-only baseline.
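
The fusion formula is short enough to sketch in full. This is a minimal version assuming 1-based ranks and k = 60 as listed in the tech stack table; `rrf_fuse` and the sample chunk IDs are hypothetical, and ties are broken alphabetically only to keep the output deterministic:

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document missing from a list simply contributes nothing for it.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; equal scores fall back to ID order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["cache.py:10-40", "utils.py:1-25", "views.py:88-120"]
bm25_hits = ["views.py:88-120", "models.py:5-30", "cache.py:10-40"]

fused = rrf_fuse([dense_hits, bm25_hits])
```

Only ranks enter the sum, so the dense similarity scores and the BM25 scores never need to be put on a common scale. Chunks found by both retrievers rise above chunks found by only one.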

Limitations & future work

Current limitations

  • Recall vs. ranking gap: Retrieval@10 (80.8 %) is the ceiling for reranking. Questions that fall outside the top-10 candidates can never be answered correctly regardless of reranker quality — expanding the candidate pool (top-50 or top-100) before reranking would push this ceiling up.
  • CPU cross-encoder latency floor: The reranker runs on CPU; reranking 50 candidates takes ~1–2 s. A GPU or a lighter model (e.g. ms-marco-TinyBERT) would cut this significantly.
  • Language coverage: AST chunking is currently Python + TypeScript only. Go, Rust, Java, and C fall back to the line-based chunker, reducing chunk quality for polyglot repos.

v2 roadmap

  • Query expansion: generate 3–5 hypothetical code snippets (HyDE) or sub-queries before retrieval to improve recall on abstract questions.
  • Code-aware embeddings: swap all-MiniLM-L6-v2 for CodeBERT or UniXcoder to better capture code structure rather than prose semantics.
  • tree-sitter multi-language: add Go, Rust, TypeScript (deeper), and Java grammars to extend AST-level chunking to the full long tail of polyglot projects.
  • Incremental re-indexing: currently a full re-ingest is required on repo changes; a file-level SHA diff + partial Qdrant upsert would bring this from minutes to seconds.
  • Eval set expansion: grow from 40 to 80–100 questions for tighter confidence intervals on Retrieval@K and faithfulness.
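
The incremental re-indexing item above could start from a file-level hash comparison. A minimal sketch, assuming SHA-256 content hashes and a snapshot stored as a path-to-hash mapping; `snapshot` and `diff_snapshots` are hypothetical names, and the Qdrant upsert and delete calls that would consume the result are not shown:

```python
import hashlib
from pathlib import Path
from typing import Dict, Set, Tuple


def file_sha(path: Path) -> str:
    """Hash a file's bytes so any content change is detected."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(root: Path, suffixes: Tuple[str, ...] = (".py", ".ts")) -> Dict[str, str]:
    """Map each matching file (path relative to root) to its content hash."""
    return {
        str(p.relative_to(root)): file_sha(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix in suffixes
    }


def diff_snapshots(old: Dict[str, str], new: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Return (files to re-chunk and upsert, files whose chunks should be deleted)."""
    changed = {path for path, sha in new.items() if old.get(path) != sha}
    removed = set(old) - set(new)
    return changed, removed


# Hypothetical snapshots from two ingest runs.
previous = {"a.py": "1", "b.py": "2", "c.py": "3"}
current = {"a.py": "1", "b.py": "9", "d.py": "4"}

changed, removed = diff_snapshots(previous, current)
```

Here `b.py` (modified) and `d.py` (new) would be re-chunked and re-embedded, the chunks for `c.py` would be deleted, and `a.py` would be left untouched.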
