Code Search App

A Flask-based web application for indexing, searching, and exploring Python codebases stored in ChromaDB. The app ingests a source directory, splits the code into semantic chunks using Abstract Syntax Trees and token-bounded splitting, and provides a browser UI for semantic and regex search over the resulting collection.

Overview

When you point the app at a source directory, it walks every supported file, parses Python files using tree-sitter to extract function and class definitions as atomic units, and stores each chunk alongside rich metadata (file path, line range, symbol name, chunk type) in a ChromaDB persistent collection. You can then search the indexed codebase using natural-language queries (semantic search via OpenAI embeddings) or regular expressions (full-text filter via ChromaDB's $regex operator).

This repository plays two roles in the course labs:

Teaching material: Students ingest this codebase into ChromaDB using the AST-based chunking pipeline they build in the labs. The well-structured Python code (models, services, routes, utils) makes it an ideal target for practicing chunking strategies.
Interactive tool: Once ingested, students launch this app to explore their collections, run searches, and see how their chunking and metadata decisions affect retrieval quality.

Features

Semantic search — Natural language queries over code using OpenAI embeddings (text-embedding-3-small)
Regex search — Structural pattern matching across the codebase with analysis and explanation
Collection explorer — Paginated chunk browser with filters by file path, chunk type, and symbol name
Code statistics — Construct detection, size distributions, and symbol rankings
Embedding visualizer — 2-D projection of chunk embeddings to explore clustering
Smart suggestions — Context-aware query suggestions based on collection metadata
Query history and bookmarks — Persistent search history with color-coded bookmarks
Interactive tutorials — Guided tours with spotlight overlays for onboarding

How Search Works

The app searches the indexed code in two complementary ways.

Semantic search ranks code chunks by how close they are to your query in meaning, using the same embedding model that indexed them. It finds relevant code even when your wording does not match the wording in the code.
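As a rough illustration of ranking by meaning, the sketch below scores toy three-dimensional vectors with cosine similarity and keeps the best matches. It is a minimal sketch under assumptions: the app delegates this work to ChromaDB and OpenAI embeddings, the distance metric it actually uses is not stated here, and the names `cosine_similarity`, `rank_chunks`, and the chunk identifiers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs, top_k=10):
    """Return the top_k (chunk_id, score) pairs, most similar first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for real model output (hypothetical values).
chunk_vecs = {
    "auth.py:login": [0.9, 0.1, 0.0],
    "db.py:connect": [0.1, 0.9, 0.0],
    "utils.py:slugify": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

ranking = rank_chunks(query_vec, chunk_vecs, top_k=2)
for chunk_id, score in ranking:
    print(f"{chunk_id}: {score:.3f}")
```

Because the query and the chunks are embedded by the same model, their vectors live in the same space, which is what makes a direct comparison like this meaningful.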

Regex search matches an exact pattern across the chunks. It is the right tool for a specific identifier, a function call, or a structural pattern that semantic search would only approximate.
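The idea behind regex search can be sketched with Python's built-in `re` module over an in-memory list of chunks. This is a stand-in for illustration only: the app itself filters through ChromaDB's `$regex` operator, and the `regex_search` helper, the chunk dictionaries, and the `max_results` parameter below are hypothetical.

```python
import re


def regex_search(chunks, pattern, max_results=100):
    """Return chunks whose text matches the pattern, stopping at max_results."""
    compiled = re.compile(pattern)
    matches = []
    for chunk in chunks:
        if compiled.search(chunk["text"]):
            matches.append(chunk)
            if len(matches) >= max_results:
                break  # cap reached; remaining chunks are not examined
    return matches


# Hypothetical chunks; a real collection would hold these in ChromaDB.
chunks = [
    {"symbol": "load_config", "text": "def load_config(path):\n    return json.load(open(path))"},
    {"symbol": "save_config", "text": "def save_config(path, data):\n    json.dump(data, open(path, 'w'))"},
    {"symbol": "slugify", "text": "def slugify(value):\n    return value.lower()"},
]

hits = regex_search(chunks, r"json\.(load|dump)\(")
print([hit["symbol"] for hit in hits])
```

A pattern such as `json\.(load|dump)\(` finds exact call sites, which a semantic query like "reads or writes JSON" would only approximate.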

Searching the collection

A semantic search returns the top 10 matches by default, and you can request up to 50. A query must be between 2 and 500 characters long.
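The limits above could be enforced with a small validation helper along these lines. This is a sketch under assumptions: the function name `validate_search_request` is hypothetical, and whether the app trims whitespace or rejects out-of-range values (rather than clamping them) is not stated in this README.

```python
DEFAULT_RESULTS = 10
MAX_RESULTS = 50
MIN_QUERY_LENGTH = 2
MAX_QUERY_LENGTH = 500


def validate_search_request(query, n_results=None):
    """Return a cleaned (query, n_results) pair or raise ValueError."""
    query = query.strip()  # assumption: surrounding whitespace does not count
    if not (MIN_QUERY_LENGTH <= len(query) <= MAX_QUERY_LENGTH):
        raise ValueError(
            f"query must be {MIN_QUERY_LENGTH}-{MAX_QUERY_LENGTH} characters"
        )
    if n_results is None:
        n_results = DEFAULT_RESULTS
    if not (1 <= n_results <= MAX_RESULTS):
        raise ValueError(f"n_results must be between 1 and {MAX_RESULTS}")
    return query, n_results


print(validate_search_request("parse config file"))
```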

Every result carries its chunk's metadata — the file path, the line range, and the symbol name — so you can trace a match back to its place in the source.

How Ingestion Works

Before anything can be searched, the code has to be ingested into a ChromaDB collection. The pipeline walks the repository, reads each Python file, and splits it into chunks along its structure rather than by raw length.

It uses tree-sitter to parse each file into an abstract syntax tree, then takes whole functions and classes as atomic chunks. Code that falls between definitions, such as imports and top-level statements, becomes its own "gap" chunk, so nothing in the file is dropped. A chunk that would exceed the configured limit of 1000 tokens per chunk is split further along line boundaries.
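The structure-aware splitting described above can be sketched with Python's built-in `ast` module, swapped in here for tree-sitter so the example runs without extra dependencies. The function name `chunk_source` and the dictionary keys are hypothetical, and the sketch only looks at top-level definitions.

```python
import ast


def chunk_source(source, file_path):
    """Split Python source into function, class, and gap chunks by line range."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    next_line = 1  # first 1-indexed line not yet assigned to a chunk

    def add_chunk(start, end, symbol, chunk_type):
        text = "\n".join(lines[start - 1:end])
        if text.strip():  # skip ranges that contain only blank lines
            chunks.append({
                "file_path": file_path,
                "start_line": start,
                "end_line": end,
                "symbol": symbol,
                "chunk_type": chunk_type,
                "text": text,
            })

    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue  # imports and other statements fall into gap chunks
        # Decorators sit above the def/class line, so start from the first one.
        start = min([node.lineno] + [d.lineno for d in node.decorator_list])
        if start > next_line:
            add_chunk(next_line, start - 1, None, "gap")
        chunk_type = "class" if isinstance(node, ast.ClassDef) else "function"
        add_chunk(start, node.end_lineno, node.name, chunk_type)
        next_line = node.end_lineno + 1

    if next_line <= len(lines):
        add_chunk(next_line, len(lines), None, "gap")  # trailing code after the last definition
    return chunks


sample = '''import os

def greet(name):
    return f"hello {name}"

class Greeter:
    def run(self):
        return greet("world")
'''

chunks = chunk_source(sample, "example.py")
for chunk in chunks:
    print(chunk["chunk_type"], chunk["symbol"], chunk["start_line"], chunk["end_line"])
```

Working from line ranges rather than AST nodes alone means comments between definitions end up in a gap chunk too, even though the parser itself discards them.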

Each chunk is stored with metadata: its file path, its start and end line, its symbol name, and its chunk type (function, class, module, or gap). Only Python files are ingested.
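When an oversized chunk is split along line boundaries, each piece needs its own start and end line so the metadata stays accurate. The sketch below shows one way to do that. It is illustrative only: `split_by_lines` is a hypothetical helper, and `count_tokens` is a crude whitespace-based stand-in for the embedding model's real tokenizer.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_by_lines(text, start_line, max_tokens=1000):
    """Split text into pieces of at most max_tokens, breaking only between lines.

    A single line that alone exceeds max_tokens is kept whole as its own piece.
    """
    pieces = []
    current = []
    current_tokens = 0
    current_start = start_line
    for offset, line in enumerate(text.splitlines()):
        line_tokens = count_tokens(line)
        if current and current_tokens + line_tokens > max_tokens:
            # Close the current piece just before this line.
            pieces.append({
                "text": "\n".join(current),
                "start_line": current_start,
                "end_line": start_line + offset - 1,
            })
            current, current_tokens = [], 0
            current_start = start_line + offset
        current.append(line)
        current_tokens += line_tokens
    if current:
        pieces.append({
            "text": "\n".join(current),
            "start_line": current_start,
            "end_line": current_start + len(current) - 1,
        })
    return pieces


# A four-line chunk that starts at line 10 of its file, split at 5 tokens per piece.
pieces = split_by_lines("a b c\nd e\nf g h i\nj", start_line=10, max_tokens=5)
for piece in pieces:
    print(piece["start_line"], piece["end_line"], repr(piece["text"]))
```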

Ingestion progress is streamed back as server-sent events. Once ingestion completes, the collection is searchable immediately.

Setup

Install the dependencies from the app directory:

cd app
pip install -r requirements.txt

Project Structure

app/
├── app.py                  # Flask application factory and entry point
├── config.py               # Dataclass-based configuration (env vars, defaults)
├── requirements.txt        # Python dependencies
├── .env.example            # Environment variable template
├── ARCHITECTURE.md         # System architecture and design overview
├── CHANGELOG.md            # Version history and release notes
├── CONTRIBUTING.md         # Contribution guidelines and conventions
│
├── models/                 # Data models
│   ├── chunk.py            # Chunk, ChunkMetadata, ChunkType
│   ├── search_result.py    # SearchResult, SearchResultSet, ResultFormatter
│   └── query_history.py    # QueryRecord, Bookmark, HistoryManager
│
├── routes/                 # Flask blueprints (one per feature)
│   ├── search.py           # Semantic and regex search endpoints
│   ├── collections.py      # Collection CRUD and ingestion triggers
│   ├── explorer.py         # Paginated chunk browsing with filters
│   ├── similarity.py       # Pairwise similarity matrix computation
│   ├── history.py          # Query history and bookmarks API
│   ├── regex_tester.py     # Regex testing and analysis
│   ├── suggestions.py      # Smart query suggestions
│   ├── statistics.py       # Code metrics and analytics
│   ├── visualizer.py       # 2-D embedding visualization
│   ├── tutorial.py         # Interactive guided tours
│   ├── diff.py             # Collection diff endpoints
│   └── export.py           # Collection export endpoints (CSV/JSON)
│
├── services/               # Business logic layer
│   ├── chroma_client.py    # ChromaDB connection manager (singleton)
│   ├── search_service.py   # Search strategies (semantic + regex)
│   ├── collection_service.py   # Collection management and stats
│   ├── ingestion_service.py    # AST parsing and code chunking pipeline
│   ├── similarity_service.py   # Vector similarity computations
│   ├── statistics_service.py   # Code metrics and analysis
│   ├── visualization_service.py # Dimensionality reduction for the 2-D view
│   ├── suggestion_service.py   # Multi-strategy suggestion generator
│   ├── tutorial_service.py     # Tutorial builder and manager
│   ├── diff_service.py         # Collection comparison logic
│   └── export_service.py       # Collection export logic (CSV/JSON)
│
├── utils/                  # Utilities and helpers
│   ├── validators.py       # Input validation (queries, paths, regex)
│   ├── regex_engine.py     # Regex analysis and human-readable explanation
│   ├── code_parser.py      # Lightweight regex-based Python parser
│   ├── text_splitter.py    # Token-based text splitting
│   └── formatters.py       # Display formatting (scores, code, paths)
│
├── templates/              # Jinja2 HTML templates
│   ├── base.html           # Base layout with navbar and tutorial engine
│   ├── index.html          # Dashboard (collection cards)
│   ├── search.html         # Search interface
│   ├── explorer.html       # Chunk browser
│   └── collection.html     # Collection detail page
│
├── static/
│   └── css/style.css       # Custom styles
│
├── docs/                   # Documentation
│   ├── api-reference.md    # Full API endpoint reference
│   ├── configuration.md    # Configuration reference
│   ├── development.md      # Local development guide
│   └── troubleshooting.md  # Common issues and fixes
│
└── notes/                  # Developer notes
    ├── design-decisions.txt  # Design decision log
    └── performance-notes.txt # Performance observations and notes

The app starts on http://localhost:5000 by default. Navigate to that URL in a browser to access the UI.

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `OPENAI_API_KEY` | Yes | Used to generate embeddings via `text-embedding-3-small` |
| `CHROMA_PERSIST_DIR` | No | Directory for ChromaDB storage (default: `./chroma_data`) |
| `FLASK_ENV` | No | Set to `development` for debug mode (default: `development`) |
| `SECRET_KEY` | No | Flask session secret (use a random string in production) |

Configuration

The app reads its settings from config.py, overridable by environment variables. The defaults that matter most:

| Setting | Default | Meaning |
| --- | --- | --- |
| Embedding model | `text-embedding-3-small` | Model used to embed code and queries |
| Default results | 10 | Matches returned per search |
| Maximum results | 50 | Upper limit on matches per search |
| Query length | 2–500 characters | Accepted query size |
| Max tokens per chunk | 1000 | A larger chunk is split along line boundaries |
| Regex result cap | 100 | Maximum matches a regex search returns |
| Regex timeout | 5 seconds | A regex search is abandoned after this long |
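A dataclass-based configuration with these defaults and environment overrides might look roughly like the sketch below. The class name `AppConfig`, its field names, and the `from_env` method are assumptions for illustration; the actual contents of config.py are not shown in this README. Only the environment variable names and default values come from the tables above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical settings container mirroring the defaults listed above."""
    openai_api_key: str
    chroma_persist_dir: str = "./chroma_data"
    flask_env: str = "development"
    embedding_model: str = "text-embedding-3-small"
    default_results: int = 10
    max_results: int = 50
    min_query_length: int = 2
    max_query_length: int = 500
    max_tokens_per_chunk: int = 1000
    regex_result_cap: int = 100
    regex_timeout_seconds: float = 5.0

    @classmethod
    def from_env(cls, env=None):
        """Build a config from a mapping of environment variables."""
        env = os.environ if env is None else env
        api_key = env.get("OPENAI_API_KEY")
        if not api_key:
            raise RuntimeError("OPENAI_API_KEY is required")
        return cls(
            openai_api_key=api_key,
            chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./chroma_data"),
            flask_env=env.get("FLASK_ENV", "development"),
        )


# Passing a plain dict keeps the example independent of the real environment.
config = AppConfig.from_env({"OPENAI_API_KEY": "test-key", "CHROMA_PERSIST_DIR": "/tmp/chroma"})
print(config.chroma_persist_dir, config.max_results)
```

Accepting the environment as an argument is a design choice made for this sketch: it lets the loader be exercised in tests without touching `os.environ`.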

Limitations

A few constraints are worth knowing before you rely on the app:

Python only. Ingestion handles .py files; other languages are skipped.
Regex search is bounded. It returns at most 100 matches and times out after 5 seconds, so a very broad pattern may not surface everything.
The embedding view is a projection. The 2-D visualizer reduces high-dimensional embeddings down to two dimensions, which is useful for spotting clusters but loses detail.
