A relational database engine built from scratch in Python — storage, indexes, SQL, a cost-based optimizer, and crash recovery.
QueryX is a miniature but architecturally faithful relational database engine, built from first principles to expose the internals that production databases (PostgreSQL, SQLite) hide behind mature abstractions: page-based storage, B+ tree and hash indexing, SQL parsing, volcano-model execution, cost-based query optimization, and write-ahead logging with crash recovery. The engine uses only the Python standard library — no ORM, no database libraries.
Status: complete (Phases 1–9), 337 passing tests. A working SQL database with paged storage, B+ tree/hash indexes, a cost-based optimizer with
EXPLAIN, WAL crash recovery, a charted benchmark suite, plusGROUP BY/HAVING, a two-tableINNER JOIN, and a workload-driven index advisor.
A SQL string falls downward through a stack of layers, getting more concrete at each step; dependencies only ever point down (an upper layer may use a lower one, never the reverse). A write-ahead log guards every write.
flowchart TD
SQL["SQL text"] --> LEX["sql/ — lexer + recursive-descent parser → AST"]
LEX --> OPT["planner/ — cost-based optimizer + EXPLAIN"]
OPT --> EXE["execution/ — volcano operators (open / next / close)"]
EXE --> IDX["index/ — B+ tree · static hash index"]
EXE --> STO["storage/ — 4KB pages · LRU buffer pool · heap files"]
IDX --> STO
CAT["catalog — tables / columns / indexes"] -. metadata .-> OPT
WAL["wal/ — write-ahead log + crash recovery"] -. guards every write .-> STO
| Layer | Package | Responsibility |
|---|---|---|
| SQL | queryx/sql/ |
Tokenize and parse SQL text into an AST |
| Planner | queryx/planner/ |
Cost-based optimization; choose the cheaper plan; EXPLAIN |
| Execution | queryx/execution/ |
Run the plan via the volcano (iterator) model |
| Index | queryx/index/ |
B+ tree and hash index: key → row location |
| Storage | queryx/storage/ |
4KB pages, pager, LRU write-back buffer pool, heap files |
| WAL | queryx/wal/ |
Write-ahead log + crash recovery (redo) |
| Catalog | queryx/catalog.py |
System catalog: tables, columns, indexes |
| Facade | queryx/database.py |
Wires the pipeline; the public db.execute(...) API |
flowchart LR
Q["SELECT name FROM users WHERE age > 30"] --> T["tokens"]
T --> A["AST"]
A --> P["plan: SeqScan vs IndexScan, chosen by cost"]
P --> O["operators pull rows: Projection ← Filter ← Scan"]
O --> R["result rows"]
| Component | Status |
|---|---|
| Storage — 4KB slotted pages, pager, LRU write-back buffer pool, heap files | ✅ |
| Indexes — disk-backed B+ tree (ordered range scans), static hash index | ✅ |
| SQL — hand-written lexer + recursive-descent parser → typed AST (BNF) | ✅ |
| Execution — volcano operators: scan, filter, project, sort, limit, distinct | ✅ |
Aggregates — COUNT/SUM/AVG/MIN/MAX, scalar and GROUP BY / HAVING; ORDER BY an aggregate (top-N groups) |
✅ |
Joins — two-table INNER JOIN (nested-loop + index-nested-loop) |
✅ |
Optimizer — cost-based SeqScan vs IndexScan, with EXPLAIN |
✅ |
| Durability — write-ahead log + redo crash recovery | ✅ |
Adaptive indexing — workload advisor (.recommend / .apply) |
✅ |
| Tests | 337 passing |
Requires Python 3.11+.
pip install -e ".[test]" # install (stdlib-only engine; pytest for the tests)
python -m queryx mydb # open an interactive SQL shell on the "mydb" database
pytest # run the full test suite (337 tests)from queryx.database import Database
db = Database("mydb") # a directory holding the catalog + table/index files
db.execute("CREATE TABLE users (id INT, name TEXT, age INT)")
db.execute("INSERT INTO users VALUES (1, 'alice', 30), (2, 'bob', 25)")
result = db.execute("SELECT name, age FROM users WHERE age >= 30 ORDER BY name")
print(result.columns) # ['name', 'age']
print(result.rows) # [('alice', 30)]
db.execute("SELECT COUNT(*), AVG(age) FROM users").rows # [(2, 27.5)]
db.close() # data persists to disk; reopen Database("mydb") to read it backqueryx> SELECT name, age FROM users WHERE age >= 30 ORDER BY name;
+-------+-----+
| name | age |
+-------+-----+
| alice | 30 |
| carol | 30 |
+-------+-----+
(2 rows) [0.2 ms]
queryx> .stats
buffer pool
hit ratio : 100.0% (5 hits / 0 misses)
storage
data pages : 1 index pages: 0
Meta-commands: .help, .tables, .schema, .indexes, .stats (buffer-pool
hit ratio), .pages (on-disk slotted-page layout), .tree (a B+ tree index as
static ASCII), .recommend / .apply (index advisor), .quit. A paste-ready
tour of every feature is in examples/sample_queries.sql.
Two single-command, narrated demos (no setup):
python examples/crash_recovery_demo.py # WAL redo recovery after a crash
python examples/index_vs_seqscan_demo.py # EXPLAIN + timing: index vs full scan-
Crash recovery — inserts rows, simulates a crash with no checkpoint, zeroes the data page on disk, reopens, and replays the WAL to recover the rows (byte-level before/after proof). Runs identically every time. Recovery succeeds because the corruption happens before a checkpoint — this demonstrates the WAL's redo guarantee, not protection against arbitrary disk faults (after a checkpoint the log is truncated and the page image is gone).
-
Index vs sequential scan — on one query,
EXPLAINshows aSeqScanwith no index, then anIndexScanafterCREATE INDEX, with the measured speedup.
Rendered from the demo's real output by examples/render_crash_recovery_gif.py (reproducible — no screen capture).
In-process microbenchmarks at N = 20,000 rows (warm buffer pool, no per-op
fsync) — they show relative algorithmic behaviour, not production latencies,
and shift run-to-run. The numbers below match the committed
benchmark report; regenerate with
python benchmarks/benchmark_suite.py.
| operation (ops/s) | heap / seqscan | B+ tree | hash |
|---|---|---|---|
| insert | 33,013 | 7,759 | 37,306 |
| point lookup | 77 | 8,459 | 21,648 |
| range scan (500-key window) | 34 | 1,350 | — |
- A hash point lookup is ~280× a sequential scan at this size; the B+ tree trails hash on point lookups but is the only index that range-scans.
- WAL overhead: page writes are ~2.7× slower with logging on (70,113 → 25,606 ops/s) — the price of crash durability.
An architectural comparison — how QueryX's design mirrors, and deliberately simplifies, production engines. This is not a feature- or performance-parity claim; QueryX implements the core mechanisms on a small, focused scope.
| Concept | PostgreSQL | SQLite | QueryX (this project) |
|---|---|---|---|
| Page size | 8 KB | 4 KB (default) | 4 KB |
| Record layout | slotted (item pointers) | B-tree cells | slotted pages |
| Buffer management | shared buffers, clock-sweep | page cache | LRU, write-back |
| Default index | B-tree | B-tree | B+ tree (plus a static hash index) |
| Optimizer | cost-based; histograms, MCV lists, join enumeration | cost-based | cost-based, single-table SeqScan vs IndexScan; 1/n_distinct + magic-number selectivity |
| Durability | WAL + full-page images (ARIES-style redo/undo) | rollback / WAL journal | WAL, full-page-image redo replay (no undo) |
| Transactions | full ACID, MVCC | ACID | none — durability only; no isolation or atomic multi-statement transactions |
| System catalog | pg_catalog (in tables) |
sqlite_master |
JSON file |
Where QueryX simplifies, it does so on purpose — see the failure analysis and "future work" rationale in DESIGN.md.
A focused, fully integrated subset — not "all SQL". Every supported command runs through the real pipeline (parser → optimizer → executor → indexes → storage), never string-matched.
- Supported:
CREATE TABLE,DROP TABLE,INSERT(multi-row),SELECT [DISTINCT] cols FROM t [JOIN t2 ON ...] WHERE <predicate> [GROUP BY ... HAVING ...] [ORDER BY col | aggregate] [LIMIT],UPDATE,DELETE,CREATE INDEX,DROP INDEX,EXPLAIN; comparisons= != <> < > <= >=combined withAND OR NOT; aggregatesCOUNT(*) SUM AVG MIN MAXwith or withoutGROUP BY; a two-tableINNER JOIN. - Deferred (see Future Work): subqueries,
LIKE,IN, joins beyond two tables, foreign keys, views, full three-valuedNULLlogic, andBEGIN/COMMIT/ROLLBACKtransactions.
- DESIGN.md — architecture, BNF grammar, per-phase design decisions, complexity, Postgres/SQLite comparisons, and failure analysis.
- benchmarks/REPORT.md — the benchmark report.
- RESUME.md — resume bullet points.
These are scope decisions, not oversights — each is a substantial subsystem that would add tooling breadth without deepening the core internals this project exists to demonstrate. Knowing why they are deferred is part of the design:
- Transactions, MVCC, locking, isolation — QueryX provides durability (WAL redo) but not atomic multi-statement transactions or concurrency control. This is the single biggest simplification.
- Richer SQL — subqueries,
LIKE/IN, joins beyond two tables, foreign keys, views, full three-valuedNULLlogic. - Optimizer depth — histograms / correlation statistics, join-order
enumeration (only single-table access-path selection today). Range predicates
currently fall back to
SeqScan— the fixed1/3range-selectivity estimate makes a scan look cheaper than the index; a histogram would let the optimizer pick the B+ tree's range scan. - Storage/ops — page compaction, columnar storage, compression, parallel execution, plan caching, monitoring dashboards.
MIT.

