A Rust graph database implementing ISO GQL, with single-file storage inspired by SQLite.
Live playground · Website · Query language · Data modification · Storage format
Run GQL in your browser — the website loads the real frogql-wasm engine and a sample movies graph, no install required.
froGQL is a research prototype that grew into a working database. Its differentiator is a typechecker grounded in a gradual type system for GQL: it detects queries that cannot return results because of type mismatches, before they run. We combine hand-written research code with AI-assisted engineering, and we are explicit about the boundary between the two.
The path so far:
-
Theory first. The project grew out of our paper on a gradual type system for GQL, which contributes the formalization and the soundness theorems (we have a background in gradual typing and databases). The central result is Emptiness: statically detecting queries that are guaranteed to return nothing because of type mismatches.
-
A hand-written prototype. We implemented the typechecker and the runtime semantics of the formal model in Python, by hand, without AI. This is the reference everything else is checked against.
-
The question worth testing. Does the typechecker earn its cost? That is, is typecheck + run worthwhile against blindly running a query only to discover it returns nothing? Answering this needed performance the Python prototype could not provide.
-
Port to Rust, with AI. We rewrote the engine in Rust with AI assistance, taking inspiration from SQLite's single-file storage architecture.
-
Worst-case-optimal joins. Joins were slow. Colleagues who work on databases pointed us to the CompactLTJ paper (Arroyuelo et al., VLDB Journal 2025) and its C++ reference implementation. We implemented the Leapfrog Triejoin algorithm with generative AI, keeping the test suite green throughout.
-
LDBC and language growth. To run the LDBC Social Network Benchmark we extended the language with more features, including our centerpiece, the typechecker. We did this carefully to avoid breaking the theorems we had proved formally.
-
Where AI stopped being enough. The benchmarks were strong on some fronts and weak on others. Closing the gaps took design decisions and research that generative AI could not deliver on its own: LTJ interacts poorly with GQL features such as
UNION, and even with plainORDER BY. We grow the system slowly, debating design strategies and testing several before each new feature lands.
Benchmark charts from the LDBC Social Network Benchmark are coming here as the numbers stabilize.
Papers
- Flexible and Expressive Typed Path Patterns for GQL — the gradual type system, formalization, and the Emptiness theorem behind froGQL's typechecker.
- CompactLTJ (Arroyuelo et al., VLDB Journal 2025) — the worst-case-optimal join algorithm froGQL uses for comma-joins and chains.
Python (PyPI):
pip install frogqlWheels ship for CPython 3.8+ on Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64).
CLI / library (build from source):
cargo build --release# Import a CSV dataset and open the REPL
./target/release/frogql movies.gdb --import-csv path/to/Spanner_Instance/
# Open an existing database
./target/release/frogql movies.gdb
# Skip the typechecker for the session
./target/release/frogql movies.gdb --no-typecheckgql> .schema
GRAPH TYPE DEFAULT (active)
Node types:
(:Movie {released INT, title STRING, votes INT, *})
(:Person {name STRING, *})
Edge types:
(:Person)-[:ACTED_IN {roles LIST<STRING>}]->(:Movie)
(:Person)-[:DIRECTED]->(:Movie)
(:Person)-[:REVIEWED {rating INT, summary STRING}]->(:Movie)
...
gql> MATCH (p: Person) -[:ACTED_IN]-> (m: Movie) WHERE m.released = 1999 RETURN p.name, m.title
p.name | m.title
-------+--------
"Keanu Reeves" | "The Matrix"
"Carrie-Anne Moss" | "The Matrix"
...
gql> MATCH (a: Person) -[:ACTED_IN]-> (m: Movie), (d: Person) -[:DIRECTED]-> (m) RETURN a.name, d.name
The REPL follows the SQLite convention: every meta-command starts with ..
| Command | Effect |
|---|---|
.schema |
alias for SHOW GRAPH TYPE DEFAULT (the auto-inferred schema) |
.schema simple |
grouped by-label renderer of the inferred schema |
.graph-types |
alias for SHOW GRAPH TYPES |
.indexes |
alias for SHOW INDEXES |
.save |
atomically persist the in-RAM mutations to the open .gdb (tmp+rename) |
.dump-json <path> |
pg_dump-style JSON snapshot of the merged graph |
.dump-gql <path> |
pg_dump-style GQL script that re-creates the graph |
.help |
list meta-commands and quick query syntax |
.quit / .exit |
exit (bare quit / exit also work) |
Opening a path that does not exist creates an empty database (sqlite3 convention):
./target/release/frogql /tmp/fresh.gdb
creating new database: /tmp/fresh.gdb
gql> INSERT (a:Person {name: 'Alice'})
gql> .saveThe examples/ directory contains ready-to-use .gdb files:
| Database | Nodes | Edges | Domain |
|---|---|---|---|
| movies.gdb | 171 | 253 | Movies (Person, Movie, ACTED_IN, DIRECTED, ...) |
| fraud_detection.gdb | 14,550 | 31,564 | Financial fraud (Transaction, Account, TRIGGERED_ALERT, ...) |
| bom.gdb | 1,500 | 2,248 | Bill of Materials (Product, Assembly, ConsistsOf, ...) |
Regenerate from CSV: ./target/release/frogql examples/movies.gdb --import-csv <csv_dir>
| Topic | One-liner | Details |
|---|---|---|
| Load data | CSV (Spanner / Text2GQL / Cypher), LDBC SNB, or JSON; DEFAULT graph type is inferred at import. |
docs/data-import.md |
| Query | MATCH, OPTIONAL MATCH, comma-joins, WHERE, RETURN, EXISTS, repetition {n,m}, label algebra. |
docs/query-language.md |
| Mutate | INSERT, SET, REMOVE, [DETACH] DELETE (ISO §13 MVP-1). In-RAM overlay until .save. |
docs/data-modification.md |
| Schemas | Persistent named graph types with CREATE / USE / DROP / SHOW / VALIDATE GRAPH TYPE; reserved DEFAULT is auto-inferred. |
docs/graph-types.md |
| Indexes | Auto-built hash + btree on every uniquely-keyed (label, prop) (rebuilt on open); explicit CREATE [HASH | BTREE] INDEX for the rest, persisted in the .gdb header chain so they survive close/reopen. |
docs/secondary-indexes.md |
froGQL uses a single .gdb file with 4KB pages:
- All node/edge IDs are
u32internally (no string overhead at query time) - Three storage backends: in-memory
Graph,LazyGraphStore(topology in RAM, labels/props from disk via LRU page cache),DiskGraphStore(minimal RAM) - The REPL uses
LazyGraphStorefor efficient memory usage on large graphs - The GRAPH TYPE catalog persists in its own page chain, so
CREATE / USE / DROP GRAPH TYPEsurvive close/reopen
See docs/internals/storage-architecture.md for the full format specification.
Comma-joins and chains of directed/undirected edges are executed with
Leapfrog Triejoin (LTJ), a worst-case-optimal multi-way join that
binds variables one at a time across all participating patterns
simultaneously, with no intermediate materialisation. Each directed edge
is modelled as a triple (src, label, tgt) indexed in six sorted
orderings. LTJ activates automatically when the pattern decomposes into
triples; non-decomposable shapes (any-direction edges, repetitions with
named variables, unbounded {n,}) fall back to pairwise hash-join.
Speedups on soc-LiveJournal1-100k (limit 1000) range from
14× (3-clique) to 4097× (4-path), with a 4-clique going from "hung"
to 43 ms. See docs/internals/JOIN_STRATEGY_NOTES.md
and CLAUDE.md for the algorithm in detail.
The compiler pipeline is parse → elaborate → typecheck → optimize → run.
The optimizer is reserved for performance-preserving rewrites; ISO
syntactic sugar lives in elaborate. Major passes:
- WHERE pushdown into descriptor
value_predsandvalue_filters(type ascriptions likeis T, value comparisons like<,<=,=). - Index-driven constant folding:
Eqpredicates that hit a hash index resolve to a single NodeId, get pre-bound in the result tuple, and excluded from the VEO.</<=/>/>=that hit a btree precompute the matching sorted set and become an O(log n) membership test. - Bounded
Repeatunrolling:(P){lb, ub}with single-edge inner andub - lb + 1 ≤ 4rewrites toUnion(P^lb, …, P^ub)and distributes the Union out of any surrounding Concat — each arm becomes a flat chain that decomposes natively into triples. - ORDER BY alias resolution:
ORDER BY <RETURN-alias>lowers to the underlyingAttrLookupso the runtime routes the sort through the pre-projection top-k heap (and the btree-driven path) instead of fully projecting every row before truncating. - OPTIONAL MATCH bind-pushdown (runtime-side, in
run_match_chain): for each outer row, the inner pattern runs as a small LTJ with the shared variables pre-pinned — SQLite's correlated nested-loop with index lookup, mapped onto the LTJ. Cuts a representative LDBC IS5 query from 52 s → 0.56 s on SF0.1 (~93×). - BTree-driven top-k: when the sort key has a btree (auto-built or
declared) the runtime walks the btree in primary-attr order, runs an
LTJ pin per id, and stops at
LIMIT k(or at the first cohort boundary pastkfor multi-spec ORDER BY). Multi-labelOrdescriptors (Comment | Post) drive the search via a k-way merge over per-label btrees. Drops LDBC IC2 on SF0.1 from 3.0 s → 0.086 s with the right indexes declared.
See docs/internals/implemented-optimizations.md
for benchmark numbers and code-level detail.
cargo build --release # build all binaries
# Strict clippy (run before every commit)
cargo clippy --workspace --all-targets -- -D clippy::all
# Lib + integration sweep (bench_test is excluded — pre-existing failures)
cargo test --lib
cargo test --test parser_test --test runtime_test --test store_runtime_test \
--test text2gql_test --test parse_and_run_test --test count_test \
--test null_test --test record_test --test list_test \
--test compile_diagnostics --test elaborate_test --test float_test \
--test graph_type_test --test typecheck_smoke --test typecheck_test \
--test optional_match_test --test multi_match_test \
--test aggregates_proptest --test lattice_proptest --test multi_match_proptest \
--test exists_fold_test --test exists_runtime_test \
--test parser_dm_test --test lazy_mut_test --test dm_runtime_test \
--test dm_persistence_test --test dm_schema_test --test dm_default_test \
--test dump_test --test dm_set_test --test dm_remove_test \
--test dm_label_test --test dm_delete_expr_test
# Single test
cargo test --test runtime_test test_join_star_any_label -- --exactOther binaries in src/bin/:
bench_queries— generic benchmark runnerbench_setup— downloads + extracts LDBC datasets (bench/data/is gitignored)ldbc_bench— LDBC interactive-complete driver, queries inbench/ldbc-queries/*.tomlinternal_bench— gqlite-only diagnostic bench (typechecker on/off, lazy/disk backend)convert_edgelist— edge-list format converter
User-facing:
docs/data-import.md— CSV / LDBC / JSON loadersdocs/query-language.md—MATCH,OPTIONAL MATCH,EXISTS, repetition, label algebradocs/data-modification.md— ISO §13 surface (INSERT,SET,REMOVE,DELETE), persistence, dumpsdocs/graph-types.md— Catalog DDL,DEFAULTlifecycle, validationdocs/secondary-indexes.md— Auto-build, declared indexes, LTJ wiring
Internals:
docs/internals/storage-architecture.md— File format, page layout, storage backends, catalog persistencedocs/internals/JOIN_STRATEGY_NOTES.md— LTJ algorithm, triple decomposition, benchmark numbersdocs/internals/implemented-optimizations.md— u32 IDs, hash-join, label index, early termination, predicate pushdowndocs/internals/possible-optimizations.md— Fixed-length cells, sorted adjacency, future LTJ workdocs/internals/iso-gql-gaps.md— Tracked deviations from ISO GQLdocs/internals/graph-type-catalog-plan.md— Catalog design rationaleCLAUDE.md— Internal architecture notes (compiler pipeline, typechecker, LTJ details)
