Skip to content

JinBa1/java-query-engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

cuckooDB

CI Coverage Dependencies

An in-memory relational query engine built on the Volcano/iterator model. Parses SQL via JSqlParser, builds an operator tree, and executes queries tuple-by-tuple against CSV data.

Architecture

SQL → JSqlParser → QueryPlanner → QueryPlanOptimizer → Operator Tree → Results

Core components:

Component Role
QueryPlanner Parses SQL and builds the operator pipeline
QueryPlanOptimizer Selection pushdown, trivial operator removal
DBCatalog Schema and table metadata (singleton)
Value Typed tuple values (sealed interface: IntValue, StringValue)
ExpressionEvaluator Evaluates WHERE/HAVING conditions per tuple
ExpressionPreprocessor Resolves column references to indices
ConditionSplitter Separates join predicates from selection predicates

Operator hierarchy (all extend Operator):

ScanOperatorSelectOperatorProjectOperatorJoinOperator / HashJoinOperatorSortOperatorAggregateOperatorDuplicateEliminationOperatorLimitOperator

Feature Matrix

Feature Status
SELECT * / projection ✅ Supported
WHERE predicates ✅ Supported
Inner joins (nested-loop) ✅ Supported
Hash join (auto-selected for equi-joins) ✅ Supported
ORDER BY ✅ Supported
GROUP BY + SUM, COUNT, AVG, MIN, MAX ✅ Supported
LIMIT n ✅ Supported
DISTINCT ✅ Supported
Nested arithmetic/comparison expressions ✅ Supported
Query optimisation (selection pushdown) ✅ Supported
Typed columns (int, string) ✅ Supported
CSV header support ✅ Supported
Query budgets (--max-tuples, --timeout-ms) ✅ Supported
EXPLAIN plan inspection ✅ Supported
Indexes ❌ Not supported
Transactions ❌ Not supported
INSERT / UPDATE / DELETE ❌ Not supported
Concurrency ❌ Not supported
Persistence ❌ Not supported
Full SQL dialect ❌ Not supported

Scope

This engine supports SQL-over-CSV query execution: read-only queries against tables stored as CSV files. It does not support transactions, indexes, data modification (INSERT/UPDATE/DELETE), concurrency, persistence, or a full SQL dialect. Values are typed int or string, inferred per column from the data. Tables are discovered from CSV files with header rows; no separate schema file.

Supported SQL features include SELECT/FROM/WHERE, GROUP BY with SUM, COUNT, AVG, MIN, and MAX aggregates, ORDER BY, DISTINCT, inner joins, and LIMIT n.

The focus is on demonstrating query planning, optimisation, and the Volcano iterator execution model.

Aggregate and LIMIT semantics

Case Behavior
AVG of ints truncated integer division (toward zero)
Aggregate over empty input, no GROUP BY zero rows (header only) — deviates from SQL's NULL row
COUNT(col) equals COUNT(*) — the engine has no NULLs
SUM/AVG on a string column error
MIN/MAX on strings lexicographic
SUM past int range error
LIMIT 0 header-only output
OFFSET, LIMIT ALL not supported (error)

Quick Start

Prerequisites: Java 17, Maven (or use the included Maven Wrapper).

# Clone
git clone https://github.com/JinBa1/java-query-engine.git
cd java-query-engine

# Build fat JAR (engine module)
./mvnw -pl engine -DskipTests clean package

Run a query:

java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
  com.github.jinba1.cuckoodb.CuckooDB \
  database_dir input_file output_file [--max-tuples=N] [--timeout-ms=N]

Both --max-tuples and --timeout-ms are optional and independent. Omit either to impose no limit on that dimension.

Query budgets

The engine enforces total-work semantics: every tuple emitted by any operator in the tree counts against the budget, including intermediate tuples that are later filtered or joined. A cross-product explosion that never produces output rows will still hit the tuple limit. The timeout clock starts lazily at the first tuple emission.

When a budget is exceeded:

  • The partial output file is deleted.
  • Error: <message> is written to stderr.
  • The process exits with code 1.

Both flags are optional and independent — you can use one, both, or neither.

EXPLAIN

Prefix any query with EXPLAIN to inspect the query plan without executing it:

EXPLAIN SELECT Student.B, SUM(Student.C) FROM Student, Enrolled
WHERE Student.D > 30 AND Student.A = Enrolled.A
GROUP BY Student.B;

The output file receives a two-section plan:

=== Plan (as written) ===
Aggregate[group by: Student.B; calls: SUM(Student.c)]
  Project[Enrolled.A, Student.A, Student.B, Student.C, Student.D]
    Select[Student.D > 30]
      Join[Student.A = Enrolled.A]
        Scan[Student]
        Scan[Enrolled]

=== Plan (optimized) ===
Aggregate[group by: Student.B; calls: SUM(Student.c)]
  Project[Enrolled.A, Student.A, Student.B, Student.C, Student.D]
    Join[Student.A = Enrolled.A]
      Select[Student.D > 30]
        Scan[Student]
      Project[Enrolled.A]
        Scan[Enrolled]

No query execution occurs for EXPLAIN queries.

Join algorithms

The engine supports two join algorithms; the planner selects between them automatically.

Nested-loop join

JoinOperator implements a classic nested-loop join: for every outer tuple the inner child is rewound and scanned in full. It handles any join condition (equality, inequality, arbitrary expression, or cross product with no condition). EXPLAIN shows it as Join[<condition>].

Hash join

HashJoinOperator extends JoinOperator with an in-memory hash join. The inner (build) side is drained once into a HashMap keyed by the equality conjuncts; the outer (probe) side then streams through once. After a hash-table lookup, the full original condition is re-evaluated on every candidate, so residual non-equality conjuncts (e.g. A.x = B.x AND A.y > 3) work correctly. Output order — outer-major, inner order preserved within each key bucket — is identical to the nested-loop join. EXPLAIN shows it as HashJoin[<condition>].

Auto-selection rule: the planner chooses hash join when Constants.useHashJoin is true (the default) and the join condition contains at least one column-to-column equality conjunct (e.g. Student.A = Enrolled.A). Cross products (no condition) and pure non-equi joins (e.g. A.x > B.y only) always use nested-loop join.

Toggle: set Constants.useHashJoin = false at program start (or in tests) to force nested-loop for all joins.

Benchmarks

Performance was measured with a JMH 1.37 benchmark suite in the bench/ package (engine/src/test/java/com/github/jinba1/cuckoodb/bench/). The suite is compiled in CI but never run there; run it locally with:

./mvnw -pl engine -q test-compile exec:exec -Dexec.executable=java -Dexec.classpathScope=test \
  "-Dexec.args=-cp %classpath org.openjdk.jmh.Main .*Benchmark"

Results (OpenJDK 21.0.5, Intel Core i9-13900HX, 32 logical cores, Linux under WSL2):

Benchmark matchesPerKey rowsPerSide useHashJoin Mode Cnt Score Error Units
EndToEndJoinBenchmark.planAndDrain N/A N/A true avgt 3 1.028 ± 0.288 ms/op
EndToEndJoinBenchmark.planAndDrain N/A N/A false avgt 3 315.523 ± 36.021 ms/op
JoinAlgorithmBenchmark.hashJoin 1 1000 N/A avgt 5 0.270 ± 0.011 ms/op
JoinAlgorithmBenchmark.hashJoin 1 5000 N/A avgt 5 1.382 ± 0.154 ms/op
JoinAlgorithmBenchmark.hashJoin 10 1000 N/A avgt 5 2.160 ± 0.109 ms/op
JoinAlgorithmBenchmark.hashJoin 10 5000 N/A avgt 5 10.661 ± 0.840 ms/op
JoinAlgorithmBenchmark.nestedLoopJoin 1 1000 N/A avgt 5 202.621 ± 25.313 ms/op
JoinAlgorithmBenchmark.nestedLoopJoin 1 5000 N/A avgt 5 5027.912 ± 370.564 ms/op
JoinAlgorithmBenchmark.nestedLoopJoin 10 1000 N/A avgt 5 194.786 ± 4.916 ms/op
JoinAlgorithmBenchmark.nestedLoopJoin 10 5000 N/A avgt 5 4785.620 ± 212.683 ms/op

EndToEndJoinBenchmark joins two 1 000-row CSV tables through the full planner pipeline; nested-loop re-parses the inner CSV once per outer row, so the gap (≈ 307×) reflects both the algorithmic difference and I/O cost. JoinAlgorithmBenchmark uses in-memory CachedOperator inputs to isolate the join algorithm itself; at 5 000 rows/side the operator-level gap is ≈ 3 600×.

Benchmarks are compiled in CI but never executed there.

Demo

Input table (engine/samples/db/data/Student.csv):

A, B, C, D
1, 200, 50, 33
2, 200, 200, 44
3, 100, 105, 44
4, 100, 50, 11
5, 100, 500, 22
6, 300, 400, 11

Query (engine/samples/input/query4.sql):

SELECT * FROM Student WHERE Student.A < 3;

Command:

java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
  com.github.jinba1.cuckoodb.CuckooDB \
  engine/samples/db engine/samples/input/query4.sql output.csv

To limit resource usage, add optional budget flags:

java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
  com.github.jinba1.cuckoodb.CuckooDB \
  engine/samples/db engine/samples/input/query4.sql output.csv --max-tuples=10000 --timeout-ms=5000

Output (output.csv):

a,b,c,d
1,200,50,33
2,200,200,44

Running Examples

The engine/samples/ directory ships with 20 queries and a small dataset (Student, Course, Enrolled, Staff tables). Expected output lives in engine/samples/expected_output/.

Run all 20 through the bundled runner, which diffs each result against the expected output and reports pass/fail. It is launched via exec:exec (not exec:java) so it runs with the engine module as the working directory — exec:java would keep the working directory at the reactor root and fail to find samples/:

./mvnw -pl engine -q test-compile exec:exec -Dexec.executable=java -Dexec.classpathScope=test \
  "-Dexec.args=-cp %classpath com.github.jinba1.cuckoodb.SampleQueryRunner"

Or run each query through the CLI and diff manually:

# Run all sample queries and diff against expected output
for i in $(seq 1 20); do
  java -cp engine/target/cuckoodb-engine-1.0.0-jar-with-dependencies.jar \
    com.github.jinba1.cuckoodb.CuckooDB \
    engine/samples/db "engine/samples/input/query${i}.sql" "/tmp/out${i}.csv"
  diff "engine/samples/expected_output/query${i}.csv" "/tmp/out${i}.csv" && echo "query${i}: OK"
done

Testing

./mvnw test

The test suite covers individual operators, the query planner, the optimiser, expression evaluation, query budgets, EXPLAIN, hash join, and end-to-end integration scenarios (339 tests).

Project Structure

├── pom.xml                                          # Parent POM (aggregator: engine + server; Java 17, dep/plugin management)
├── engine/                                          # Pure query engine — zero Spring dependencies
│   ├── pom.xml                                      # cuckoodb-engine (JSqlParser 4.7, commons-csv 1.14.1, JMH 1.37 test-scope)
│   ├── src/main/java/com/github/jinba1/cuckoodb/    # Core engine (35 files)
│   │   └── operator/                                # Volcano operators (11 files, incl. HashJoinOperator)
│   ├── src/test/java/com/github/jinba1/cuckoodb/    # JUnit 5 tests (339 tests across 33 files)
│   └── samples/
│       ├── db/data/                                 # CSV data files (header row + data rows)
│       ├── input/query[1-20].sql                    # Sample queries
│       └── expected_output/query[1-20].csv          # Expected results
├── server/                                          # REST gateway skeleton — Spring Boot REST gateway planned
│   ├── pom.xml                                      # cuckoodb-server (depends on cuckoodb-engine; no Spring yet)
│   └── src/main/java/com/github/jinba1/cuckoodb/server/
├── mvnw / mvnw.cmd                                  # Maven Wrapper
└── LICENSE

Background

Originally built as a university project for the Advanced Database Systems course at the University of Edinburgh, subsequently extended with additional query optimisation and expanded test coverage.

License

This project is released under the MIT License. See LICENSE for details.

About

In-memory relational query engine built on the Volcano/iterator model. SQL parsing, query planning and optimisation in Java.

Topics

Resources

License

Stars

Watchers

Forks

Contributors

Languages