Skip to content

feat(java): speed up zonemap reads#6906

Open
summaryzb wants to merge 3 commits into
lance-format:mainfrom
summaryzb:zonemap_speedup
Open

feat(java): speed up zonemap reads#6906
summaryzb wants to merge 3 commits into
lance-format:mainfrom
summaryzb:zonemap_speedup

Conversation

@summaryzb

@summaryzb summaryzb commented May 22, 2026

Copy link
Copy Markdown
Contributor

Summary

The end-to-end time for reading zonemap for 1 billion rows (even including OSS network latency) is now 1/100 faster than before.
Replaces per-row JNI object construction with a single Arrow IPC byte stream returned across the JNI boundary, then decoded back to List in pure Java via the new ZonemapStatsCodec. The profiled hot path drops from O(N) to O(1) — one byte[] crossing the boundary, regardless of zone count.

Problem

The profiled hot path is caused by O(N) JNI calls (one new_object + 3 method invocations per zone, plus per-value scalar_value_to_java)
image

Tests:

  • New ZonemapStatsCodecTest: 9 tests on synthetic IPC streams covering empty/null bytes, null-in-non-nullable across all id columns, date/ timestamp round-trip, and multi-batch ordering.
  • ZonemapStatsTest extended with 6 integration tests covering VARCHAR, FLOAT8, BIT, DATEDAY, non-TZ TIMESTAMP, and all-null zones (was IntVector-only).

Replaces per-row JNI object construction with a single Arrow IPC byte
stream returned across the JNI boundary, then decoded back to
List<ZoneStats> in pure Java via the new ZonemapStatsCodec. The profiled
hot path drops from O(N) JNI calls (one new_object + 3 method invocations
per zone, plus per-value scalar_value_to_java) to O(1) — one byte[]
crossing the boundary, regardless of zone count.

Tests:
- New ZonemapStatsCodecTest: 9 tests on synthetic IPC streams covering
  empty/null bytes, null-in-non-nullable across all id columns, date/
  timestamp round-trip, and multi-batch ordering.
- ZonemapStatsTest extended with 6 integration tests covering VARCHAR,
  FLOAT8, BIT, DATEDAY, non-TZ TIMESTAMP, and all-null zones (was
  IntVector-only).

Change-Id: If0680e1be67c9c9323d7c6d34b8727f34a599068

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added enhancement New feature or request A-java Java bindings + JNI labels May 22, 2026
@summaryzb

Copy link
Copy Markdown
Contributor Author

@Xuanwo @hamersaw PTAL

ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
while (reader.loadNextBatch()) {
VectorSchemaRoot root = reader.getVectorSchemaRoot();
FieldVector fragmentIdVec = root.getVector("fragment_id");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decoder trusts the IPC schema and coerces structural fields from any numeric vector, so malformed or version-skewed zonemap data can produce unclear runtime failures or incorrect stats instead of a clear invalid-index error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — addressed. The decoder now validates the IPC schema field-by-field before reading any row, against the Rust producer's contract

* IPC byte array instead of constructing N {@code ZoneStats} JNI objects per call. The type →
* {@code Comparable} mapping mirrors the Rust {@code scalar_value_to_java} helper.
*/
public final class ZonemapStatsCodec {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This public codec exposes the internal zonemap IPC payload as a Java API surface, so downstream code can depend on an unversioned transport format and constrain future schema changes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ZonemapStatsCodec is now package-private (default visibility) and was moved from org.lance.index.scalar to org.lance so it sits next to its only caller

public static List<ZoneStats> decode(byte[] ipcBytes, BufferAllocator allocator)
throws IOException {
if (ipcBytes == null || ipcBytes.length == 0) {
return Collections.emptyList();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The empty-payload path returns an immutable list while the previous JNI path returned a mutable list, so callers that append to empty results can now fail only in the no-index or empty-index case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The empty/null ipcBytes branch now returns new ArrayList<>(), matching the mutability of the previous JNI path so callers that append to an empty result keep working

Change-Id: I40eee4e95ca34f9d60eb25b6bd201ddff26c6654
Change-Id: I11c5ea90f71045456b3c75b5caa676b2cc56f4ba
@summaryzb summaryzb requested a review from Xuanwo June 29, 2026 04:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-java Java bindings + JNI enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants