feat(java): speed up zonemap reads#6906
Conversation
Replaces per-row JNI object construction with a single Arrow IPC byte stream returned across the JNI boundary, then decoded back to List<ZoneStats> in pure Java via the new ZonemapStatsCodec. The profiled hot path drops from O(N) JNI calls (one new_object + 3 method invocations per zone, plus per-value scalar_value_to_java) to O(1) — one byte[] crossing the boundary, regardless of zone count. Tests: - New ZonemapStatsCodecTest: 9 tests on synthetic IPC streams covering empty/null bytes, null-in-non-nullable across all id columns, date/ timestamp round-trip, and multi-batch ordering. - ZonemapStatsTest extended with 6 integration tests covering VARCHAR, FLOAT8, BIT, DATEDAY, non-TZ TIMESTAMP, and all-null zones (was IntVector-only). Change-Id: If0680e1be67c9c9323d7c6d34b8727f34a599068
| ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) { | ||
| while (reader.loadNextBatch()) { | ||
| VectorSchemaRoot root = reader.getVectorSchemaRoot(); | ||
| FieldVector fragmentIdVec = root.getVector("fragment_id"); |
There was a problem hiding this comment.
The decoder trusts the IPC schema and coerces structural fields from any numeric vector, so malformed or version-skewed zonemap data can produce unclear runtime failures or incorrect stats instead of a clear invalid-index error.
There was a problem hiding this comment.
Good point — addressed. The decoder now validates the IPC schema field-by-field before reading any row, against the Rust producer's contract
| * IPC byte array instead of constructing N {@code ZoneStats} JNI objects per call. The type → | ||
| * {@code Comparable} mapping mirrors the Rust {@code scalar_value_to_java} helper. | ||
| */ | ||
| public final class ZonemapStatsCodec { |
There was a problem hiding this comment.
This public codec exposes the internal zonemap IPC payload as a Java API surface, so downstream code can depend on an unversioned transport format and constrain future schema changes.
There was a problem hiding this comment.
ZonemapStatsCodec is now package-private (default visibility) and was moved from org.lance.index.scalar to org.lance so it sits next to its only caller
| public static List<ZoneStats> decode(byte[] ipcBytes, BufferAllocator allocator) | ||
| throws IOException { | ||
| if (ipcBytes == null || ipcBytes.length == 0) { | ||
| return Collections.emptyList(); |
There was a problem hiding this comment.
The empty-payload path returns an immutable list while the previous JNI path returned a mutable list, so callers that append to empty results can now fail only in the no-index or empty-index case.
There was a problem hiding this comment.
The empty/null ipcBytes branch now returns new ArrayList<>(), matching the mutability of the previous JNI path so callers that append to an empty result keep working
Change-Id: I11c5ea90f71045456b3c75b5caa676b2cc56f4ba
f8c75fa to
dd2005a
Compare
Summary
The end-to-end time for reading zonemap for 1 billion rows (even including OSS network latency) is now 1/100 faster than before.
Replaces per-row JNI object construction with a single Arrow IPC byte stream returned across the JNI boundary, then decoded back to List in pure Java via the new ZonemapStatsCodec. The profiled hot path drops from O(N) to O(1) — one byte[] crossing the boundary, regardless of zone count.
Problem
The profiled hot path is caused by O(N) JNI calls (one new_object + 3 method invocations per zone, plus per-value scalar_value_to_java)

Tests: