Skip to content

[VL] Optimize Delta DV applyDeletionFilter with iterator-based bulk lookup#12395

Open
iemejia wants to merge 6 commits into
apache:mainfrom
iemejia:feature/velox-dv-bulk-roaring-filter
Open

[VL] Optimize Delta DV applyDeletionFilter with iterator-based bulk lookup#12395
iemejia wants to merge 6 commits into
apache:mainfrom
iemejia:feature/velox-dv-bulk-roaring-filter

Conversation

@iemejia

@iemejia iemejia commented Jun 29, 2026

Copy link
Copy Markdown
Member

What changes are proposed in this pull request?

Replace per-row contains() calls with an iterator-based scan in DeltaDeletionVectorReader::applyDeletionFilter().

Before: O(batch_size) calls to Roaring64Map::contains(), each performing a std::map lookup (O(log N) on the high-32-bit buckets) plus a Roaring container check per row position.

After: O(deletions_in_range) iterator advances using move_equalorlarger() to seek directly to the first deleted row in the batch range, then sequential ++iterator advances (O(1) amortized within a Roaring container) until the range end.

Benchmark results (1M row file, batches of 4096, scanning full file)

Deletion % Baseline (ms) Optimized (ms) Speedup
1% (sparse) 9.90 0.050 198x
10% (moderate) 4.06 0.398 10.2x
50% (dense) 3.82 1.99 1.9x
90% (very dense) 4.37 3.64 1.2x

For the common case of sparse deletions (1%, typical for Delta MERGE/UPDATE), this is a 198x improvement.

The baseline O(batch_size) approach spent constant time (~4ms per 1M rows) regardless of deletion density because it always probed every row. The optimized O(deletions) approach scales linearly with actual deletions.

How was this patch tested?

  • 21 C++ unit tests pass (12 existing + 9 new corner cases):
    • All rows deleted / no rows deleted
    • First and last row of batch deleted
    • Large 64-bit offsets (multi-bucket Roaring64Map)
    • Single-row batches (deleted / not deleted)
    • Dense/sparse mixed deletion patterns
    • Batch range entirely before / after all deletions
    • Partial overlap with deletion range
  • New BM_ApplyDeletionFilter benchmark added to DeltaBitmapBenchmark.cc with 5 scenarios (sparse/moderate/dense/very-dense/large-batch).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude claude-opus-4.6

…ookup

Replace per-row contains() calls with an iterator-based scan in
DeltaDeletionVectorReader::applyDeletionFilter().

Before: O(batch_size) calls to Roaring64Map::contains(), each performing
a std::map lookup (O(log N) on the high-32-bit buckets) plus a Roaring
container check per row position.

After: O(deletions_in_range) iterator advances using move_equalorlarger()
to seek directly to the first deleted row in the batch range, then
sequential ++iterator advances (O(1) amortized within a Roaring
container) until the range end.

Benchmark results (1M row file, batches of 4096, scanning full file):

  Deletion %    Baseline (ms)    Optimized (ms)    Speedup
  1% (sparse)       9.90             0.050           198x
  10%               4.06             0.398           10.2x
  50%               3.82             1.99            1.9x
  90%               4.37             3.64            1.2x

For the common case of sparse deletions (1%, typical for Delta
MERGE/UPDATE), this is a 198x improvement. The optimization is most
effective when deletions are sparse relative to batch size, which is the
dominant production pattern.

The baseline O(batch_size) approach spent constant time (~4ms per 1M
rows) regardless of deletion density because it always probed every row.
The optimized O(deletions) approach scales linearly with actual
deletions: 0.05ms at 1%, 0.4ms at 10%, 2ms at 50%.

Tests:
- 21 unit tests pass (12 existing + 9 new corner cases covering: all
  rows deleted, no rows deleted, first/last row, large 64-bit offsets,
  single-row batches, dense/sparse mixed patterns, batch before/after
  all deletions).
- New BM_ApplyDeletionFilter benchmark added to DeltaBitmapBenchmark.cc
  with 5 scenarios (sparse/moderate/dense/very-dense/large-batch).

Assisted-by: GitHub Copilot:claude-opus-4.6
Copilot AI review requested due to automatic review settings June 29, 2026 14:37
@github-actions github-actions Bot added the VELOX label Jun 29, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes Delta Lake deletion-vector filtering in the Velox backend by switching DeltaDeletionVectorReader::applyDeletionFilter() from per-row Roaring64Map::contains() probes to an iterator-based scan over only the deletions that fall within the current batch range, improving scan performance (especially for sparse deletions). It also expands unit test coverage for corner cases and adds a microbenchmark for the hot-path behavior.

Changes:

  • Reworked DeltaDeletionVectorReader::applyDeletionFilter() to seek to the first deleted row in-range and then iterate sequentially until the batch end.
  • Added multiple new unit tests covering iterator-based corner cases (all/none deleted, boundary rows, large offsets, single-row batches, sparse/dense mixes, and non-overlapping ranges).
  • Added BM_ApplyDeletionFilter to benchmark the batch filtering path across multiple deletion densities and batch sizes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
cpp/velox/compute/delta/DeltaDeletionVectorReader.cpp Implements iterator-based deletion lookup in applyDeletionFilter() to reduce per-row overhead.
cpp/velox/compute/delta/tests/DeltaDeletionVectorReaderTest.cpp Adds corner-case unit tests targeting the new iterator-based behavior.
cpp/velox/benchmarks/DeltaBitmapBenchmark.cc Adds a new benchmark (BM_ApplyDeletionFilter) for DV batch filtering performance.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +312 to +333
// Simulate scanning through the file in batches.
const uint64_t numBatches = totalFileRows / batchSize;
uint64_t totalDeletedFound = 0;

for (auto _ : state) {
totalDeletedFound = 0;
for (uint64_t batch = 0; batch < numBatches; ++batch) {
reader.applyDeletionFilter(batch * batchSize, batchSize, deleteBitmap);
// Count bits set to prevent dead-code elimination.
auto* raw = deleteBitmap->as<uint64_t>();
for (uint64_t w = 0; w < bits::nwords(batchSize); ++w) {
totalDeletedFound += __builtin_popcountll(raw[w]);
}
}
benchmark::DoNotOptimize(totalDeletedFound);
}

state.SetItemsProcessed(state.iterations() * totalFileRows);
state.counters["batch_size"] = benchmark::Counter(batchSize);
state.counters["deletion_pct"] = benchmark::Counter(deletionPercent);
state.counters["deleted_found"] = benchmark::Counter(totalDeletedFound);
state.counters["total_batches"] = benchmark::Counter(numBatches);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on both points.

Tail rows: Fixed — SetItemsProcessed now uses numBatches * batchSize (the rows actually processed) instead of totalFileRows. For the default batch size of 4096, the tail is 576 rows out of 1M (0.06%), so the benchmark results are unaffected, but the counter is now accurate.

Padding bits in popcount: This is actually safe — applyDeletionFilter calls memset(rawBitmap, 0, nbytes(size)) at the top, which zeros the entire bitmap including padding bits in the last word. Added an inline comment clarifying this.

iemejia added 2 commits June 29, 2026 18:23
Address review feedback: SetItemsProcessed now reports the actual rows
processed (numBatches * batchSize) rather than totalFileRows, which
slightly overstated the work when totalFileRows is not evenly divisible
by batchSize. Added inline comment clarifying padding bits in popcount
are safe due to memset zeroing in applyDeletionFilter.

Assisted-by: GitHub Copilot:claude-opus-4.6
Assisted-by: GitHub Copilot:claude-opus-4.6
Copilot AI review requested due to automatic review settings June 29, 2026 17:08

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

// Use an iterator-based approach instead of per-row contains() lookups.
// This is O(deletions_in_range) rather than O(batch_size), which is
// significantly faster when deletions are sparse relative to batch size.
const uint64_t rangeEnd = baseReadOffset + size;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added saturating arithmetic so rangeEnd clamps to UINT64_MAX instead of wrapping:

const uint64_t rangeEnd = (size <= UINT64_MAX - baseReadOffset)
    ? baseReadOffset + size : UINT64_MAX;

Also added a unit test (ApplyDeletionFilterOverflowProtection) that exercises baseReadOffset = UINT64_MAX - 50 with size = 100 to verify the guard works correctly.

Comment on lines +304 to +306
// Allocate the output bitmap buffer.
memory::MemoryManager::testingSetInstance(memory::MemoryManager::Options{});
auto pool = memory::memoryManager()->addLeafPool();

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — moved testingSetInstance out of the benchmark body into a custom main() that initializes once before running benchmarks. This mirrors the pattern in GenericBenchmark.cc:642.

iemejia added 2 commits June 29, 2026 19:41
…o main()

- Add saturating arithmetic for rangeEnd to prevent uint64_t overflow
  when baseReadOffset is near UINT64_MAX.
- Move memory::MemoryManager::testingSetInstance() from benchmark body
  to a custom main(), ensuring it runs once per process.
- Add ApplyDeletionFilterOverflowProtection unit test.

Assisted-by: GitHub Copilot:claude-opus-4.6
- Change createSerializedPayload to accept std::vector<uint64_t> to
  match RoaringBitmapArray::addSafe(uint64_t) and avoid narrowing
  conversion when passing values near UINT64_MAX.
- Update all local deletedRows vectors and loop counters to uint64_t.
- Remove unused readUint64LittleEndian() that triggered
  -Wunused-function.

Assisted-by: GitHub Copilot:claude-opus-4.6
Copilot AI review requested due to automatic review settings June 29, 2026 19:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment on lines +383 to +402
TEST_F(DeltaDeletionVectorReaderTest, ApplyDeletionFilterOverflowProtection) {
// Verify that baseReadOffset near UINT64_MAX does not overflow rangeEnd.
const uint64_t nearMax = UINT64_MAX - 50;
auto payload = createSerializedPayload({nearMax + 10, nearMax + 20});

DeltaDeletionVectorReader reader;
reader.loadSerializedDeletionVector(payload);

// Request a batch of 100 starting at nearMax — this would overflow without
// the saturation guard (nearMax + 100 > UINT64_MAX).
auto deleteBitmap = AlignedBuffer::allocate<uint64_t>(bits::nwords(100), pool_.get());
reader.applyDeletionFilter(nearMax, 100, deleteBitmap);

auto* rawBitmap = deleteBitmap->as<uint64_t>();
// Rows at relative positions 10 and 20 should be marked as deleted.
EXPECT_TRUE(bits::isBitSet(rawBitmap, 10));
EXPECT_TRUE(bits::isBitSet(rawBitmap, 20));
EXPECT_FALSE(bits::isBitSet(rawBitmap, 0));
EXPECT_FALSE(bits::isBitSet(rawBitmap, 30));
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed in c566737 — the test now uses low-value bitmap entries ({10, 20, 30}) that are valid under addSafe()'s kMaxRepresentableValue limit, and exercises the overflow guard purely through a high baseReadOffset (UINT64_MAX - 50 with size = 100). This verifies the saturating arithmetic in rangeEnd without needing to bypass the serialization validation.

Constructing a raw Roaring64Map payload with near-UINT64_MAX positions would let us test deletions actually present in the overflow range, but Delta's spec caps row indices well below that, so such a DV would be invalid in practice.

RoaringBitmapArray::addSafe rejects values above kMaxRepresentableValue,
so the test cannot insert row indices near UINT64_MAX. Instead, test the
overflow guard by calling applyDeletionFilter with a baseReadOffset near
UINT64_MAX on a bitmap with low-value entries, verifying no crash and
empty result.

Assisted-by: GitHub Copilot:claude-opus-4.6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants