[VL] Optimize Delta DV applyDeletionFilter with iterator-based bulk lookup#12395
[VL] Optimize Delta DV applyDeletionFilter with iterator-based bulk lookup#12395iemejia wants to merge 6 commits into
Conversation
…ookup Replace per-row contains() calls with an iterator-based scan in DeltaDeletionVectorReader::applyDeletionFilter(). Before: O(batch_size) calls to Roaring64Map::contains(), each performing a std::map lookup (O(log N) on the high-32-bit buckets) plus a Roaring container check per row position. After: O(deletions_in_range) iterator advances using move_equalorlarger() to seek directly to the first deleted row in the batch range, then sequential ++iterator advances (O(1) amortized within a Roaring container) until the range end. Benchmark results (1M row file, batches of 4096, scanning full file): Deletion % Baseline (ms) Optimized (ms) Speedup 1% (sparse) 9.90 0.050 198x 10% 4.06 0.398 10.2x 50% 3.82 1.99 1.9x 90% 4.37 3.64 1.2x For the common case of sparse deletions (1%, typical for Delta MERGE/UPDATE), this is a 198x improvement. The optimization is most effective when deletions are sparse relative to batch size, which is the dominant production pattern. The baseline O(batch_size) approach spent constant time (~4ms per 1M rows) regardless of deletion density because it always probed every row. The optimized O(deletions) approach scales linearly with actual deletions: 0.05ms at 1%, 0.4ms at 10%, 2ms at 50%. Tests: - 21 unit tests pass (12 existing + 9 new corner cases covering: all rows deleted, no rows deleted, first/last row, large 64-bit offsets, single-row batches, dense/sparse mixed patterns, batch before/after all deletions). - New BM_ApplyDeletionFilter benchmark added to DeltaBitmapBenchmark.cc with 5 scenarios (sparse/moderate/dense/very-dense/large-batch). Assisted-by: GitHub Copilot:claude-opus-4.6
There was a problem hiding this comment.
Pull request overview
This PR optimizes Delta Lake deletion-vector filtering in the Velox backend by switching DeltaDeletionVectorReader::applyDeletionFilter() from per-row Roaring64Map::contains() probes to an iterator-based scan over only the deletions that fall within the current batch range, improving scan performance (especially for sparse deletions). It also expands unit test coverage for corner cases and adds a microbenchmark for the hot-path behavior.
Changes:
- Reworked
DeltaDeletionVectorReader::applyDeletionFilter()to seek to the first deleted row in-range and then iterate sequentially until the batch end. - Added multiple new unit tests covering iterator-based corner cases (all/none deleted, boundary rows, large offsets, single-row batches, sparse/dense mixes, and non-overlapping ranges).
- Added
BM_ApplyDeletionFilterto benchmark the batch filtering path across multiple deletion densities and batch sizes.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| cpp/velox/compute/delta/DeltaDeletionVectorReader.cpp | Implements iterator-based deletion lookup in applyDeletionFilter() to reduce per-row overhead. |
| cpp/velox/compute/delta/tests/DeltaDeletionVectorReaderTest.cpp | Adds corner-case unit tests targeting the new iterator-based behavior. |
| cpp/velox/benchmarks/DeltaBitmapBenchmark.cc | Adds a new benchmark (BM_ApplyDeletionFilter) for DV batch filtering performance. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Simulate scanning through the file in batches. | ||
| const uint64_t numBatches = totalFileRows / batchSize; | ||
| uint64_t totalDeletedFound = 0; | ||
|
|
||
| for (auto _ : state) { | ||
| totalDeletedFound = 0; | ||
| for (uint64_t batch = 0; batch < numBatches; ++batch) { | ||
| reader.applyDeletionFilter(batch * batchSize, batchSize, deleteBitmap); | ||
| // Count bits set to prevent dead-code elimination. | ||
| auto* raw = deleteBitmap->as<uint64_t>(); | ||
| for (uint64_t w = 0; w < bits::nwords(batchSize); ++w) { | ||
| totalDeletedFound += __builtin_popcountll(raw[w]); | ||
| } | ||
| } | ||
| benchmark::DoNotOptimize(totalDeletedFound); | ||
| } | ||
|
|
||
| state.SetItemsProcessed(state.iterations() * totalFileRows); | ||
| state.counters["batch_size"] = benchmark::Counter(batchSize); | ||
| state.counters["deletion_pct"] = benchmark::Counter(deletionPercent); | ||
| state.counters["deleted_found"] = benchmark::Counter(totalDeletedFound); | ||
| state.counters["total_batches"] = benchmark::Counter(numBatches); |
There was a problem hiding this comment.
Good catch on both points.
Tail rows: Fixed — SetItemsProcessed now uses numBatches * batchSize (the rows actually processed) instead of totalFileRows. For the default batch size of 4096, the tail is 576 rows out of 1M (0.06%), so the benchmark results are unaffected, but the counter is now accurate.
Padding bits in popcount: This is actually safe — applyDeletionFilter calls memset(rawBitmap, 0, nbytes(size)) at the top, which zeros the entire bitmap including padding bits in the last word. Added an inline comment clarifying this.
Address review feedback: SetItemsProcessed now reports the actual rows processed (numBatches * batchSize) rather than totalFileRows, which slightly overstated the work when totalFileRows is not evenly divisible by batchSize. Added inline comment clarifying padding bits in popcount are safe due to memset zeroing in applyDeletionFilter. Assisted-by: GitHub Copilot:claude-opus-4.6
Assisted-by: GitHub Copilot:claude-opus-4.6
| // Use an iterator-based approach instead of per-row contains() lookups. | ||
| // This is O(deletions_in_range) rather than O(batch_size), which is | ||
| // significantly faster when deletions are sparse relative to batch size. | ||
| const uint64_t rangeEnd = baseReadOffset + size; |
There was a problem hiding this comment.
Fixed — added saturating arithmetic so rangeEnd clamps to UINT64_MAX instead of wrapping:
const uint64_t rangeEnd = (size <= UINT64_MAX - baseReadOffset)
? baseReadOffset + size : UINT64_MAX;Also added a unit test (ApplyDeletionFilterOverflowProtection) that exercises baseReadOffset = UINT64_MAX - 50 with size = 100 to verify the guard works correctly.
| // Allocate the output bitmap buffer. | ||
| memory::MemoryManager::testingSetInstance(memory::MemoryManager::Options{}); | ||
| auto pool = memory::memoryManager()->addLeafPool(); |
There was a problem hiding this comment.
Good catch — moved testingSetInstance out of the benchmark body into a custom main() that initializes once before running benchmarks. This mirrors the pattern in GenericBenchmark.cc:642.
…o main() - Add saturating arithmetic for rangeEnd to prevent uint64_t overflow when baseReadOffset is near UINT64_MAX. - Move memory::MemoryManager::testingSetInstance() from benchmark body to a custom main(), ensuring it runs once per process. - Add ApplyDeletionFilterOverflowProtection unit test. Assisted-by: GitHub Copilot:claude-opus-4.6
- Change createSerializedPayload to accept std::vector<uint64_t> to match RoaringBitmapArray::addSafe(uint64_t) and avoid narrowing conversion when passing values near UINT64_MAX. - Update all local deletedRows vectors and loop counters to uint64_t. - Remove unused readUint64LittleEndian() that triggered -Wunused-function. Assisted-by: GitHub Copilot:claude-opus-4.6
| TEST_F(DeltaDeletionVectorReaderTest, ApplyDeletionFilterOverflowProtection) { | ||
| // Verify that baseReadOffset near UINT64_MAX does not overflow rangeEnd. | ||
| const uint64_t nearMax = UINT64_MAX - 50; | ||
| auto payload = createSerializedPayload({nearMax + 10, nearMax + 20}); | ||
|
|
||
| DeltaDeletionVectorReader reader; | ||
| reader.loadSerializedDeletionVector(payload); | ||
|
|
||
| // Request a batch of 100 starting at nearMax — this would overflow without | ||
| // the saturation guard (nearMax + 100 > UINT64_MAX). | ||
| auto deleteBitmap = AlignedBuffer::allocate<uint64_t>(bits::nwords(100), pool_.get()); | ||
| reader.applyDeletionFilter(nearMax, 100, deleteBitmap); | ||
|
|
||
| auto* rawBitmap = deleteBitmap->as<uint64_t>(); | ||
| // Rows at relative positions 10 and 20 should be marked as deleted. | ||
| EXPECT_TRUE(bits::isBitSet(rawBitmap, 10)); | ||
| EXPECT_TRUE(bits::isBitSet(rawBitmap, 20)); | ||
| EXPECT_FALSE(bits::isBitSet(rawBitmap, 0)); | ||
| EXPECT_FALSE(bits::isBitSet(rawBitmap, 30)); | ||
| } |
There was a problem hiding this comment.
Good catch. Fixed in c566737 — the test now uses low-value bitmap entries ({10, 20, 30}) that are valid under addSafe()'s kMaxRepresentableValue limit, and exercises the overflow guard purely through a high baseReadOffset (UINT64_MAX - 50 with size = 100). This verifies the saturating arithmetic in rangeEnd without needing to bypass the serialization validation.
Constructing a raw Roaring64Map payload with near-UINT64_MAX positions would let us test deletions actually present in the overflow range, but Delta's spec caps row indices well below that, so such a DV would be invalid in practice.
RoaringBitmapArray::addSafe rejects values above kMaxRepresentableValue, so the test cannot insert row indices near UINT64_MAX. Instead, test the overflow guard by calling applyDeletionFilter with a baseReadOffset near UINT64_MAX on a bitmap with low-value entries, verifying no crash and empty result. Assisted-by: GitHub Copilot:claude-opus-4.6
What changes are proposed in this pull request?
Replace per-row
contains()calls with an iterator-based scan inDeltaDeletionVectorReader::applyDeletionFilter().Before: O(batch_size) calls to
Roaring64Map::contains(), each performing astd::maplookup (O(log N) on the high-32-bit buckets) plus a Roaring container check per row position.After: O(deletions_in_range) iterator advances using
move_equalorlarger()to seek directly to the first deleted row in the batch range, then sequential++iteratoradvances (O(1) amortized within a Roaring container) until the range end.Benchmark results (1M row file, batches of 4096, scanning full file)
For the common case of sparse deletions (1%, typical for Delta MERGE/UPDATE), this is a 198x improvement.
The baseline O(batch_size) approach spent constant time (~4ms per 1M rows) regardless of deletion density because it always probed every row. The optimized O(deletions) approach scales linearly with actual deletions.
How was this patch tested?
BM_ApplyDeletionFilterbenchmark added toDeltaBitmapBenchmark.ccwith 5 scenarios (sparse/moderate/dense/very-dense/large-batch).Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude claude-opus-4.6