Skip to content

fix(merge-insert): apply Delete/Fail on the indexed-scan path#7484

Merged
hamersaw merged 2 commits into
lance-format:mainfrom
hamersaw:bug/merge-insert-delete-composite-pk
Jun 29, 2026
Merged

fix(merge-insert): apply Delete/Fail on the indexed-scan path#7484
hamersaw merged 2 commits into
lance-format:mainfrom
hamersaw:bug/merge-insert-delete-composite-pk

Conversation

@hamersaw

Copy link
Copy Markdown
Contributor

Problem

An indexed merge-insert delete by a composite primary key whose columns are all indexed silently removes nothing. merge_insert(when_matched = Delete, use_index = true) reports deleted = 0, the matched rows stay live in the table, and resurface in later reads.

Root cause

WhenMatched::Delete is only implemented in the v2 plan (DeleteOnlyMergeInsertExec), which never uses a scalar index. When every join column is indexed, can_use_create_plan routes the merge to the legacy Merger instead (the only engine with the indexed-scan probe). That engine's matched-row handler only ever distinguished DoNothing from "update" — it folded both Delete and Fail into the update path:

  • a fully-indexed Delete rewrote the matched rows in place → 0 deletes;
  • a fully-indexed Fail silently updated instead of erroring.

A single indexed column hits the same path, so this was never composite-specific — composite keys just make it the common case (partial-column updates require an index on every PK column).

Fix

Make the legacy Merger dispatch every WhenMatched variant explicitly, mirroring the v2 classifier (merge_insert_action):

  • Delete collects matched row ids as deletions and emits no replacement batch.
  • Fail aborts on any match, with the same message as the v2 path.
  • A delete-only commit branch drains the merger and applies the deletions (resolving row ids → addresses via the row-id index for stable row ids) without writing fragments — this keeps the O(keys) indexed delete instead of falling back to a full table scan.
  • The partial-schema (column-rewrite) commit branch physically cannot express row deletions, so combining Delete with inserts from a partial-schema source now returns a descriptive error rather than silently dropping the deletes.

Tests

cargo test -p lance --lib dataset::write::merge_insert (152 passing), covering:

  • composite-key indexed delete (index on first / second / both / neither — the both-indexed case is the original bug);
  • single-column indexed delete;
  • multi-fragment + stable-row-id variants, including an appended fragment neither index covers (unindexed-remainder union);
  • Delete combined with InsertAll (not delete-only);
  • Fail on an indexed key (match aborts, no-match inserts cleanly).

cargo fmt and cargo clippy -p lance --lib --tests -- -D warnings are clean.

🤖 Generated with Claude Code

@github-actions github-actions Bot added the bug Something isn't working label Jun 25, 2026
@hamersaw hamersaw force-pushed the bug/merge-insert-delete-composite-pk branch from 1109fa9 to 2889963 Compare June 25, 2026 17:39
An indexed merge-insert delete by a composite primary key whose columns
are all indexed silently removed nothing: `merge_insert(when_matched =
Delete, use_index = true)` reported `deleted = 0` and the rows stayed
live, resurfacing in later reads.

`WhenMatched::Delete` is only implemented in the v2 plan
(`DeleteOnlyMergeInsertExec`), which never uses a scalar index. When
every join column is indexed, `can_use_create_plan` routes the merge to
the legacy `Merger` instead, whose matched-row handler only ever
distinguished `DoNothing` from "update" -- it folded both `Delete` and
`Fail` into the update path. So a fully-indexed delete rewrote the
matched rows in place (0 deletes), and a fully-indexed `Fail` silently
updated instead of erroring. A single indexed column hits the same path,
so this was never composite-specific.

Make the legacy `Merger` dispatch every `WhenMatched` variant explicitly,
mirroring the v2 classifier (`merge_insert_action`):
- `Delete` collects matched row ids as deletions and emits no replacement.
- `Fail` aborts on any match with the same message as the v2 path.
- A delete-only commit branch drains the merger and applies the
  deletions (resolving ids to addresses via the row-id index for stable
  row ids) without writing fragments -- keeping the O(keys) indexed
  delete rather than falling back to a full table scan.
- The partial-schema update commit branch cannot express deletions, so
  combining `Delete` with inserts from a partial-schema source now
  returns a descriptive error instead of silently dropping the deletes.

Tests cover composite and single-column indexed delete, multi-fragment
and stable-row-id variants, an unindexed-remainder fragment, delete
combined with insert, and `Fail` on an indexed key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hamersaw hamersaw force-pushed the bug/merge-insert-delete-composite-pk branch from 2889963 to 6bf1eee Compare June 25, 2026 18:15
@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 89.80716% with 37 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/write/merge_insert.rs 88.88% 30 Missing and 7 partials ⚠️

📢 Thoughts on this report? Let us know!

@hamersaw hamersaw requested a review from jackye1995 June 26, 2026 16:54
Comment thread rust/lance/src/dataset/write/merge_insert.rs
Comment thread rust/lance/src/dataset/write/merge_insert.rs Outdated
…and v2

Address two review comments on the merge-insert delete paths.

A source with duplicate keys matching the same target row was handled
inconsistently by deletes: the row id was counted once per joined match while
the commit deleted it a single time (over-reported num_deleted_rows), and the
default source_dedupe_behavior::Fail — honored for updates — was silently
ignored. Make every delete engine apply the same policy as updates:

  * v1 legacy Merger: dedupe matched row ids via processed_row_ids; on a repeat,
    Fail aborts (naming the ambiguous key) and FirstSeen skips + counts a
    skipped duplicate.
  * v2 FullSchemaMergeInsertExec (Delete + InsertAll): mirror the UpdateAll arm.
    Target-only deletes from delete_not_matched_by_source share the action but
    never duplicate, so they never trip Fail.
  * v2 DeleteOnlyMergeInsertExec: thread source_dedupe_behavior + on_columns into
    collect_deletions, detect duplicates via the treemap insert, and fold the
    skipped-duplicate metric into its (previously hardcoded-0) stats.

Deletes now count each removed row once and reject ambiguous sources by default,
matching update semantics.

The second fix: a partial-schema source combining Delete with InsertAll was
forced onto the indexed-scan path (which can't fold a delete into a partial
write) and rejected as NotSupported. Keep that combination off the scalar-index
route so it falls through to the v2 plan, which fills omitted nullable columns.

Tests cover Fail/FirstSeen source-duplicate deletes on the v1 indexed path and
both v2 plans, plus a fully-indexed partial-schema delete+insert.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hamersaw hamersaw requested a review from jackye1995 June 29, 2026 14:40
@hamersaw hamersaw merged commit 6a2a7d4 into lance-format:main Jun 29, 2026
30 checks passed
@hamersaw hamersaw deleted the bug/merge-insert-delete-composite-pk branch June 29, 2026 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants