feat(mem_wal): support prefiltered LSM vector and FTS search#7138

Open
touch-of-grey wants to merge 4 commits into lance-format:main from touch-of-grey:LsmPrefilter

@touch-of-grey touch-of-grey commented Jun 6, 2026

Adds true prefilter support for LSM vector and full-text search across base, flushed, and in-memory memtable sources.

This preserves newest-PK semantics before top-k for rewritten active rows, keeps HNSW/FTS limit pushdown for append-only PK memtables, and threads vector filters through Python and Java bindings with input validation.

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the enhancement New feature or request label Jun 6, 2026
@github-actions github-actions Bot added A-python Python bindings A-java Java bindings + JNI labels Jun 6, 2026
@jackye1995

Thanks for the fix! Since we are already refactoring here, let's also make sure the MemTable scanner is consistent with the other scanners.

@touch-of-grey touch-of-grey force-pushed the LsmPrefilter branch 2 times, most recently from edfa4f1 to e2c2698, June 6, 2026 20:14

The MemWAL LSM vector and full-text search planners ignored a user WHERE
predicate, so a filtered search returned rows that the filter should have
excluded. Only the plain LSM scan honored filters.

Add prefilter support to both LSM search planners via with_filter(Option<Expr>):

- Base and flushed arms reuse the dataset scanner's native prefilter.
- The active/frozen memtable arms apply the predicate before the top-k cut: the
  brute-force vector exec masks rows in compute_topk (a filtered vector search
  routes to brute force rather than HNSW), and the FTS exec masks the
  materialized full-schema hits before projection.
- LsmScanner::full_text_search forwards its filter to the FTS planner.

This is a true prefilter (matching a normal filtered scan), not a lossy
post-filter on the per-source top-k.
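
To illustrate why the ordering matters, here is a minimal standalone sketch of prefilter versus post-filter top-k. It is not code from this PR: `Row`, `prefilter_topk`, `postfilter_topk`, and `sample_rows` are hypothetical names, and a precomputed scalar `distance` stands in for a real vector distance.

```rust
/// Toy row (hypothetical): an id, a precomputed distance to the query,
/// and an attribute the predicate can test.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: u32,
    pub distance: f32,
    pub category: &'static str,
}

/// Prefilter: mask rows with the predicate first, then keep the k nearest.
pub fn prefilter_topk(rows: &[Row], k: usize, pred: impl Fn(&Row) -> bool) -> Vec<u32> {
    let mut kept: Vec<&Row> = rows.iter().filter(|r| pred(*r)).collect();
    kept.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    kept.into_iter().take(k).map(|r| r.id).collect()
}

/// Post-filter: keep the k nearest first, then drop rows failing the predicate.
/// This can return fewer than k rows even when k matching rows exist.
pub fn postfilter_topk(rows: &[Row], k: usize, pred: impl Fn(&Row) -> bool) -> Vec<u32> {
    let mut all: Vec<&Row> = rows.iter().collect();
    all.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    all.into_iter()
        .take(k)
        .filter(|r| pred(*r))
        .map(|r| r.id)
        .collect()
}

/// Four rows where the two nearest do not match `category == "b"`.
pub fn sample_rows() -> Vec<Row> {
    vec![
        Row { id: 1, distance: 0.1, category: "a" },
        Row { id: 2, distance: 0.2, category: "a" },
        Row { id: 3, distance: 0.3, category: "b" },
        Row { id: 4, distance: 0.4, category: "b" },
    ]
}
```

With `k = 2` and the predicate `category == "b"`, the prefilter variant returns ids 3 and 4, while the post-filter variant returns nothing, because both of the two nearest rows are discarded after the cut.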

LsmScanner gains a Scanner-aligned (owned) builder so an LSM read reads like a
normal scan. `nearest()` and `full_text_search()` become state setters and
`create_plan()` dispatches to the vector, FTS, point-lookup, or plain planner,
mirroring `Scanner::create_plan` / `MemTableScanner::create_plan`.

- Add `nearest`, `nprobes`, `refine`, `distance_metric`. Vector search now runs
  through the builder, which wraps LsmVectorSearchPlanner and honors the builder
  filter.
- `full_text_search(FullTextSearchQuery)` is now a setter (column from the query,
  k from `limit`) instead of returning a plan directly.
- Align `project` (`<T: AsRef<str>>` + `Result`) and `limit`
  (`Option<i64>, Option<i64>` + `Result`) with `Scanner`.

Knobs the LSM planner cannot yet honor (ef, distance_range, maximum_nprobes,
with_row_id) are intentionally not exposed to avoid silently ignoring them.
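
The "state setters plus `create_plan()` dispatch" shape can be sketched with a toy owned builder. This is an illustration only, not the `LsmScanner` API: `ToyLsmScanner`, `ToyPlan`, the string-based filter, and the choice to reject a scanner with both a vector and an FTS query are all assumptions, and the point-lookup path is omitted.

```rust
/// Hypothetical plan variants; the real planner returns execution plans.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyPlan {
    Vector { column: String, k: usize, filter: Option<String> },
    Fts { query: String, filter: Option<String> },
    Plain { filter: Option<String> },
}

/// Hypothetical owned builder: setters record state, nothing is planned yet.
#[derive(Debug, Default)]
pub struct ToyLsmScanner {
    nearest: Option<(String, usize)>,
    fts: Option<String>,
    filter: Option<String>,
}

impl ToyLsmScanner {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record the predicate; every plan kind below carries it.
    pub fn filter(mut self, expr: &str) -> Self {
        self.filter = Some(expr.to_string());
        self
    }

    /// State setter for vector search; validates input instead of planning.
    pub fn nearest(mut self, column: &str, k: usize) -> Result<Self, String> {
        if k == 0 {
            return Err("k must be positive".to_string());
        }
        self.nearest = Some((column.to_string(), k));
        Ok(self)
    }

    /// State setter for full-text search.
    pub fn full_text_search(mut self, query: &str) -> Self {
        self.fts = Some(query.to_string());
        self
    }

    /// Dispatch on the recorded state. Rejecting "both set" is an assumption
    /// made for this sketch, not documented behavior.
    pub fn create_plan(&self) -> Result<ToyPlan, String> {
        let filter = self.filter.clone();
        match (&self.nearest, &self.fts) {
            (Some(_), Some(_)) => Err("vector and FTS queries both set".to_string()),
            (Some((column, k)), None) => Ok(ToyPlan::Vector {
                column: column.clone(),
                k: *k,
                filter,
            }),
            (None, Some(query)) => Ok(ToyPlan::Fts { query: query.clone(), filter }),
            (None, None) => Ok(ToyPlan::Plain { filter }),
        }
    }
}
```

A call chain such as `ToyLsmScanner::new().filter("x > 1").nearest("vec", 5)?.create_plan()?` then yields the vector variant with the filter attached, which is the property the commit message describes: the search path honors the same builder filter as a plain scan.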
@github-actions github-actions Bot added the A-deps Dependency updates label Jun 27, 2026
@jackye1995 jackye1995 changed the title feat(mem_wal): support prefilters in LSM vector and full-text search feat(mem_wal): support filtered LSM vector and FTS reads Jun 28, 2026
…h PK

The active-memtable vector and full-text search arms applied the prefilter
predicate inside the per-source exec, before the within-source dedup that
collapses an in-memtable update's duplicate-PK appends to the newest version.
When the newest version of a PK failed the predicate but an older version
passed, the filter dropped the newest and the dedup kept the stale older
match — returning a row whose current version should have been excluded. This
broke the "true prefilter == normal filtered scan" contract on the active arm
(flushed/base were already correct: the deletion vector / block-list remove
superseded rows before the filter).

Evaluate the predicate against the newest version of each PK:

- MemTableBruteForceVectorExec and FtsIndexExec take the primary-key columns and,
  when filtering, drop superseded versions (via compute_pk_hash) before applying
  the predicate, so a newer non-matching version excludes the PK.
- LsmScanner plumbs pk_columns into the active MemTableScanner arms.

Also align MemTableScanner's builder with the dataset Scanner (the API
consistency this work started from): project (AsRef + Result), limit
(Option<i64> + Result), nearest (&dyn Array + Result), and full_text_search
(FullTextSearchQuery, converted to the local query model) now match Scanner.
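
The ordering bug described in this commit can be reproduced with a small standalone sketch. It is not the PR's implementation: `Version`, `newest_per_pk`, `filter_then_dedup`, and `dedup_then_filter` are hypothetical names, a plain `u64` key stands in for the hash produced by `compute_pk_hash`, and a sequence number stands in for append order.

```rust
use std::collections::HashMap;

/// One appended version of a row in the toy memtable (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Version {
    pub pk: u64,
    pub seq: u64, // higher = newer append
    pub status: &'static str,
}

/// Keep only the newest version of each primary key, sorted by pk.
fn newest_per_pk<'a>(rows: impl Iterator<Item = &'a Version>) -> Vec<Version> {
    let mut newest: HashMap<u64, &'a Version> = HashMap::new();
    for row in rows {
        let slot = newest.entry(row.pk).or_insert(row);
        if row.seq > slot.seq {
            *slot = row;
        }
    }
    let mut out: Vec<Version> = newest.into_values().cloned().collect();
    out.sort_by_key(|v| v.pk);
    out
}

/// Old order: filter first, then dedup. A stale matching version survives
/// when the newest version of the same pk fails the predicate.
pub fn filter_then_dedup(rows: &[Version], pred: impl Fn(&Version) -> bool) -> Vec<Version> {
    newest_per_pk(rows.iter().filter(|r| pred(*r)))
}

/// Fixed order: drop superseded versions first, then apply the predicate,
/// so a newer non-matching version excludes the pk entirely.
pub fn dedup_then_filter(rows: &[Version], pred: impl Fn(&Version) -> bool) -> Vec<Version> {
    newest_per_pk(rows.iter())
        .into_iter()
        .filter(|v| pred(v))
        .collect()
}
```

For a memtable holding pk 1 as "active" (seq 1) then "archived" (seq 2), plus pk 2 as "active", a `status == "active"` predicate returns the stale pk 1 row under `filter_then_dedup` but only pk 2 under `dedup_then_filter`.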
@jackye1995 jackye1995 changed the title feat(mem_wal): support filtered LSM vector and FTS reads feat(mem_wal): support prefiltered LSM vector and FTS search Jun 29, 2026
Labels

A-deps (Dependency updates), A-java (Java bindings + JNI), A-python (Python bindings), enhancement (New feature or request)

2 participants