feat(mem_wal): support prefiltered LSM vector and FTS search #7138
Open
touch-of-grey wants to merge 4 commits into
Thanks for the fix! Since we are already refactoring here, let's also make sure the MemTable scanner is consistent with the other scanners.
The MemWAL LSM vector and full-text search planners ignored a user WHERE predicate, so a filtered search returned rows that the predicate should have excluded. Only the plain LSM scan honored filters.

Add prefilter support to both LSM search planners via with_filter(Option<Expr>):

- Base and flushed arms reuse the dataset scanner's native prefilter.
- The active/frozen memtable arms apply the predicate before the top-k cut: the brute-force vector exec masks rows in compute_topk (a filtered vector search routes to brute force rather than HNSW), and the FTS exec masks the materialized full-schema hits before projection.
- LsmScanner::full_text_search forwards its filter to the FTS planner.

This is a true prefilter (matching a normal filtered scan), not a lossy post-filter on the per-source top-k.
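To illustrate why masking before the top-k cut matters, here is a minimal, self-contained sketch of a brute-force search over an in-memory source. It is not the code in this PR: `Row`, `l2_sq`, `topk_prefilter`, and `topk_postfilter` are hypothetical names, and the predicate is a plain closure rather than an `Expr`.

```rust
/// Hypothetical row of an in-memory source: an id, a vector, and one filterable column.
#[derive(Debug, Clone)]
pub struct Row {
    pub id: u64,
    pub vector: Vec<f32>,
    pub category: &'static str,
}

/// Squared L2 distance between two equal-length vectors.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Prefilter: mask rows with the predicate first, then keep the k nearest survivors.
pub fn topk_prefilter(
    rows: &[Row],
    query: &[f32],
    k: usize,
    pred: impl Fn(&Row) -> bool,
) -> Vec<u64> {
    let mut scored: Vec<(f32, u64)> = rows
        .iter()
        .filter(|r| pred(*r))
        .map(|r| (l2_sq(&r.vector, query), r.id))
        .collect();
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}

/// Post-filter: keep the k nearest rows first, then apply the predicate.
/// This can return fewer than k rows even when k matching rows exist.
pub fn topk_postfilter(
    rows: &[Row],
    query: &[f32],
    k: usize,
    pred: impl Fn(&Row) -> bool,
) -> Vec<u64> {
    let mut scored: Vec<(f32, &Row)> = rows
        .iter()
        .map(|r| (l2_sq(&r.vector, query), r))
        .collect();
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored
        .into_iter()
        .take(k)
        .filter(|(_, r)| pred(*r))
        .map(|(_, r)| r.id)
        .collect()
}
```

With four rows at distances 0, 1, 2, 3 from the query, where only the second and third match the predicate, `topk_prefilter` with k = 2 returns both matches, while `topk_postfilter` returns only one because the nearest non-matching row consumed a top-k slot.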
LsmScanner gains a Scanner-aligned (owned) builder so an LSM read reads like a normal scan. `nearest()` and `full_text_search()` become state setters and `create_plan()` dispatches to the vector, FTS, point-lookup, or plain planner, mirroring `Scanner::create_plan` / `MemTableScanner::create_plan`.

- Add `nearest`, `nprobes`, `refine`, `distance_metric` (vector search folds the LsmVectorSearchPlanner behind the builder; honors the builder filter).
- `full_text_search(FullTextSearchQuery)` is now a setter (column from the query, k from `limit`) instead of returning a plan directly.
- Align `project` (`<T: AsRef<str>>` + `Result`) and `limit` (`Option<i64>, Option<i64>` + `Result`) with `Scanner`.

Knobs the LSM planner cannot yet honor (ef, distance_range, maximum_nprobes, with_row_id) are intentionally not exposed to avoid silently ignoring them.
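The "state setters plus one dispatching `create_plan()`" shape described above can be sketched roughly as follows. This is an illustration only, not the LsmScanner API: `LsmScanSketch` and `PlanKind` are hypothetical, the filter is a plain string instead of an `Expr`, the point-lookup arm is omitted, and checking the vector query before the FTS query is an assumption.

```rust
/// Which planner a hypothetical LSM scan would dispatch to.
#[derive(Debug, Clone, PartialEq)]
pub enum PlanKind {
    Vector { column: String, k: usize, filter: Option<String> },
    FullText { column: String, k: Option<usize>, filter: Option<String> },
    Scan { filter: Option<String> },
}

/// Hypothetical owned builder: each setter consumes and returns the builder.
#[derive(Debug, Default)]
pub struct LsmScanSketch {
    filter: Option<String>,
    nearest: Option<(String, Vec<f32>, usize)>,
    fts_column: Option<String>,
    limit: Option<usize>,
}

impl LsmScanSketch {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record the predicate; every planner arm receives it.
    pub fn filter(mut self, expr: &str) -> Self {
        self.filter = Some(expr.to_string());
        self
    }

    /// State setter for vector search; validates input instead of planning.
    pub fn nearest(mut self, column: &str, query: Vec<f32>, k: usize) -> Result<Self, String> {
        if query.is_empty() || k == 0 {
            return Err("nearest requires a non-empty query and k > 0".to_string());
        }
        self.nearest = Some((column.to_string(), query, k));
        Ok(self)
    }

    /// State setter for full-text search; k is taken from `limit` at plan time.
    pub fn full_text_search(mut self, column: &str) -> Self {
        self.fts_column = Some(column.to_string());
        self
    }

    /// Validates limit and offset; the offset is checked but not used in this sketch.
    pub fn limit(mut self, limit: Option<i64>, offset: Option<i64>) -> Result<Self, String> {
        if limit.map_or(false, |l| l < 0) || offset.map_or(false, |o| o < 0) {
            return Err("limit and offset must be non-negative".to_string());
        }
        self.limit = limit.map(|l| l as usize);
        Ok(self)
    }

    /// Single dispatch point (assumed order: vector, then FTS, then plain scan).
    pub fn create_plan(&self) -> PlanKind {
        if let Some((column, _query, k)) = &self.nearest {
            PlanKind::Vector { column: column.clone(), k: *k, filter: self.filter.clone() }
        } else if let Some(column) = &self.fts_column {
            PlanKind::FullText { column: column.clone(), k: self.limit, filter: self.filter.clone() }
        } else {
            PlanKind::Scan { filter: self.filter.clone() }
        }
    }
}
```

Because the filter lives on the builder rather than on a specific planner, every arm chosen by `create_plan()` sees the same predicate, which is the property the commit relies on.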
…h PK

The active-memtable vector and full-text search arms applied the prefilter predicate inside the per-source exec, before the within-source dedup that collapses an in-memtable update's duplicate-PK appends to the newest version. When the newest version of a PK failed the predicate but an older version passed, the filter dropped the newest and the dedup kept the stale older match, returning a row whose current version should have been excluded. This broke the "true prefilter == normal filtered scan" contract on the active arm (flushed/base were already correct: the deletion vector / block-list remove superseded rows before the filter).

Evaluate the predicate against the newest version of each PK:

- MemTableBruteForceVectorExec and FtsIndexExec take the primary-key columns and, when filtering, drop superseded versions (via compute_pk_hash) before applying the predicate, so a newer non-matching version excludes the PK.
- LsmScanner plumbs pk_columns into the active MemTableScanner arms.

Also align MemTableScanner's builder with the dataset Scanner (the API consistency this work started from): project (AsRef + Result), limit (Option<i64> + Result), nearest (&dyn Array + Result), and full_text_search (FullTextSearchQuery, converted to the local query model) now match Scanner.
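The ordering bug described above can be reproduced in a few lines. The sketch below is an assumption-laden illustration, not the exec code: `Versioned`, `newest_per_pk`, `filter_then_dedup`, and `dedup_then_filter` are hypothetical, rows carry an explicit sequence number, and the primary key is compared directly rather than through `compute_pk_hash`.

```rust
use std::collections::HashMap;

/// Hypothetical memtable row: primary key, append sequence number, one filterable column.
#[derive(Debug, Clone, PartialEq)]
pub struct Versioned {
    pub pk: u64,
    pub seq: u64,
    pub status: &'static str,
}

/// Keep only the newest version (highest seq) of each primary key, ordered by pk.
pub fn newest_per_pk(rows: &[Versioned]) -> Vec<Versioned> {
    let mut newest: HashMap<u64, &Versioned> = HashMap::new();
    for r in rows {
        newest
            .entry(r.pk)
            .and_modify(|cur| {
                if r.seq > cur.seq {
                    *cur = r;
                }
            })
            .or_insert(r);
    }
    let mut out: Vec<Versioned> = newest.into_values().cloned().collect();
    out.sort_by_key(|r| r.pk);
    out
}

/// Buggy order: filter first, then dedup. A stale version can survive
/// when the newest version of the same pk fails the predicate.
pub fn filter_then_dedup(rows: &[Versioned], pred: impl Fn(&Versioned) -> bool) -> Vec<u64> {
    let passing: Vec<Versioned> = rows.iter().filter(|r| pred(*r)).cloned().collect();
    newest_per_pk(&passing).into_iter().map(|r| r.pk).collect()
}

/// Fixed order: drop superseded versions first, then evaluate the predicate
/// against the newest version only.
pub fn dedup_then_filter(rows: &[Versioned], pred: impl Fn(&Versioned) -> bool) -> Vec<u64> {
    newest_per_pk(rows)
        .into_iter()
        .filter(|r| pred(r))
        .map(|r| r.pk)
        .collect()
}
```

For a key whose older version has `status = "open"` and whose newest version has `status = "closed"`, a filter on `"open"` returns that key under `filter_then_dedup` but not under `dedup_then_filter`, matching what a normal filtered scan would return.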
Adds true prefilter support for LSM vector and full-text search across base, flushed, and in-memory memtable sources.
This preserves newest-PK semantics before top-k for rewritten active rows, keeps HNSW/FTS limit pushdown for append-only PK memtables, and threads vector filters through Python and Java bindings with input validation.