feat(index): support Utf8View patterns in the ngram TextQueryParser #7310
wombatu-kun wants to merge 3 commits into
21ea955 to 84ec6cf
Follow-up to lance-format#7310, which added Utf8View handling to the ngram TextQueryParser and explicitly left the identical gap in SargableQueryParser out of scope.

The BTree/ZoneMap parser only matched Utf8 / LargeUtf8 for starts_with and infix-free LIKE prefixes, so a Utf8View predicate literal was dropped and the query silently fell back to a full scan instead of using the scalar index.

Unlike the ngram path (where the pattern is only ever used as a regex string), here the parser emits a SargableQuery::LikePrefix whose ScalarValue flows downstream into the BTree, which compares the query bound against Utf8 page statistics with Arrow's type-dispatched comparator. A Utf8View bound cannot be compared against Utf8 stats arrays. Because Lance already normalizes Utf8View columns to Utf8 at write time (the stored index data is always Utf8), the fix normalizes a Utf8View prefix to Utf8 in the parser rather than threading a new type through the shared comparison code.

Adds test_sargable_query_parser_utf8view, which exercises visit_scalar_function (starts_with) and visit_like directly with Utf8View literals and asserts the resulting LikePrefix(Utf8) query, with a Utf8 parity control. The test fails on the pre-change parser (the Utf8View literal is dropped) and passes after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
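The normalization step described in this commit message can be illustrated with a minimal sketch. The `Scalar` enum below is a hypothetical stand-in for the string variants of DataFusion's `ScalarValue`, and `normalize_prefix` is an illustrative name, not a function from the Lance codebase. Leaving `LargeUtf8` unchanged is an assumption; the commit message only states that a Utf8View prefix becomes Utf8.

```rust
/// Hypothetical stand-in for the relevant `ScalarValue` variants.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
    Int64(Option<i64>),
}

/// Returns the prefix bound to hand to the index, or `None` when the
/// literal is not a non-null string (the predicate is then not indexable).
fn normalize_prefix(literal: &Scalar) -> Option<Scalar> {
    match literal {
        // Already handled before the change: pass through untouched
        // (assumption: LargeUtf8 is not rewritten).
        Scalar::Utf8(Some(_)) | Scalar::LargeUtf8(Some(_)) => Some(literal.clone()),
        // The new arm: rewrite Utf8View to Utf8 so the bound has the same
        // type as the stored Utf8 page statistics.
        Scalar::Utf8View(Some(s)) => Some(Scalar::Utf8(Some(s.clone()))),
        // Null strings and non-string literals are not usable as a prefix.
        _ => None,
    }
}
```

Normalizing at the parser boundary keeps the downstream comparison code unaware of the extra variant, which is the trade-off the commit message argues for.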
let index_info = MockIndexInfoProvider::new(vec![(
    "color",
    ColInfo::new(
        DataType::Utf8View,
The regression test models an ngram index on a Utf8View field, but the real NGram plugin still rejects Utf8View fields during index creation. This records an unreachable production contract and can hide regressions in the actual query path.
Done db595b8. Verified that no shipped path (native filter or DataFusion pushdown) delivers a Utf8View literal here - the column normalizes to Utf8 and the contains/regexp_like pattern is coerced to it - so that arm was unreachable; dropped it and the mock that modeled an impossible Utf8View-typed index. Kept Utf8View only on the uncoerced visit_like / regex-flags paths (still reachable via a filter built programmatically and pushed through scan.filter_expr), now covered by a direct visit_like test.
// The infix-LIKE pattern reaches `visit_like` uncoerced (verbatim from the
// `Expr`), so a programmatically-built filter passed through
// `scan.filter_expr` can carry a `Utf8View` literal that the parser must
// still match - the same defensive contract #7139 / #7351 established for
The test comment says the Utf8View LIKE behavior follows an already established BTree parser contract, but that contract is not present in this tree. This can mislead future changes by implying Utf8View LIKE-prefix support exists outside the ngram parser.
Done 25efdf7 - removed the BTree SargableQueryParser reference
25efdf7 to 4cfc0b2
Follow-up to #7310.

## Problem

#7310 added Utf8View support to the ngram TextQueryParser and noted the identical pre-existing gap in SargableQueryParser (starts_with / LIKE-prefix) as out of scope. The BTree/ZoneMap parser extracts string prefixes by matching only ScalarValue::Utf8 and ScalarValue::LargeUtf8. When the predicate literal is coerced to ScalarValue::Utf8View, the parser drops it, so starts_with(col, 'x') and col LIKE 'x%' do not use the scalar index and fall back to a full scan.

## Change

SargableQueryParser::visit_scalar_function (starts_with) and visit_like now accept a Utf8View literal/pattern and normalize the extracted prefix to ScalarValue::Utf8.

Normalize rather than preserve the variant: unlike the ngram path (where the pattern is just a regex string), the SargableQueryParser emits a SargableQuery::LikePrefix whose ScalarValue flows into the BTree. Page pruning (pages_between) compares the query bound against Utf8 page-statistics arrays with Arrow's type-dispatched make_comparator, which rejects a Utf8View bound vs Utf8 stats ("Can't compare arrays of different types"). Lance already normalizes Utf8View columns to Utf8 at write time, so the stored index data is always Utf8; normalizing the prefix to Utf8 matches that and needs no changes to the shared comparison code.

## Tests

New test_sargable_query_parser_utf8view exercises visit_scalar_function (starts_with) and visit_like directly with Utf8View literals/patterns, asserting the emitted LikePrefix(Utf8) query and recheck behavior, plus a Utf8 parity control. It fails on the pre-change parser and passes after.

---------

Co-authored-by: Vova Kolmakov <wombatukun@apache.org>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
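To make the page-pruning argument concrete, here is a small sketch of a type-dispatched comparator that refuses mismatched string types. `StrArray` and `make_str_comparator` are hypothetical stand-ins written for this illustration; they are not Arrow's `make_comparator` or its array types, and only mimic the behaviour the message describes (same-type inputs compare, mixed Utf8View/Utf8 inputs are rejected).

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for an Arrow string array; only the type tag matters.
#[derive(Debug)]
enum StrArray {
    Utf8(Vec<String>),
    Utf8View(Vec<String>),
}

type Comparator<'a> = Box<dyn Fn(usize, usize) -> Ordering + 'a>;

/// Builds a comparator over (left[i], right[j]) only when both sides have
/// the same physical type; otherwise returns an error instead of casting.
fn make_str_comparator<'a>(
    left: &'a StrArray,
    right: &'a StrArray,
) -> Result<Comparator<'a>, String> {
    match (left, right) {
        (StrArray::Utf8(l), StrArray::Utf8(r))
        | (StrArray::Utf8View(l), StrArray::Utf8View(r)) => {
            Ok(Box::new(move |i, j| l[i].cmp(&r[j])))
        }
        // A Utf8View query bound against Utf8 page statistics lands here.
        _ => Err("Can't compare arrays of different types".to_string()),
    }
}
```

With a comparator of this shape, a Utf8View bound would fail before any page is pruned, so rewriting the bound to Utf8 in the parser avoids touching the shared comparison path.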
4cfc0b2 to 0153800
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The ngram contains / regexp_like path coerces its pattern argument to the indexed column's type via maybe_scalar, and an ngram-indexed column is always Utf8 (Lance normalizes Utf8View to Utf8 at write time), so the coerced literal is never Utf8View there. The Utf8View arm in visit_scalar_function was unreachable, and the regression test only reached it by declaring a mock column as Utf8View - an index that cannot exist.

Drop the dead Utf8View arm from visit_scalar_function. Keep Utf8View handling on the LIKE pattern (visit_like) and regex flags (apply_regex_flags), which take their literal verbatim from the Expr and so can carry a Utf8View literal from a filter built programmatically and pushed through scan.filter_expr - the same defensive contract lance-format#7139 / lance-format#7351 established for the BTree SargableQueryParser.

Replace the mock-driven test with a direct visit_like check fed a Utf8View pattern, with a Utf8 parity control.
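The reachability argument in this commit message can be sketched as two paths that treat a literal differently. Everything below (`StrType`, `Literal`, `stored_type`, `coerced`, `verbatim`) is a hypothetical model written for illustration, not Lance or DataFusion code; it assumes only what the message states, namely that stored columns are always Utf8 and that the coerced path casts the literal to the stored column type.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum StrType {
    Utf8,
    Utf8View,
}

#[derive(Debug, Clone, PartialEq)]
struct Literal {
    ty: StrType,
    value: String,
}

/// Stored column type: Utf8View is normalized to Utf8 at write time,
/// so the declared type never survives into the index.
fn stored_type(_declared: StrType) -> StrType {
    StrType::Utf8
}

/// Coerced path (contains / regexp_like): the literal is cast to the
/// stored column type, so it can never arrive as Utf8View.
fn coerced(lit: &Literal, declared: StrType) -> Literal {
    Literal { ty: stored_type(declared), value: lit.value.clone() }
}

/// Verbatim path (LIKE pattern, regex flags): the literal is taken from
/// the expression unchanged, so a Utf8View literal can still reach the parser.
fn verbatim(lit: &Literal) -> Literal {
    lit.clone()
}
```

Under this model, a Utf8View arm is dead code on the coerced path but still needed on the verbatim path, which matches the split the commit makes.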
0153800 to 753175b
Follow-up to #7139.
Problem
The ngram TextQueryParser extracts string patterns and regex flags by matching only ScalarValue::Utf8 and ScalarValue::LargeUtf8. When the indexed column is Utf8View, safe_coerce_scalar (rust/lance-datafusion/src/expr.rs) coerces the predicate literal to ScalarValue::Utf8View(Some(..)), which the parser then fails to match. As a result, an ngram index on a Utf8View column silently does not accelerate contains, regexp_like, or infix LIKE; they fall back to a full scan.
Change
Add the Utf8View arm to all three string-extraction sites in TextQueryParser: the contains / regexp_like pattern in visit_scalar_function, the infix-LIKE pattern in visit_like, and the regex-flags literal in apply_regex_flags. The bindings are unchanged because all three string ScalarValue variants carry an Option<String>.
Tests
- test_text_query_parser_utf8view: asserts contains and regexp_like over a Utf8View-typed ngram index route to the index (StringContains / Regex queries), and that infix LIKE with a Utf8View pattern is accelerated, with a Utf8 parity control.
- test_apply_regex_flags with a Utf8View flags literal.

Each test fails on the pre-change code and passes after.
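The remark that the bindings stay unchanged can be shown with an or-pattern: because every string variant wraps an `Option<String>`, one extra alternative binds the same name with the same type. The sketch below uses a hypothetical `Scalar` enum in place of DataFusion's `ScalarValue`, and `extract_pattern` is an illustrative helper rather than a function from the parser.

```rust
/// Hypothetical stand-in for the relevant `ScalarValue` variants.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
    Boolean(Option<bool>),
}

/// Extracts the string payload from any non-null string literal.
fn extract_pattern(literal: &Scalar) -> Option<&str> {
    match literal {
        // Adding `Utf8View` is one more alternative; `s` is `&String` in
        // all three, so the arm body needs no change.
        Scalar::Utf8(Some(s))
        | Scalar::LargeUtf8(Some(s))
        | Scalar::Utf8View(Some(s)) => Some(s.as_str()),
        // Null strings and non-string literals yield no pattern, which
        // is the case that falls back to a full scan.
        _ => None,
    }
}
```

A parity check in the spirit of the tests listed above would assert that a Utf8 and a Utf8View literal carrying the same text yield the same extracted pattern.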
Out of scope
The identical pre-existing gap in SargableQueryParser (starts_with / LIKE-prefix) is left untouched - it is a separate feature that #7139 never modified, so folding it in would be unrelated scope. It can be a separate change if desired.

This addresses the non-blocking review comment #7139 (comment) from @wjones127.