feat(index): support Utf8View patterns in the ngram TextQueryParser#7310

Open
wombatu-kun wants to merge 3 commits into
lance-format:mainfrom
wombatu-kun:feat/ngram-utf8view

@wombatu-kun

Contributor

Follow-up to #7139.

Problem

The ngram TextQueryParser extracts string patterns and regex flags by matching only ScalarValue::Utf8 and ScalarValue::LargeUtf8. When the indexed column is Utf8View, safe_coerce_scalar (rust/lance-datafusion/src/expr.rs) coerces the predicate literal to ScalarValue::Utf8View(Some(..)), which the parser then fails to match. As a result an ngram index on a Utf8View column silently does not accelerate contains, regexp_like, or infix LIKE; they fall back to a full scan.

Change

Add the Utf8View arm to all three string-extraction sites in TextQueryParser: the contains / regexp_like pattern in visit_scalar_function, the infix-LIKE pattern in visit_like, and the regex-flags literal in apply_regex_flags. The bindings are unchanged because all three string ScalarValue variants carry an Option<String>.
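
As a rough illustration of the change described above, the sketch below contrasts a matcher that only accepts `Utf8` / `LargeUtf8` with one that also accepts `Utf8View`. The `ScalarValue` enum here is a hypothetical stand-in for DataFusion's type, and `extract_pattern_before` / `extract_pattern_after` are illustrative names, not functions from the Lance code base.

```rust
// Hypothetical stand-in for DataFusion's `ScalarValue`; only the string
// variants matter for this sketch.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
    Int64(Option<i64>),
}

// Pre-change shape: a `Utf8View` literal falls through to `None`, so the
// caller gives up on the index and the query becomes a full scan.
fn extract_pattern_before(value: &ScalarValue) -> Option<&str> {
    match value {
        ScalarValue::Utf8(Some(s)) | ScalarValue::LargeUtf8(Some(s)) => Some(s.as_str()),
        _ => None,
    }
}

// Post-change shape: the extra arm reuses the same binding, because all
// three string variants carry an `Option<String>`.
fn extract_pattern_after(value: &ScalarValue) -> Option<&str> {
    match value {
        ScalarValue::Utf8(Some(s))
        | ScalarValue::LargeUtf8(Some(s))
        | ScalarValue::Utf8View(Some(s)) => Some(s.as_str()),
        _ => None,
    }
}
```

With these definitions, a `ScalarValue::Utf8View(Some("red".to_string()))` literal yields `None` from the first function and `Some("red")` from the second.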

Tests

  • New test_text_query_parser_utf8view: asserts contains and regexp_like over a Utf8View-typed ngram index route to the index (StringContains / Regex queries), and that infix LIKE with a Utf8View pattern is accelerated, with a Utf8 parity control.
  • Extended test_apply_regex_flags with a Utf8View flags literal.

Each test fails on the pre-change code and passes after.

Out of scope

The identical pre-existing gap in SargableQueryParser (starts_with / LIKE-prefix) is left untouched: it is a separate feature that #7139 never modified, so folding it in here would widen the scope of this PR. It can be a separate change if desired.

This addresses the non-blocking review comment #7139 (comment) from @wjones127.

@github-actions github-actions Bot added the enhancement (New feature or request) and A-index (Vector index, linalg, tokenizer) labels Jun 17, 2026
@codecov

codecov Bot commented Jun 17, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/expression.rs | 93.33% | 2 Missing ⚠️ |

@wombatu-kun wombatu-kun force-pushed the feat/ngram-utf8view branch 2 times, most recently from 21ea955 to 84ec6cf Compare June 18, 2026 14:46
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request Jun 18, 2026
Follow-up to lance-format#7310, which added Utf8View handling to the ngram TextQueryParser and explicitly left the identical gap in SargableQueryParser out of scope. The BTree/ZoneMap parser only matched Utf8 / LargeUtf8 for starts_with and infix-free LIKE prefixes, so a Utf8View predicate literal was dropped and the query silently fell back to a full scan instead of using the scalar index.

Unlike the ngram path (where the pattern is only ever used as a regex string), here the parser emits a SargableQuery::LikePrefix whose ScalarValue flows downstream into the BTree, which compares the query bound against Utf8 page statistics with Arrow's type-dispatched comparator. A Utf8View bound cannot be compared against Utf8 stats arrays. Because Lance already normalizes Utf8View columns to Utf8 at write time (the stored index data is always Utf8), the fix normalizes a Utf8View prefix to Utf8 in the parser rather than threading a new type through the shared comparison code.
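
A minimal sketch of the normalization described above, under stated assumptions: `ScalarValue` and `SargableQuery` are hypothetical stand-ins for the real DataFusion and Lance types, `like_prefix_query` is an illustrative name, and passing `Utf8` / `LargeUtf8` through unchanged is an assumption rather than something this commit message states.

```rust
// Hypothetical stand-ins for the real types; only what the sketch needs.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
}

#[derive(Debug, Clone, PartialEq)]
enum SargableQuery {
    LikePrefix(ScalarValue),
}

// Builds a prefix query from a string literal. A `Utf8View` literal is
// rewrapped as `Utf8` so the bound has the same type as the stored data.
fn like_prefix_query(literal: &ScalarValue) -> Option<SargableQuery> {
    match literal {
        // Assumption: these two variants are forwarded as they are.
        ScalarValue::Utf8(Some(_)) | ScalarValue::LargeUtf8(Some(_)) => {
            Some(SargableQuery::LikePrefix(literal.clone()))
        }
        // The normalization step: same string, different variant.
        ScalarValue::Utf8View(Some(s)) => {
            Some(SargableQuery::LikePrefix(ScalarValue::Utf8(Some(s.clone()))))
        }
        // Null literals produce no index query in this sketch.
        _ => None,
    }
}
```

The design point is that the conversion happens once, at the parser boundary, so nothing downstream has to learn about a third string variant.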

Adds test_sargable_query_parser_utf8view, which exercises visit_scalar_function (starts_with) and visit_like directly with Utf8View literals and asserts the resulting LikePrefix(Utf8) query, with a Utf8 parity control. The test fails on the pre-change parser (the Utf8View literal is dropped) and passes after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

let index_info = MockIndexInfoProvider::new(vec![(
    "color",
    ColInfo::new(
        DataType::Utf8View,

Collaborator

The regression test models an ngram index on a Utf8View field, but the real NGram plugin still rejects Utf8View fields during index creation. This records an unreachable production contract and can hide regressions in the actual query path.

Contributor Author

Done in db595b8. I verified that no shipped path (native filter or DataFusion pushdown) delivers a Utf8View literal here: the column normalizes to Utf8 and the contains/regexp_like pattern is coerced to it, so that arm was unreachable. I dropped it, along with the mock that modeled an impossible Utf8View-typed index. I kept Utf8View only on the uncoerced visit_like / regex-flags paths (still reachable via a filter built programmatically and pushed through scan.filter_expr), which are now covered by a direct visit_like test.

@wombatu-kun wombatu-kun requested a review from Xuanwo June 21, 2026 09:38
// The infix-LIKE pattern reaches `visit_like` uncoerced (verbatim from the
// `Expr`), so a programmatically-built filter passed through
// `scan.filter_expr` can carry a `Utf8View` literal that the parser must
// still match - the same defensive contract #7139 / #7351 established for

Collaborator

The test comment says the Utf8View LIKE behavior follows an already established BTree parser contract, but that contract is not present in this tree. This can mislead future changes by implying Utf8View LIKE-prefix support exists outside the ngram parser.

Contributor Author

Done in 25efdf7: removed the BTree SargableQueryParser reference.

@wombatu-kun wombatu-kun force-pushed the feat/ngram-utf8view branch from 25efdf7 to 4cfc0b2 Compare June 21, 2026 12:22
@wombatu-kun wombatu-kun requested a review from Xuanwo June 21, 2026 12:23
Xuanwo pushed a commit that referenced this pull request Jun 24, 2026
Follow-up to #7310.

## Problem

#7310 added Utf8View support to the ngram TextQueryParser and noted the
identical pre-existing gap in SargableQueryParser (starts_with /
LIKE-prefix) as out of scope. The BTree/ZoneMap parser extracts string
prefixes by matching only ScalarValue::Utf8 and ScalarValue::LargeUtf8.
When the predicate literal is coerced to ScalarValue::Utf8View, the
parser drops it, so starts_with(col, 'x') and col LIKE 'x%' do not use
the scalar index and fall back to a full scan.
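
To make the `col LIKE 'x%'` case concrete, here is a small sketch of how a prefix could be recovered from a LIKE pattern. It is an illustration only: `like_prefix` is a hypothetical helper, it ignores escape characters, and it accepts only patterns whose single wildcard is a trailing `%`. The real parser may treat more pattern shapes as sargable.

```rust
// Returns the literal prefix of a LIKE pattern of the form `literal%`.
// Assumptions of this sketch: no escape character handling, and any other
// wildcard (`%` or `_`) inside the prefix disqualifies the pattern.
fn like_prefix(pattern: &str) -> Option<&str> {
    // The pattern must end with exactly the trailing `%` we strip here.
    let prefix = pattern.strip_suffix('%')?;
    // An empty prefix matches everything, so there is nothing to search for.
    if prefix.is_empty() {
        return None;
    }
    // Any remaining wildcard means this is not a pure prefix match.
    if prefix.contains(|c: char| c == '%' || c == '_') {
        return None;
    }
    Some(prefix)
}
```

Under these rules `'x%'` yields the prefix `x`, while `'%x%'`, `'a_c%'`, and a bare `'x'` yield nothing and would be left to other code paths.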

## Change

SargableQueryParser::visit_scalar_function (starts_with) and visit_like
now accept a Utf8View literal/pattern and normalize the extracted prefix
to ScalarValue::Utf8.

Normalize rather than preserve the variant: unlike the ngram path (where
the pattern is just a regex string), the SargableQueryParser emits a
SargableQuery::LikePrefix whose ScalarValue flows into the BTree. Page
pruning (pages_between) compares the query bound against Utf8
page-statistics arrays with Arrow's type-dispatched make_comparator,
which rejects a Utf8View bound vs Utf8 stats ("Can't compare arrays of
different types"). Lance already normalizes Utf8View columns to Utf8 at
write time, so the stored index data is always Utf8; normalizing the
prefix to Utf8 matches that and needs no changes to the shared
comparison code.
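
The type mismatch described above can be sketched with a toy type-dispatched comparator. Everything in this block is a hypothetical stand-in: `DataType`, `StringArray`, and `make_comparator_sketch` only mimic the shape of the behavior and are not Arrow's actual API.

```rust
use std::cmp::Ordering;

// Hypothetical stand-ins for Arrow's data types and string arrays.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    Utf8View,
}

struct StringArray {
    data_type: DataType,
    values: Vec<String>,
}

type Comparator<'a> = Box<dyn Fn(usize, usize) -> Ordering + 'a>;

// Builds a comparator over two arrays, refusing to mix types. This mirrors
// the failure mode quoted above, not the real implementation.
fn make_comparator_sketch<'a>(
    left: &'a StringArray,
    right: &'a StringArray,
) -> Result<Comparator<'a>, String> {
    if left.data_type != right.data_type {
        return Err(format!(
            "Can't compare arrays of different types: {:?} vs {:?}",
            left.data_type, right.data_type
        ));
    }
    // Same type on both sides: compare element i of left with j of right.
    Ok(Box::new(move |i: usize, j: usize| {
        left.values[i].cmp(&right.values[j])
    }))
}
```

In this model a `Utf8View` query bound compared against `Utf8` statistics fails before any values are inspected, which is why converting the bound to `Utf8` up front is the smaller change.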

## Tests

New test_sargable_query_parser_utf8view exercises visit_scalar_function
(starts_with) and visit_like directly with Utf8View literals/patterns,
asserting the emitted LikePrefix(Utf8) query and recheck behavior, plus
a Utf8 parity control. It fails on the pre-change parser and passes
after.

---------

Co-authored-by: Vova Kolmakov <wombatukun@apache.org>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wombatu-kun wombatu-kun force-pushed the feat/ngram-utf8view branch from 4cfc0b2 to 0153800 Compare June 24, 2026 09:13
Vova Kolmakov and others added 3 commits June 29, 2026 08:34
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The ngram contains / regexp_like path coerces its pattern argument to the indexed column's type via maybe_scalar, and an ngram-indexed column is always Utf8 (Lance normalizes Utf8View to Utf8 at write time), so the coerced literal is never Utf8View there. The Utf8View arm in visit_scalar_function was unreachable, and the regression test only reached it by declaring a mock column as Utf8View - an index that cannot exist.

Drop the dead Utf8View arm from visit_scalar_function. Keep Utf8View handling on the LIKE pattern (visit_like) and regex flags (apply_regex_flags), which take their literal verbatim from the Expr and so can carry a Utf8View literal from a filter built programmatically and pushed through scan.filter_expr - the same defensive contract lance-format#7139 / lance-format#7351 established for the BTree SargableQueryParser. Replace the mock-driven test with a direct visit_like check fed a Utf8View pattern, with a Utf8 parity control.
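
The reachability argument above can be sketched as two paths into the parser. The types and function names are hypothetical stand-ins: `coerced_path` models a literal that has been coerced to an ngram-indexed (always Utf8) column, and `verbatim_path` models a literal taken directly from the expression.

```rust
// Hypothetical stand-in for DataFusion's `ScalarValue`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
}

// Models the contains / regexp_like path: the literal is coerced to the
// indexed column's type, assumed here to always be Utf8.
fn coerced_path(literal: ScalarValue) -> ScalarValue {
    let inner = match literal {
        ScalarValue::Utf8(s) | ScalarValue::LargeUtf8(s) | ScalarValue::Utf8View(s) => s,
    };
    ScalarValue::Utf8(inner)
}

// Models the LIKE pattern / regex flags path: the literal reaches the
// parser exactly as the caller wrote it.
fn verbatim_path(literal: ScalarValue) -> ScalarValue {
    literal
}

// True when the parser would need a `Utf8View` arm to handle this value.
fn needs_utf8view_arm(value: &ScalarValue) -> bool {
    matches!(value, ScalarValue::Utf8View(_))
}
```

In this model a `Utf8View` literal never survives the coerced path, so an arm for it there is dead code, while the verbatim path can still deliver one.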
@wombatu-kun wombatu-kun force-pushed the feat/ngram-utf8view branch from 0153800 to 753175b Compare June 29, 2026 01:43

2 participants