feat(search): abort split warmup when a required term is missing#6511
feat(search): abort split warmup when a required term is missing#6511PSeitz-dd wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 87be0f56e7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
A leaf search warms up a split by downloading term dictionaries, posting lists, fast fields and field norms before running the query. When the query has a term that *must* match (a single-term clause reachable through `must`/`filter` only) and that term's posting list is empty in the split, the query provably matches nothing there — so the rest of the warmup is wasted work. This extracts the set of required terms from the query AST and, during warmup, cancels the remaining downloads (fast fields, field norms, other postings) as soon as a required term is found missing, returning an empty result for the split. `warm_postings` already reports term existence, so no extra lookups are added; the existence signal is simply no longer discarded. - quickwit-query: `required_terms` extraction + `build_tantivy_query_and_required_terms`. - quickwit-doc-mapper: `WarmupInfo::required_terms`. - quickwit-search: race warmup against a `CancellationToken` fired by `warm_up_terms`; abort returns a counted empty response that still reports its (aborted) download volume; new `pruned_empty_term` metric. The optimization is gated on single-segment splits (sound per split) and on exact `TermQuery` leaves on indexed fields; everything else is treated conservatively, so it never discards a query that could match. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
87be0f5 to
e489247
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. Keep it up! Reviewed commit: ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
| ) | ||
| .instrument(debug_span!("warm_up_automatons")); | ||
|
|
||
| tokio::try_join!( |
There was a problem hiding this comment.
as a follow-up, i think it'd be interesting to record how many splits where cancelled vs not, and if that ratio goes over something small (10% maybe?), run warm_up_terms_future first, and then the rest (if warm_up_terms_future didn't cancel everything already)
|
how is Object Storage Download measured? If it's self reported, i think it misrepresent the result: we record s3 download at the end of the call if it succeeded. if it died partway, i think we report 0 and not whatever was actually downloaded |
Abort split warmup when a required term is missing to not fully download fast fields.
A leaf search warms up a split by downloading term dictionaries, posting lists, fast fields and field norms before running the query. When the query has a term that must match (a single-term clause reachable through
must/filteronly) and that term's posting list is empty in the split, the query provably matches nothing there — so the rest of the warmup is wasted work. Moreover it downloads fastfields and may put unnecessary pressure on the cache.This extracts the set of required terms from the query AST and, during warmup, cancels the remaining downloads (fast fields, field norms, other postings) as soon as a required term is found missing, returning an empty result for the split.
warm_postingsalready reports term existence, which is used here.required_termsextraction +build_tantivy_query_and_required_terms.WarmupInfo::required_terms.CancellationTokenfired bywarm_up_terms; abort returns a counted empty response that still reports its (aborted) download volume; newpruned_empty_termmetric.Benchmarks
Comparing: main with ff cache (baseline), main, abort
list_7d_low_hitsis a single selective term.We can see the downloaded data is reduced, the number of get requests is unchanged (requests still run in parallel). CPU time is reduced due to reduced IO.