fix: improve FM index query performance #7507

Merged: jackye1995 merged 11 commits into lance-format:main from jackye1995:jack/fix-fmindex-query-performance on Jun 29, 2026
@jackye1995 commented Jun 28, 2026

Contributor

Summary:

  • Improve FM contains queries so normal reads demand-load needed wavelet blocks and page nearby blocks instead of prewarming whole partitions.
  • Add chunked explicit FM prewarm, parallel partition loading/search, and resumable partition builds.
  • Add FM contains benchmark and FM index management tooling.
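The chunked prewarm mentioned above can be pictured as splitting one contiguous byte range of wavelet rows into fixed-size chunks and issuing them with bounded concurrency. The following is a minimal sketch under that assumption; `chunk_ranges` and `waves` are hypothetical helpers written for illustration and are not taken from `fmindex.rs`.

```rust
use std::ops::Range;

/// Split the contiguous byte range `[start, end)` into chunks of at most
/// `chunk_bytes` bytes. Hypothetical helper, not the PR's implementation.
fn chunk_ranges(start: u64, end: u64, chunk_bytes: u64) -> Vec<Range<u64>> {
    assert!(chunk_bytes > 0, "chunk size must be non-zero");
    let mut ranges = Vec::new();
    let mut offset = start;
    while offset < end {
        // Clamp the last chunk so it never runs past `end`.
        let next = offset.saturating_add(chunk_bytes).min(end);
        ranges.push(offset..next);
        offset = next;
    }
    ranges
}

/// Group chunks into waves of at most `concurrency` requests, one simple way
/// to bound the number of reads in flight at any time.
fn waves(ranges: &[Range<u64>], concurrency: usize) -> Vec<Vec<Range<u64>>> {
    assert!(concurrency > 0, "concurrency must be non-zero");
    ranges.chunks(concurrency).map(|w| w.to_vec()).collect()
}
```

In this picture, `chunk_bytes` and `concurrency` would play the roles of `LANCE_FMINDEX_PREWARM_CHUNK_BYTES` and `LANCE_FMINDEX_PREWARM_CHUNK_CONCURRENCY`; how the PR actually schedules the reads is not shown in this conversation.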

Optimizations applied:

  • Parallel FM search across partitions/segments.
  • Removed query-time full-partition prewarm; explicit prewarm still fully warms the index.
  • Chunked contiguous wavelet-row prewarm with LANCE_FMINDEX_PREWARM_CHUNK_BYTES and LANCE_FMINDEX_PREWARM_CHUNK_CONCURRENCY.
  • Cold-read demand paging with LANCE_FMINDEX_DEMAND_PAGE_BYTES to reduce object-store RPS.
  • Parallelized FM metadata/partition loading.
  • Added resumable FM partition creation and explicit --index-uuid recovery support.
  • Final benchmark layout: 1 logical FM segment, LANCE_FMINDEX_PARTITION_ROWS=100000, large partition-byte cap, 1,000 partitions.
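To illustrate why demand paging can lower object-store request rates, the sketch below widens a requested block to page boundaries so that nearby blocks fall into the same fetched range. It assumes page-aligned reads and a fallback page size of 1 MiB; both are assumptions made for this example, since the PR description does not state the alignment rule or the default value.

```rust
use std::env;
use std::ops::Range;

/// Assumed fallback page size; the real default is not stated in the PR.
const ASSUMED_DEFAULT_PAGE_BYTES: u64 = 1 << 20;

/// Parse a page size, falling back to the assumed default when the value is
/// missing, unparsable, or zero.
fn parse_page_bytes(raw: Option<&str>) -> u64 {
    raw.and_then(|v| v.trim().parse::<u64>().ok())
        .filter(|&v| v > 0)
        .unwrap_or(ASSUMED_DEFAULT_PAGE_BYTES)
}

/// Read the page size from the environment variable named in the PR.
fn demand_page_bytes() -> u64 {
    parse_page_bytes(env::var("LANCE_FMINDEX_DEMAND_PAGE_BYTES").ok().as_deref())
}

/// Widen the requested block `[offset, offset + len)` to page boundaries and
/// clamp it to the file length, so one request also covers nearby blocks.
fn page_range(offset: u64, len: u64, page_bytes: u64, file_len: u64) -> Range<u64> {
    let start = (offset / page_bytes) * page_bytes;
    let end_unaligned = offset + len;
    // Round the end up to the next page boundary.
    let end = ((end_unaligned + page_bytes - 1) / page_bytes) * page_bytes;
    start..end.min(file_len)
}
```

A larger page size trades extra bytes per request for fewer requests, which is the trade-off the `LANCE_FMINDEX_DEMAND_PAGE_BYTES` knob appears to expose.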

Performance:
Dataset: 100M-row az://datasets/mmlb/mmlb_100m_fts_en_fm_20260626.lance.
Query workload: 4 sampled 5-term patterns from summary_in_image, contains(full_content, pattern), k=100, _rowid only (projection=[], row_id=true).

| Run | Index layout | Prewarm | Query result |
|---|---|---|---|
| Baseline | 6 segments / 10k partitions | 6,525s / 108.8m | 1t: 1.58 qps, mean 633ms, p95 768ms; 8t: 12.48 qps, mean 320ms, p95 320ms |
| Demand-load + chunked prewarm | 6 segments / 10k partitions | 2,182s / 36.4m, 3.0x faster | 1t: 5.38 qps, mean 185ms, p95 307ms; 8t: 12.12 qps, mean 326ms, p95 330ms |
| Single segment | 1 segment / 100k-row partitions | 356s / 5.94m, 18.3x faster than baseline | 1t: 22.38 qps, mean 44.6ms, p95 54.4ms; 8t: 84.40 qps, mean 42.8ms, p95 47.3ms |
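The prewarm speedups quoted in the results can be reproduced from the raw timings. The helpers below are only a worked-arithmetic sketch; the function names are illustrative.

```rust
/// Ratio of the baseline duration to the new duration.
fn speedup(baseline_secs: f64, new_secs: f64) -> f64 {
    baseline_secs / new_secs
}

/// Convert seconds to minutes.
fn minutes(secs: f64) -> f64 {
    secs / 60.0
}
```

`speedup(6525.0, 2182.0)` gives roughly 3.0 and `speedup(6525.0, 356.0)` gives roughly 18.3, matching the reported figures. Applying the same ratio to throughput, the single-segment layout is roughly 14x the baseline at 1 thread (22.38 vs 1.58 qps) and roughly 6.8x at 8 threads (84.40 vs 12.48 qps).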

Index size:

  • Final FM index UUID 78600545-625f-40e2-8790-204f35097ec0: 1,000 partition files, 920,064,767,792 bytes / 856.9 GiB / 0.837 TiB.
  • Previous 6-segment FM layout was 920,202,650,467 bytes / 857.0 GiB, so the final relayout is effectively size-neutral.
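As a quick check of the size figures, the byte counts can be converted to GiB and compared directly. This is a small arithmetic sketch with illustrative helper names.

```rust
/// Bytes per GiB (2^30).
const GIB: f64 = (1u64 << 30) as f64;

/// Convert a byte count to GiB.
fn gib(bytes: u64) -> f64 {
    bytes as f64 / GIB
}

/// Relative size change from `old` to `new` (negative means smaller).
fn relative_change(old: u64, new: u64) -> f64 {
    (new as f64 - old as f64) / old as f64
}
```

With the numbers above, `relative_change(920_202_650_467, 920_064_767_792)` is a decrease of about 0.015%, which supports the "size-neutral" description.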

@github-actions (bot) added the labels A-index (Vector index, linalg, tokenizer) and bug (Something isn't working) on Jun 28, 2026
@codecov (bot) commented Jun 28, 2026

Codecov Report

❌ Patch coverage is 56.38629% with 560 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/bin/fm_contains_bench.rs | 0.00% | 379 Missing ⚠️ |
| rust/lance/src/bin/fm_index_tool.rs | 0.00% | 98 Missing ⚠️ |
| rust/lance-index/src/scalar/fmindex.rs | 89.71% | 45 Missing and 38 partials ⚠️ |


Comment thread rust/lance-index/src/scalar/fmindex.rs Outdated
) -> Result<()> {
    let texts = std::mem::replace(partition, Vec::with_capacity(max_rows.min(PARTITION_SIZE)));
    *partition_bytes = 0;
    if let Some(file) = completed_files.remove(&partition_id) {

Collaborator


Reusing an existing partition solely by partition_id can publish stale data. If a retry uses the same UUID after the input or partition sizing changed, this drops the freshly built texts and returns the old file; the loader later scans all part_*_fm.lance files in the directory, so stale partitions can still be included in exact contains results.
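One way to address this concern is to reuse a completed partition only when a fingerprint of the inputs that produced it still matches. The sketch below assumes such a fingerprint is recorded alongside each completed file; `CompletedPartition`, `fingerprint`, and `reusable_file` are hypothetical names, and `DefaultHasher` is used only to keep the example dependency-free (a real resumable build would want a hash that is stable across Rust releases).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical record of a partition finished by an earlier attempt.
#[derive(Debug, Clone, PartialEq)]
struct CompletedPartition {
    file: String,
    fingerprint: u64,
}

/// Fingerprint the inputs that determine a partition's contents:
/// the texts themselves and the sizing parameter.
fn fingerprint(texts: &[String], max_rows: usize) -> u64 {
    let mut hasher = DefaultHasher::new();
    max_rows.hash(&mut hasher);
    texts.hash(&mut hasher);
    hasher.finish()
}

/// Return the old file only if it was built from the same inputs.
/// A mismatched entry is dropped from the map so the caller rebuilds.
fn reusable_file(
    completed: &mut HashMap<u64, CompletedPartition>,
    partition_id: u64,
    texts: &[String],
    max_rows: usize,
) -> Option<String> {
    let expected = fingerprint(texts, max_rows);
    match completed.remove(&partition_id) {
        Some(done) if done.fingerprint == expected => Some(done.file),
        _ => None,
    }
}
```

This sketch only decides whether to reuse a file. Because the loader scans every `part_*_fm.lance` file in the directory, stale files left on disk would still need to be deleted or excluded separately.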

Comment thread rust/lance/src/bin/fm_index_tool.rs Outdated

#[arg(
    long,
    default_value = "az://datasets/mmlb/mmlb_100m_fts_en_fm_20260626.lance"

Collaborator


This write-capable tool defaults to a real Azure dataset. With ambient credentials, running fm_index_tool drop or create without --uri can delete or replace the shared index because those actions call drop_index / .replace(true).
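A guard along these lines would make destructive actions fail unless the target is named explicitly. The sketch uses only the standard library rather than the tool's actual argument parser; the `Action` enum (including the read-only `Describe` variant) and `resolve_uri` are hypothetical and not taken from `fm_index_tool.rs`.

```rust
/// Hypothetical subset of the tool's actions. `Create` and `Drop` mirror the
/// actions named in the review; `Describe` stands in for any read-only action.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Create,
    Drop,
    Describe,
}

impl Action {
    /// Actions that can delete or replace an index.
    fn is_destructive(self) -> bool {
        matches!(self, Action::Create | Action::Drop)
    }
}

/// Pick the dataset URI. Destructive actions never fall back to a default;
/// read-only actions may use `read_only_default` when one is configured.
fn resolve_uri(
    action: Action,
    uri: Option<&str>,
    read_only_default: Option<&str>,
) -> Result<String, String> {
    match (uri, action.is_destructive()) {
        (Some(explicit), _) => Ok(explicit.to_string()),
        (None, true) => Err(format!("{action:?} requires an explicit --uri")),
        (None, false) => read_only_default
            .map(str::to_string)
            .ok_or_else(|| "no --uri given".to_string()),
    }
}
```

A simpler alternative may be to remove `default_value` from the argument so that `--uri` is always required.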

@Xuanwo left a comment

Collaborator


Thank you!

@jackye1995 merged commit 6d02a57 into lance-format:main on Jun 29, 2026
30 checks passed