fix: improve FM index query performance #7507
) -> Result<()> {
    let texts = std::mem::replace(partition, Vec::with_capacity(max_rows.min(PARTITION_SIZE)));
    *partition_bytes = 0;
    if let Some(file) = completed_files.remove(&partition_id) {
Reusing an existing partition solely by `partition_id` can publish stale data. If a retry uses the same UUID after the input or partition sizing changed, this drops the freshly built `texts` and returns the old file; the loader later scans all `part_*_fm.lance` files in the directory, so stale partitions can still be included in exact `contains` results.
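One way to address this concern is to reuse a completed file only when it was built from the same inputs. The following is a minimal sketch, not the PR's code: `CompletedFile`, `fingerprint`, `reusable_file`, and the `u64` partition id are all hypothetical names and types introduced for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical record of a partition file written by an earlier attempt.
#[derive(Debug, Clone, PartialEq)]
struct CompletedFile {
    path: String,
    /// Fingerprint of the inputs that produced the file.
    fingerprint: u64,
}

/// Hash the partition contents together with the sizing parameter, so a
/// change to either one yields a different fingerprint.
/// Assumption: `DefaultHasher` is enough for a sketch; a fingerprint that is
/// persisted across builds would need a hash with a stable, documented output.
fn fingerprint(texts: &[String], max_rows: usize) -> u64 {
    let mut hasher = DefaultHasher::new();
    max_rows.hash(&mut hasher);
    texts.hash(&mut hasher);
    hasher.finish()
}

/// Return the earlier file only if it matches the freshly built inputs.
/// The entry is removed from the map either way; on a mismatch the caller
/// is expected to rebuild the partition and replace the old file.
fn reusable_file(
    completed_files: &mut HashMap<u64, CompletedFile>,
    partition_id: u64,
    texts: &[String],
    max_rows: usize,
) -> Option<CompletedFile> {
    let expected = fingerprint(texts, max_rows);
    match completed_files.remove(&partition_id) {
        Some(file) if file.fingerprint == expected => Some(file),
        _ => None,
    }
}
```

Matching on content rather than on the id alone means a retry with changed input or sizing falls through to a rebuild. The stale file still has to be deleted or overwritten, since the loader scans every `part_*_fm.lance` file in the directory.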
#[arg(
    long,
    default_value = "az://datasets/mmlb/mmlb_100m_fts_en_fm_20260626.lance"
This write-capable tool defaults to a real Azure dataset. With ambient credentials, running `fm_index_tool drop` or `create` without `--uri` can delete or replace the shared index because those actions call `drop_index` / `.replace(true)`.
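A possible mitigation is to require an explicit URI for destructive actions and allow a default only for read-only ones. The sketch below uses the standard library only and does not reproduce the tool's real argument parsing: the `Action` enum (including the `Stats` variant), `resolve_uri`, and the placeholder default URI are assumptions made for illustration.

```rust
/// Hypothetical subset of the tool's actions. `Create` and `Drop` mirror the
/// actions named in the review; `Stats` is an invented read-only example.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Create,
    Drop,
    Stats,
}

impl Action {
    /// Actions that can delete or replace an index.
    fn is_destructive(self) -> bool {
        matches!(self, Action::Create | Action::Drop)
    }
}

/// Placeholder default for read-only actions (not a real dataset).
const READ_ONLY_DEFAULT_URI: &str = "file:///tmp/example.lance";

/// Resolve the dataset URI for an action.
/// Destructive actions never fall back to a default.
fn resolve_uri(action: Action, uri: Option<&str>) -> Result<String, String> {
    match uri {
        // An explicit, non-blank URI is always accepted.
        Some(u) if !u.trim().is_empty() => Ok(u.to_string()),
        // Missing or blank URI on a write path: refuse instead of guessing.
        _ if action.is_destructive() => {
            Err(format!("{action:?} requires an explicit --uri"))
        }
        // Read-only actions may use the harmless placeholder default.
        _ => Ok(READ_ONLY_DEFAULT_URI.to_string()),
    }
}
```

With this shape, running a drop or create without `--uri` fails fast instead of touching a shared dataset, while read-only commands stay convenient.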
Summary:
- `contains` queries so normal reads demand-load needed wavelet blocks and page nearby blocks instead of prewarming whole partitions.
- `contains` benchmark and FM index management tooling.

Optimizations applied:
- `LANCE_FMINDEX_PREWARM_CHUNK_BYTES` and `LANCE_FMINDEX_PREWARM_CHUNK_CONCURRENCY`.
- `LANCE_FMINDEX_DEMAND_PAGE_BYTES` to reduce object-store RPS.
- `--index-uuid` recovery support.
- `LANCE_FMINDEX_PARTITION_ROWS=100000`, large partition-byte cap, 1,000 partitions.

Performance:
- Dataset: 100M-row `az://datasets/mmlb/mmlb_100m_fts_en_fm_20260626.lance`.
- Query workload: 4 sampled 5-term patterns from `summary_in_image`, `contains(full_content, pattern)`, `k=100`, `_rowid` only (`projection=[]`, `row_id=true`).
- Index size: `78600545-625f-40e2-8790-204f35097ec0`: 1,000 partition files, 920,064,767,792 bytes / 856.9 GiB / 0.837 TiB.
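The partition count and size figures above can be cross-checked with a few lines of arithmetic. This is a small sketch that assumes "100M-row" means exactly 100,000,000 rows and that GiB and TiB are binary units (2^30 and 2^40 bytes); the helper names are hypothetical.

```rust
const GIB: f64 = 1024.0 * 1024.0 * 1024.0; // 2^30 bytes
const TIB: f64 = GIB * 1024.0; // 2^40 bytes

/// Number of partitions needed to hold `total_rows` (ceiling division).
fn partitions_needed(total_rows: u64, rows_per_partition: u64) -> u64 {
    (total_rows + rows_per_partition - 1) / rows_per_partition
}

/// Convert a byte count to binary gibibytes.
fn to_gib(bytes: u64) -> f64 {
    bytes as f64 / GIB
}

/// Convert a byte count to binary tebibytes.
fn to_tib(bytes: u64) -> f64 {
    bytes as f64 / TIB
}

/// Average partition file size in bytes (integer division).
fn avg_partition_bytes(total_bytes: u64, partitions: u64) -> u64 {
    total_bytes / partitions
}
```

Under those assumptions, `partitions_needed(100_000_000, 100_000)` gives the 1,000 partitions quoted in the summary, and `to_gib(920_064_767_792)` and `to_tib(920_064_767_792)` round to the reported 856.9 GiB and 0.837 TiB.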