feat(fts): add configurable posting block size #7466
Important: This PR touches the Lance format specification. Substantive changes to the format specification [...] If this is a meaningful format change: [...]
@claude review
Xuanwo
left a comment
I think this needs a compatibility boundary before merge.
- `block_size=256` changes the persisted posting-block layout, but the stored index version still looks like the existing FTS format. Older readers will ignore the new metadata/details field and try to decode the blocks as legacy 128-doc `BitPacker4x` blocks. That can fail open as wrong FTS results or decode panics instead of cleanly ignoring the index. We should either bump the FTS/index version for non-legacy block sizes or reject writing 256 until older readers can be gated out.
- Legacy segments with no `block_size` and newly written default-128 segments with `block_size=128` are semantically identical, but the multi-segment details check compares the raw protobuf values. Mixed old/new default-128 segments can be rejected as inconsistent. The comparison should canonicalize missing `block_size` to 128 before comparing.
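The two review points can be sketched roughly as follows. This is only an illustration in Python, not the Lance implementation: the function names, the version constants, and the assumption that a missing protobuf field surfaces as `None` or `0` are all hypothetical.

```python
from typing import Iterable, Optional

# Block size implied by segments written before the field existed.
LEGACY_BLOCK_SIZE = 128

# Hypothetical version numbers, used only to illustrate the gating idea.
LEGACY_INDEX_VERSION = 1
GATED_INDEX_VERSION = 2


def canonical_block_size(raw: Optional[int]) -> int:
    """Map a missing block_size (assumed to appear as None or 0) to 128."""
    return LEGACY_BLOCK_SIZE if not raw else raw


def segments_consistent(raw_block_sizes: Iterable[Optional[int]]) -> bool:
    """Compare segments on canonical values instead of raw protobuf values."""
    return len({canonical_block_size(raw) for raw in raw_block_sizes}) <= 1


def index_version_for(raw: Optional[int]) -> int:
    """Keep the legacy version for 128-doc blocks, bump it for anything else."""
    if canonical_block_size(raw) == LEGACY_BLOCK_SIZE:
        return LEGACY_INDEX_VERSION
    return GATED_INDEX_VERSION
```

With this shape, a legacy segment and a newly written default-128 segment compare equal, while a 256 segment gets a version that older readers would not recognize.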
Feature

Linear: OSS-1344

What is the new feature?

FTS inverted index creation now accepts a `block_size` parameter for compressed posting blocks. Supported values are `128` and `256`.

Why do we need this feature?

The posting block size was previously fixed at `128`, which made the block-max granularity impossible to tune for different datasets and query profiles.

How does it work?

- […] `block_size` to `InvertedIndexParams`, protobuf details, posting-list schema metadata, and cache headers.
- […] `128` as the default for newly created indexes.
- […] `block_size` as legacy `128`.
- […] `512`, with a clear validation error.
- […] `BitPacker4x` for physical 128-value posting blocks and `BitPacker8x` for physical 256-value posting blocks.
- […] `block_size=256` as experimental in public API docs because it may introduce breaking changes.
- […] `block_size=128`, since older wheels cannot read current-created physical 256 FTS posting blocks.

Validation

- `cargo fmt --all`
- `cargo fmt --all --check`
- `git diff --check`
- `CARGO_TARGET_DIR=/tmp/lance-target-a479-no512 cargo test -p lance-index block_size -- --nocapture`
- `CARGO_TARGET_DIR=/tmp/lance-target-a479-no512 cargo clippy -p lance-index --tests -- -D warnings`
- `uv run make build` from `python/`
- `uv run pytest python/tests/test_scalar_index.py::test_create_scalar_index_fts_block_size` from `python/`
- `uv run ruff format --check python/tests/test_scalar_index.py python/lance/dataset.py` from `python/`
- `uv run ruff check python/tests/test_scalar_index.py python/lance/dataset.py` from `python/`
- `CARGO_TARGET_DIR=/tmp/lance-target-a479-merge-main cargo test -p lance-index block_size -- --nocapture`
- `CARGO_TARGET_DIR=/tmp/lance-target-a479-merge-main cargo test -p lance-index test_256_posting_block_uses_single_physical_bitpack_chunk -- --nocapture`
- `CARGO_TARGET_DIR=/tmp/lance-target-a479-merge-main cargo test -p lance-bitpacking`
- `CARGO_TARGET_DIR=/tmp/lance-target-a479-merge-main cargo clippy -p lance-bitpacking -p lance-index --tests -- -D warnings`
- `uv run ruff format --check python/tests/compat/test_scalar_indices.py` from `python/`
- `uv run ruff check python/tests/compat/test_scalar_indices.py` from `python/`
- `uv run pytest --run-compat -vvv -s python/tests/compat/test_scalar_indices.py::test_FtsIndex_downgrade --durations=30` from `python/`
- `CARGO_TARGET_DIR=/tmp/lance-a479-target cargo test -p lance-index test_new_training_request_defaults_missing_block_size_to_128`
- `CARGO_TARGET_DIR=/tmp/lance-a479-target cargo test -p lance-index block_size`
- `uv run ruff format --check python/lance/dataset.py` from `python/`
- `uv run ruff check python/lance/dataset.py` from `python/`

Not run locally: Java focused test / spotless check, because this machine has no Java Runtime installed (`Unable to locate a Java Runtime`).
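The parameter rules from the description (default of 128, supported values 128 and 256, a validation error for 512, and one packer per physical block size) can be sketched as below. This is an illustrative Python model only: `resolve_block_size` and `packer_for` are hypothetical names rather than functions from this PR, and the packer names are plain strings standing in for the Rust types.

```python
from typing import Optional

SUPPORTED_BLOCK_SIZES = (128, 256)
DEFAULT_BLOCK_SIZE = 128

# One packer per physical block size, as stated in the PR description.
PACKER_BY_BLOCK_SIZE = {128: "BitPacker4x", 256: "BitPacker8x"}


def resolve_block_size(requested: Optional[int] = None) -> int:
    """Return the block size to use, defaulting to 128 when none is given."""
    if requested is None:
        return DEFAULT_BLOCK_SIZE
    if requested not in SUPPORTED_BLOCK_SIZES:
        supported = ", ".join(str(size) for size in SUPPORTED_BLOCK_SIZES)
        raise ValueError(
            f"unsupported FTS block_size {requested}; supported values: {supported}"
        )
    return requested


def packer_for(block_size: int) -> str:
    """Name the packer used for a validated block size."""
    return PACKER_BY_BLOCK_SIZE[resolve_block_size(block_size)]
```

Validating once at the parameter boundary means an unsupported value such as 512 fails before any posting blocks are written.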