feat: expose include_metadata hashing option through Python API (PLT-1734)#5
Conversation
…PLT-1734) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR exposes an include_metadata: bool = False option across the Python ArrowDigester API to optionally incorporate Arrow schema- and field-level metadata into the schema hashing phase, aligning with the Rust crate’s two-phase hashing approach while preserving hash format 0.0.1 stability by default.
Changes:
- Added Phase 2 schema hashing that (optionally) incorporates deterministically ordered schema + nested field metadata into the same SHA-256 hasher.
- Extended
ArrowDigester.__init__,hash_schema,hash_record_batch, andhash_tablewith a keyword-onlyinclude_metadataflag (defaultFalse). - Added a comprehensive new test suite and updated README + design docs describing metadata hashing behavior and invariants.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/starfix/arrow_digester.py |
Implements metadata collection/sorting and optional Phase 2 schema hashing; exposes include_metadata across public entry points. |
tests/test_metadata_hashing.py |
Adds coverage for determinism, invariants, and behavioral differences when metadata is included/excluded. |
README.md |
Documents how to opt into metadata hashing and the empty-metadata invariant. |
docs/metamorphic/specs/2026-06-18-include-metadata-hashing-design.md |
Captures the intended algorithm, path conventions, and API changes for the feature. |
docs/metamorphic/plans/2026-06-18-include-metadata-hashing.md |
Provides an implementation plan corresponding to the design/spec. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| from typing import TYPE_CHECKING | ||
|
|
||
| if TYPE_CHECKING: | ||
| from hashlib import _Hash | ||
|
|
||
| import pyarrow as pa |
There was a problem hiding this comment.
Fixed. Removed the from hashlib import _Hash import from the TYPE_CHECKING block and replaced it with a small module-level _Hasher Protocol that declares the single update(data: bytes | bytearray) -> None method. No private stdlib types referenced anywhere now.
Review round summaryOne change made in response to Copilot's two comments (both pointing at the same issue): Replace
|
Summary
include_metadata: bool = False(keyword-only) toArrowDigester.__init__,hash_schema,hash_record_batch, andhash_tablestarfixcrate (PLT-1733): Phase 1 hashes the structural schema unchanged; Phase 2 (wheninclude_metadata=True) feeds compact JSON of all field/schema metadata into the same hasherhash_arrayintentionally has noinclude_metadata— standalone arrays carry no schema/field metadata context, matching the Rust APIFalsepreserves hash format 0.0.1 stability — all existing golden tests pass unchangedTest plan
uv run pytest tests/ -q(129 pre-existing + 36 new intests/test_metadata_hashing.py)TestSortMetadata,TestCollectNestedFieldMetadata,TestFieldPathSorting,TestEmptyMetadataInvariant,TestFieldMetadataChangesHash,TestMetadataExcludedByDefault,TestSchemaMetadataChangesHash,TestMetadataDeterminism,TestEmptyMetadataInvariantFull,TestRoundTripinclude_metadata=Falseproduces identical output to hash format 0.0.1Closes PLT-1734
🤖 Generated with Claude Code