Skip to content

feat: expose include_metadata hashing option through Python API (PLT-1734)#5

Merged
eywalker merged 10 commits into
mainfrom
eywalker/plt-1734-starfix-python-expose-schemafield-metadata-hashing-option
Jun 18, 2026
Merged

feat: expose include_metadata hashing option through Python API (PLT-1734)#5
eywalker merged 10 commits into
mainfrom
eywalker/plt-1734-starfix-python-expose-schemafield-metadata-hashing-option

Conversation

@kurodo3

@kurodo3 kurodo3 Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds include_metadata: bool = False (keyword-only) to ArrowDigester.__init__, hash_schema, hash_record_batch, and hash_table
  • Implements the same two-phase SHA-256 algorithm as the Rust starfix crate (PLT-1733): Phase 1 hashes the structural schema unchanged; Phase 2 (when include_metadata=True) feeds compact JSON of all field/schema metadata into the same hasher
  • hash_array intentionally has no include_metadata — standalone arrays carry no schema/field metadata context, matching the Rust API
  • Default False preserves hash format 0.0.1 stability — all existing golden tests pass unchanged
  • Empty-metadata invariant: a schema with no metadata produces the same hash regardless of the flag

Test plan

  • All 165 tests pass: uv run pytest tests/ -q (129 pre-existing + 36 new in tests/test_metadata_hashing.py)
  • New test classes cover: TestSortMetadata, TestCollectNestedFieldMetadata, TestFieldPathSorting, TestEmptyMetadataInvariant, TestFieldMetadataChangesHash, TestMetadataExcludedByDefault, TestSchemaMetadataChangesHash, TestMetadataDeterminism, TestEmptyMetadataInvariantFull, TestRoundTrip
  • Golden value regression test confirms include_metadata=False produces identical output to hash format 0.0.1
  • Cross-language byte-for-byte parity with Rust deferred to PLT-1735

Closes PLT-1734

🤖 Generated with Claude Code

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR exposes an include_metadata: bool = False option across the Python ArrowDigester API to optionally incorporate Arrow schema- and field-level metadata into the schema hashing phase, aligning with the Rust crate’s two-phase hashing approach while preserving hash format 0.0.1 stability by default.

Changes:

  • Added Phase 2 schema hashing that (optionally) incorporates deterministically ordered schema + nested field metadata into the same SHA-256 hasher.
  • Extended ArrowDigester.__init__, hash_schema, hash_record_batch, and hash_table with a keyword-only include_metadata flag (default False).
  • Added a comprehensive new test suite and updated README + design docs describing metadata hashing behavior and invariants.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/starfix/arrow_digester.py Implements metadata collection/sorting and optional Phase 2 schema hashing; exposes include_metadata across public entry points.
tests/test_metadata_hashing.py Adds coverage for determinism, invariants, and behavioral differences when metadata is included/excluded.
README.md Documents how to opt into metadata hashing and the empty-metadata invariant.
docs/metamorphic/specs/2026-06-18-include-metadata-hashing-design.md Captures the intended algorithm, path conventions, and API changes for the feature.
docs/metamorphic/plans/2026-06-18-include-metadata-hashing.md Provides an implementation plan corresponding to the design/spec.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/starfix/arrow_digester.py Outdated
Comment on lines 13 to 18
from typing import TYPE_CHECKING

if TYPE_CHECKING:
from hashlib import _Hash

import pyarrow as pa

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Removed the from hashlib import _Hash import from the TYPE_CHECKING block and replaced it with a small module-level _Hasher Protocol that declares the single update(data: bytes | bytearray) -> None method. No private stdlib types referenced anywhere now.

Comment thread src/starfix/arrow_digester.py Outdated
@kurodo3

kurodo3 Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Review round summary

One change made in response to Copilot's two comments (both pointing at the same issue):

Replace hashlib._Hash with a local _Hasher Protocol

hashlib._Hash is a private CPython implementation detail — not guaranteed stable across Python versions or type checkers. Replaced it with a small public Protocol defined at module level:

class _Hasher(Protocol):
    """Minimal protocol for a hash accumulator (e.g. ``hashlib.sha256()``)."""

    def update(self, data: bytes | bytearray, /) -> None: ...
  • Removed from hashlib import _Hash from the TYPE_CHECKING block entirely.
  • Updated _update_metadata_hash(hasher: _Hash, ...)_update_metadata_hash(hasher: _Hasher, ...).
  • _Hasher uses only public stdlib types (Protocol, bytes, bytearray) — no private imports remain.
  • All 165 tests still pass.

@eywalker eywalker merged commit f9d194c into main Jun 18, 2026
8 checks passed
@eywalker eywalker deleted the eywalker/plt-1734-starfix-python-expose-schemafield-metadata-hashing-option branch June 18, 2026 06:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants