fix(hashing): pydantic/dataclass columns as pipeline columns (ITL-432) #185
… fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s Arrow extension types
Bug A: StarfixArrowHasher._process_table_columns left live pa.ExtensionType
columns intact after the SemanticHashingVisitor passthrough. ArrowDigester
crashed with TypeError (unhashable type) because extension types are used as
dict keys in starfix's _primitive_data_type_string.
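
To make the failure mode concrete, here is a minimal pure-Python sketch of how an unhashable object used as a dict key produces this kind of TypeError. It does not reproduce starfix or pyarrow code: `FakeExtensionType`, `TYPE_NAMES`, and `primitive_type_string` are hypothetical stand-ins, and the assumption is only that the lookup is a plain dict keyed by type objects.

```python
class FakeExtensionType:
    """Hypothetical stand-in: defines __eq__ but not __hash__, so it is unhashable."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: object) -> bool:
        return isinstance(other, FakeExtensionType) and self.name == other.name


# Hypothetical lookup table keyed by type objects (not starfix's real table).
TYPE_NAMES = {int: "int64", str: "utf8"}


def primitive_type_string(data_type: object) -> str:
    # The dict lookup hashes its key first; an unhashable key raises TypeError
    # before any comparison against the stored keys happens.
    return TYPE_NAMES[data_type]


def lookup_fails(data_type: object) -> bool:
    """Return True when the lookup fails because the key is unhashable."""
    try:
        primitive_type_string(data_type)
    except TypeError:
        return True
    return False
```

Under that assumption, replacing the offending object with a plain hashable storage type before the lookup, as the fix below does, avoids the error without changing the lookup itself.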
Fix: add normalize_extension_columns() to arrow_utils.py — converts any
top-level extension-typed column to IPC storage representation (storage type +
ARROW:extension:* field metadata) using ExtensionArray.storage with no Python
materialization. _process_table_columns calls it after the visitor loop so
ArrowDigester never sees a live pa.ExtensionType.
Bug B: PydanticLogicalType and DataclassLogicalType both called
make_polars_extension_type() without the metadata= argument. The Polars
extension type carried empty metadata, so pl.DataFrame(table).to_arrow()
passed b'' to __arrow_ext_deserialize__, which expected the category bytes
and raised ValueError.
Fix: pass metadata=json.dumps({"category": ...}) to make_polars_extension_type()
in both __init__ methods so the Polars type carries the same metadata string
as the Arrow extension type.
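
The metadata mismatch can be sketched without Arrow or Polars. The snippet below assumes the extension metadata is a UTF-8 JSON object with a `category` key, as the `json.dumps({"category": ...})` call above suggests. `serialize_category`, `deserialize_category`, and the value of `PYDANTIC_CATEGORY` are hypothetical and only illustrate why empty bytes cannot be deserialized.

```python
import json

PYDANTIC_CATEGORY = "pydantic"  # assumed value, for illustration only


def serialize_category(category: str) -> bytes:
    """Encode the category the way the fix passes it: a JSON object as bytes."""
    return json.dumps({"category": category}).encode("utf-8")


def deserialize_category(serialized: bytes) -> str:
    """Hypothetical deserializer that expects the category bytes to be present."""
    if not serialized:
        # Mirrors the reported failure: empty metadata carries no category.
        raise ValueError("empty extension metadata; expected category bytes")
    return json.loads(serialized.decode("utf-8"))["category"]


# With metadata passed through, the round-trip recovers the category.
round_tripped = deserialize_category(serialize_category(PYDANTIC_CATEGORY))
```

Calling `deserialize_category(b"")` raises `ValueError`, which is the shape of the error described for the Polars round-trip before the fix.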
Also logs the efficiency follow-up (ITL-433) in DESIGN_ISSUES.md: the
to_pylist() roundtrip is wasteful for passthrough extension columns and
should be short-circuited in a future refactor.
Closes ITL-432
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Pull request overview
This PR fixes hashing and round-trip stability for Arrow extension-typed columns representing Pydantic models and Python dataclasses, ensuring StarfixArrowHasher can safely hash tables even when extension columns pass through semantic hashing unchanged.
Changes:
- Add `normalize_extension_columns(table)` to convert top-level `pa.ExtensionType` columns into IPC/Parquet-style storage arrays plus `ARROW:extension:*` field metadata.
- Update `StarfixArrowHasher._process_table_columns` to normalize remaining extension columns after the semantic visitor, preventing `ArrowDigester` crashes on live extension types.
- Fix Polars extension type synthesis for Pydantic/Dataclass logical types by passing category metadata so `pl.DataFrame(...).to_arrow()` doesn’t fail and hashes remain stable across a Polars round-trip.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/orcapod/utils/arrow_utils.py` | Adds `normalize_extension_columns()` utility for IPC-style extension normalization. |
| `src/orcapod/hashing/arrow_hashers.py` | Normalizes extension columns after semantic processing to avoid starfix hashing failures. |
| `src/orcapod/extension_types/pydantic_logical_type_factory.py` | Passes Polars extension metadata to preserve category on round-trip. |
| `src/orcapod/extension_types/dataclass_logical_type_factory.py` | Same Polars metadata fix for dataclass extension types. |
| `tests/test_hashing/test_pydantic_dataclass_hashing.py` | Adds regression coverage for hashing + Polars round-trip for both types. |
| `DESIGN_ISSUES.md` | Documents a follow-up (ITL-433) about avoiding unnecessary Python materialization. |
| `superpowers/specs/2026-06-27-itl-432-pydantic-dataclass-pipeline-columns-design.md` | Captures the final design rationale and the two addressed bugs. |
| `CLAUDE.md` | Updates contributor guidance on PR target branch. |
| `.zed/rules` | Mirrors the PR target branch guidance for Zed tooling. |
    # combine_chunks() returns a single ExtensionArray; .storage is a
    # zero-copy view of the underlying buffers with the storage type.
    storage_arr = table.column(i).combine_chunks().storage
    serialized = ext_type.__arrow_ext_serialize__()
- Pre-register _Cfg and _Run with the type converter before creating DictSource instances; DictSource uses the default context's converter to build Arrow schemas from data_schema, so types must be registered first.
- Replace incorrect content_hash() equality assertion in the no-op filter test with a row-count and column-presence check; filtered and raw streams differ in identity_structure (different producers), so their content hashes are intentionally different even with identical data.
- Fix src.process() → src.content_hash() in DictSource tests; DictSource implements StreamProtocol directly and has no process() method.

All 23 regression tests now pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
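
The second point, that identical data can legitimately hash differently, is easier to see with a toy model. The sketch below is not orcapod's hashing scheme: `content_hash` here is a hypothetical function that assumes only that the hash covers both a producer identity and the rows.

```python
import hashlib
import json


def content_hash(identity_structure: tuple, rows: list) -> str:
    """Toy hash over (producer identity, data); an assumed model, not orcapod's."""
    payload = json.dumps(
        {"identity": list(identity_structure), "rows": rows}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


rows = [{"id": 1}, {"id": 2}]

# Same rows, different producers: a raw source versus a no-op filter over it.
raw_hash = content_hash(("source",), rows)
filtered_hash = content_hash(("source", "noop_filter"), rows)

# The data-level check the test switched to: row count and column presence.
same_shape = len(rows) == 2 and all("id" in row for row in rows)
```

In this model `raw_hash` and `filtered_hash` differ even though the rows are equal, which is why an equality assertion on content hashes is the wrong check for a no-op filter and a row-count and column-presence check is the right one.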
Replace combine_chunks().storage with per-chunk .storage iteration to rebuild a ChunkedArray. combine_chunks() allocates new buffers when a column has more than one chunk, contradicting the zero-copy guarantee documented in the function's docstring. Each ExtensionArray chunk's .storage property is a true zero-copy view of the underlying Arrow buffers; building a ChunkedArray from those storage chunks avoids any buffer allocation while preserving the original chunk structure. Update the docstring to accurately reflect the chunked approach. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Review response — Copilot comment on
    # combine_chunks() returns a single ExtensionArray; .storage is a
    # zero-copy view of the underlying buffers with the storage type.
    storage_arr = table.column(i).combine_chunks().storage
    serialized = ext_type.__arrow_ext_serialize__()
…processing

In StarfixArrowHasher._process_table_columns, the short-circuit path appends table.column(i), which is a pa.ChunkedArray, not a pa.Array; fix the annotation to list[pa.Array | pa.ChunkedArray]. In normalize_extension_columns, all appended items are pa.ChunkedArray (both the storage ChunkedArray built from extension chunks and the passthrough table.column(i) call); add explicit annotations list[pa.ChunkedArray] and list[pa.Field] to make the types clear.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Review response (round 2)

Comment 1:
Review response (round 3)

Copilot raised no new comments this round; all prior review issues have been fully addressed. Summary of changes across all review rounds:

The PR is ready for human review.
        return arrow_type


    def normalize_extension_columns(table: "pa.Table") -> "pa.Table":

add thorough unit tests for normalize_extension_columns
…lumns

14 tests covering every documented property of the function:
- fast-path identity return when no extension columns are present
- storage type substitution for extension columns
- ARROW:extension:name and ARROW:extension:metadata written to field metadata
- __arrow_ext_serialize__ output round-trips correctly through the metadata
- data values preserved after normalization
- non-extension columns pass through unchanged in mixed tables
- column count stability
- schema-level metadata preserved
- pre-existing per-field metadata preserved alongside new ARROW:extension:* keys
- nullable=False and nullable=True both preserved
- multi-chunk column: data values correct after normalization
- multi-chunk column: chunk count preserved (verifies the zero-copy guarantee)
- multiple extension columns of different types all normalized independently

Uses two self-contained test-only pa.ExtensionType subclasses (_TestIntExt, _TestBinaryExt) registered at module import time to keep these tests free of the orcapod type-converter machinery.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Review response (round 4)

Comment: add thorough unit tests for `normalize_extension_columns`

Added in commit 687ec63: 14 unit tests in a new `TestNormalizeExtensionColumns` class in `tests/test_utils/test_arrow_utils.py`. Coverage:

Uses two self-contained test-only `pa.ExtensionType` subclasses (`_TestIntExt`, `_TestBinaryExt`) registered at module import time, keeping these tests independent of the orcapod type-converter. All 55 tests in the file pass (14 new + 41 pre-existing).
Review response (round 5)

The latest review (eywalker, approved) contains no new inline comments; all prior review feedback has been fully addressed. The PR is approved and ready to merge at your discretion. Summary of all changes made during review:

Summary
- `StarfixArrowHasher._process_table_columns` left live `pa.ExtensionType` columns intact after the `SemanticHashingVisitor` passthrough, causing `ArrowDigester` to crash with `TypeError: unhashable type` (extension types can't be dict keys in starfix's `_primitive_data_type_string`).
- `PydanticLogicalType` and `DataclassLogicalType` both called `make_polars_extension_type()` without the `metadata=` argument, so `pl.DataFrame(table).to_arrow()` raised `ValueError` when `__arrow_ext_deserialize__` received `b''` instead of the expected category bytes.

Changes

- `src/orcapod/utils/arrow_utils.py`: `normalize_extension_columns(table)` utility, which converts top-level extension-typed columns to IPC storage representation (`ExtensionArray.storage` + `ARROW:extension:*` field metadata) with no Python materialization
- `src/orcapod/hashing/arrow_hashers.py`: `_process_table_columns` calls `normalize_extension_columns` after the visitor loop so `ArrowDigester` never sees a live `pa.ExtensionType`
- `src/orcapod/extension_types/pydantic_logical_type_factory.py`: pass `metadata=json.dumps({"category": PYDANTIC_CATEGORY})` to `make_polars_extension_type()`
- `src/orcapod/extension_types/dataclass_logical_type_factory.py`: same, with `DATACLASS_CATEGORY`
- `tests/test_hashing/test_pydantic_dataclass_hashing.py`
- `DESIGN_ISSUES.md`: `to_pylist()` roundtrip is wasteful for passthrough extension columns
- `superpowers/specs/…`

Test plan

- `uv run pytest tests/test_hashing/test_pydantic_dataclass_hashing.py`: 10/10 pass
- `uv run pytest tests/test_hashing/`: 338/338 pass

Follow-up

ITL-433 tracks short-circuiting the `to_pylist()` roundtrip for extension columns with no registered handler (the efficiency improvement deferred from this PR).

Closes ITL-432