fix(hashing): pydantic/dataclass columns as pipeline columns (ITL-432) by kurodo3[bot] · Pull Request #185 · nauticalab/orcapod-python

kurodo3 · 2026-06-27T01:10:27Z

Summary

Bug A: StarfixArrowHasher._process_table_columns left live pa.ExtensionType columns intact after the SemanticHashingVisitor passthrough, causing ArrowDigester to crash with TypeError: unhashable type (extension types can't be dict keys in starfix's _primitive_data_type_string).
Bug B: PydanticLogicalType and DataclassLogicalType both called make_polars_extension_type() without the metadata= argument, so pl.DataFrame(table).to_arrow() raised ValueError when __arrow_ext_deserialize__ received b'' instead of the expected category bytes.

Changes

File	Change
`src/orcapod/utils/arrow_utils.py`	New `normalize_extension_columns(table)` utility — converts top-level extension-typed columns to IPC storage representation (`ExtensionArray.storage` + `ARROW:extension:*` field metadata), no Python materialization
`src/orcapod/hashing/arrow_hashers.py`	`_process_table_columns` calls `normalize_extension_columns` after the visitor loop so `ArrowDigester` never sees a live `pa.ExtensionType`
`src/orcapod/extension_types/pydantic_logical_type_factory.py`	Pass `metadata=json.dumps({"category": PYDANTIC_CATEGORY})` to `make_polars_extension_type()`
`src/orcapod/extension_types/dataclass_logical_type_factory.py`	Same fix with `DATACLASS_CATEGORY`
`tests/test_hashing/test_pydantic_dataclass_hashing.py`	10 regression tests: Bug A (no TypeError), Bug B (no ValueError, hash stable through Polars round-trip)
`DESIGN_ISSUES.md`	H5: efficiency follow-up (ITL-433) — `to_pylist()` roundtrip is wasteful for passthrough extension columns
`superpowers/specs/…`	Revised spec reflecting final design (no semantic handlers; IPC normalization approach)

Test plan

uv run pytest tests/test_hashing/test_pydantic_dataclass_hashing.py — 10/10 pass
uv run pytest tests/test_hashing/ — 338/338 pass

Follow-up

ITL-433 tracks short-circuiting the to_pylist() roundtrip for extension columns with no registered handler (the efficiency improvement deferred from this PR).

Closes ITL-432

… fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…s Arrow extension types Bug A: StarfixArrowHasher._process_table_columns left live pa.ExtensionType columns intact after the SemanticHashingVisitor passthrough. ArrowDigester crashed with TypeError (unhashable type) because extension types are used as dict keys in starfix's _primitive_data_type_string. Fix: add normalize_extension_columns() to arrow_utils.py — converts any top-level extension-typed column to IPC storage representation (storage type + ARROW:extension:* field metadata) using ExtensionArray.storage with no Python materialization. _process_table_columns calls it after the visitor loop so ArrowDigester never sees a live pa.ExtensionType. Bug B: PydanticLogicalType and DataclassLogicalType both called make_polars_extension_type() without the metadata= argument. The Polars extension type carried empty metadata, so pl.DataFrame(table).to_arrow() passed b'' to __arrow_ext_deserialize__, which expected the category bytes and raised ValueError. Fix: pass metadata=json.dumps({"category": ...}) to make_polars_extension_type() in both __init__ methods so the Polars type carries the same metadata string as the Arrow extension type. Also logs the efficiency follow-up (ITL-433) in DESIGN_ISSUES.md: the to_pylist() roundtrip is wasteful for passthrough extension columns and should be short-circuited in a future refactor. Closes ITL-432 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov · 2026-06-27T01:11:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copilot

Pull request overview

This PR fixes hashing and round-trip stability for Arrow extension-typed columns representing Pydantic models and Python dataclasses, ensuring StarfixArrowHasher can safely hash tables even when extension columns pass through semantic hashing unchanged.

Changes:

Add normalize_extension_columns(table) to convert top-level pa.ExtensionType columns into IPC/Parquet-style storage arrays plus ARROW:extension:* field metadata.
Update StarfixArrowHasher._process_table_columns to normalize remaining extension columns after the semantic visitor, preventing ArrowDigester crashes on live extension types.
Fix Polars extension type synthesis for Pydantic/Dataclass logical types by passing category metadata so pl.DataFrame(...).to_arrow() doesn’t fail and hashes remain stable across a Polars round-trip.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`src/orcapod/utils/arrow_utils.py`	Adds `normalize_extension_columns()` utility for IPC-style extension normalization.
`src/orcapod/hashing/arrow_hashers.py`	Normalizes extension columns after semantic processing to avoid starfix hashing failures.
`src/orcapod/extension_types/pydantic_logical_type_factory.py`	Passes Polars extension metadata to preserve category on round-trip.
`src/orcapod/extension_types/dataclass_logical_type_factory.py`	Same Polars metadata fix for dataclass extension types.
`tests/test_hashing/test_pydantic_dataclass_hashing.py`	Adds regression coverage for hashing + Polars round-trip for both types.
`DESIGN_ISSUES.md`	Documents a follow-up (ITL-433) about avoiding unnecessary Python materialization.
`superpowers/specs/2026-06-27-itl-432-pydantic-dataclass-pipeline-columns-design.md`	Captures the final design rationale and the two addressed bugs.
`CLAUDE.md`	Updates contributor guidance on PR target branch.
`.zed/rules`	Mirrors the PR target branch guidance for Zed tooling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+            # combine_chunks() returns a single ExtensionArray; .storage is a
+            # zero-copy view of the underlying buffers with the storage type.
+            storage_arr = table.column(i).combine_chunks().storage
+            serialized = ext_type.__arrow_ext_serialize__()


- Pre-register _Cfg and _Run with the type converter before creating DictSource instances; DictSource uses the default context's converter to build Arrow schemas from data_schema, so types must be registered first. - Replace incorrect content_hash() equality assertion in the no-op filter test with a row-count and column-presence check; filtered and raw streams differ in identity_structure (different producers), so their content hashes are intentionally different even with identical data. - Fix src.process() → src.content_hash() in DictSource tests; DictSource implements StreamProtocol directly and has no process() method. All 23 regression tests now pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace combine_chunks().storage with per-chunk .storage iteration to rebuild a ChunkedArray. combine_chunks() allocates new buffers when a column has more than one chunk, contradicting the zero-copy guarantee documented in the function's docstring. Each ExtensionArray chunk's .storage property is a true zero-copy view of the underlying Arrow buffers; building a ChunkedArray from those storage chunks avoids any buffer allocation while preserving the original chunk structure. Update the docstring to accurately reflect the chunked approach. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 · 2026-06-27T01:29:59Z

Review response — Copilot comment on `normalize_extension_columns`

Fixed in commit 5fa019a.

Replaced combine_chunks().storage with per-chunk .storage iteration that rebuilds a ChunkedArray:

col = table.column(i)
storage_arr = pa.chunked_array(
    [chunk.storage for chunk in col.chunks],
    type=ext_type.storage_type,
)

Each ExtensionArray chunk's .storage property is a zero-copy view of the underlying Arrow buffers, so no buffer allocation occurs regardless of how many chunks the column has. The original chunking is also preserved as a side-effect. The docstring has been updated to accurately describe the chunked approach rather than claiming combine_chunks() is zero-copy.

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

+            # combine_chunks() returns a single ExtensionArray; .storage is a
+            # zero-copy view of the underlying buffers with the storage type.
+            storage_arr = table.column(i).combine_chunks().storage
+            serialized = ext_type.__arrow_ext_serialize__()


…processing In StarfixArrowHasher._process_table_columns, the short-circuit path appends table.column(i) which is pa.ChunkedArray, not pa.Array — fix the annotation to list[pa.Array | pa.ChunkedArray]. In normalize_extension_columns, all appended items are pa.ChunkedArray (both the storage ChunkedArray built from extension chunks and the passthrough table.column(i) call) — add explicit annotations list[pa.ChunkedArray] and list[pa.Field] to make the types clear. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 · 2026-06-27T01:33:15Z

Review response (round 2)

Comment 1 — `combine_chunks()` not zero-copy (comment 3484874937)

Already addressed in commit 5fa019a (round 1 response). The code now converts each chunk's .storage individually and rebuilds a ChunkedArray, so no buffer allocation occurs regardless of how many chunks a column has:

col = table.column(i)
storage_arr = pa.chunked_array(
    [chunk.storage for chunk in col.chunks],
    type=ext_type.storage_type,
)

The docstring was also updated in that same commit to accurately describe the chunked approach.

Comment 2 — `new_columns: list[pa.Array]` incorrect annotation (comment 3484874949)

Fixed in commit 5b8f6fd.

StarfixArrowHasher._process_table_columns (arrow_hashers.py): the short-circuit path appends table.column(i) which is a pa.ChunkedArray, while the normal path appends pa.array(...) which is a pa.Array. Updated:

# before
new_columns: list[pa.Array] = []
# after
new_columns: list[pa.Array | pa.ChunkedArray] = []

normalize_extension_columns (arrow_utils.py): all appended items are pa.ChunkedArray (both the storage ChunkedArray built from extension chunks and the passthrough table.column(i)). Added explicit annotations:

new_columns: list[pa.ChunkedArray] = []
new_fields: list[pa.Field] = []

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

kurodo3 · 2026-06-27T01:36:14Z

Review response (round 3)

Copilot raised no new comments this round — all prior review issues have been fully addressed. Summary of changes across all review rounds:

Round	Commit	Change
1	`5fa019af`	Replaced `combine_chunks().storage` with per-chunk `.storage` iteration in `normalize_extension_columns`; updated docstring
2	`5b8f6fd6`	Fixed `new_columns` type annotations: `list[pa.Array]` → `list[pa.Array \| pa.ChunkedArray]` in `_process_table_columns`; added explicit `list[pa.ChunkedArray]` + `list[pa.Field]` in `normalize_extension_columns`

The PR is ready for human review.

eywalker · 2026-06-27T01:52:22Z

    return arrow_type


+def normalize_extension_columns(table: "pa.Table") -> "pa.Table":


add thorough unit tests for normalize_extension_columns

…lumns 14 tests covering every documented property of the function: - fast-path identity return when no extension columns are present - storage type substitution for extension columns - ARROW:extension:name and ARROW:extension:metadata written to field metadata - __arrow_ext_serialize__ output round-trips correctly through the metadata - data values preserved after normalization - non-extension columns pass through unchanged in mixed tables - column count stability - schema-level metadata preserved - pre-existing per-field metadata preserved alongside new ARROW:extension:* keys - nullable=False and nullable=True both preserved - multi-chunk column: data values correct after normalization - multi-chunk column: chunk count preserved (verifies the zero-copy guarantee) - multiple extension columns of different types all normalized independently Uses two self-contained test-only pa.ExtensionType subclasses (_TestIntExt, _TestBinaryExt) registered at module import time to keep these tests free of the orcapod type-converter machinery. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kurodo3 · 2026-06-27T01:56:21Z

Review response (round 4)

Comment — add thorough unit tests for `normalize_extension_columns`

Added in commit 687ec63: 14 unit tests in a new `TestNormalizeExtensionColumns` class in `tests/test_utils/test_arrow_utils.py`.

Coverage:

Test	What it verifies
`test_no_extension_columns_returns_same_object`	Fast-path: identity return when no ext cols present
`test_extension_column_type_becomes_storage_type`	Extension type → storage type substitution
`test_extension_name_written_to_field_metadata`	`ARROW:extension:name` written correctly
`test_extension_metadata_matches_arrow_ext_serialize`	`ARROW:extension:metadata` equals `arrow_ext_serialize()` output
`test_data_values_preserved`	Storage values match original extension data
`test_non_extension_columns_pass_through_unchanged`	Plain columns in a mixed table unaffected
`test_column_count_unchanged`	Table has the same number of columns
`test_schema_level_metadata_preserved`	Schema-level metadata carried through
`test_existing_field_metadata_preserved`	Pre-existing per-field metadata preserved alongside new `ARROW:extension:*` keys
`test_nullable_false_preserved`	`nullable=False` flag preserved
`test_nullable_true_preserved`	`nullable=True` flag preserved
`test_multi_chunk_data_values_preserved`	Multi-chunk column: all data values correct
`test_multi_chunk_column_chunk_count_preserved`	Multi-chunk column: chunk count preserved (directly verifies the zero-copy guarantee)
`test_multiple_extension_columns_all_normalized`	Multiple extension columns of different types all independently normalized

Uses two self-contained test-only `pa.ExtensionType` subclasses (`_TestIntExt`, `_TestBinaryExt`) registered at module import time, keeping these tests independent of the orcapod type-converter.

All 55 tests in the file pass (14 new + 41 pre-existing).

kurodo3 · 2026-06-27T02:05:26Z

Review response (round 5)

The latest review (eywalker, approved) contains no new inline comments — all prior review feedback has been fully addressed. The PR is approved and ready to merge at your discretion.

Summary of all changes made during review:

Commit	Change
`5fa019af`	Replace `combine_chunks().storage` with per-chunk `.storage` iteration in `normalize_extension_columns`; update docstring
`5b8f6fd6`	Fix `new_columns` type annotations: `list[pa.Array
`687ec637`	Add 14 unit tests for `normalize_extension_columns` in `TestNormalizeExtensionColumns`

kurodo3 Bot and others added 2 commits June 27, 2026 00:24

docs(itl-432): add design spec for pydantic/dataclass pipeline column…

331783a

… fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eywalker requested review from brian-arnold, Copilot and eywalker June 27, 2026 01:13

Copilot started reviewing on behalf of eywalker June 27, 2026 01:14 View session

Copilot AI reviewed Jun 27, 2026

View reviewed changes

eywalker requested a review from Copilot June 27, 2026 01:28

Copilot started reviewing on behalf of eywalker June 27, 2026 01:29 View session

Copilot AI reviewed Jun 27, 2026

View reviewed changes

eywalker requested a review from Copilot June 27, 2026 01:32

Copilot started reviewing on behalf of eywalker June 27, 2026 01:33 View session

Copilot AI reviewed Jun 27, 2026

View reviewed changes

eywalker requested changes Jun 27, 2026

View reviewed changes

eywalker approved these changes Jun 27, 2026

View reviewed changes

eywalker merged commit fef6add into main Jun 27, 2026
11 checks passed

eywalker deleted the eywalker/itl-432-pydanticdataclass-models-cant-flow-through-pipelines-as branch June 27, 2026 02:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(hashing): pydantic/dataclass columns as pipeline columns (ITL-432)#185

fix(hashing): pydantic/dataclass columns as pipeline columns (ITL-432)#185
eywalker merged 6 commits into
mainfrom
eywalker/itl-432-pydanticdataclass-models-cant-flow-through-pipelines-as

kurodo3 Bot commented Jun 27, 2026

Uh oh!

codecov Bot commented Jun 27, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

kurodo3 Bot commented Jun 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

kurodo3 Bot commented Jun 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

kurodo3 Bot commented Jun 27, 2026

Uh oh!

eywalker Jun 27, 2026

Uh oh!

kurodo3 Bot commented Jun 27, 2026

Uh oh!

Uh oh!

kurodo3 Bot commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		return arrow_type


		def normalize_extension_columns(table: "pa.Table") -> "pa.Table":

Uh oh!

Conversation

kurodo3 Bot commented Jun 27, 2026

Summary

Changes

Test plan

Follow-up

Uh oh!

codecov Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

kurodo3 Bot commented Jun 27, 2026

Review response — Copilot comment on normalize_extension_columns

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

kurodo3 Bot commented Jun 27, 2026

Review response (round 2)

Comment 1 — combine_chunks() not zero-copy (comment 3484874937)

Comment 2 — new_columns: list[pa.Array] incorrect annotation (comment 3484874949)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

kurodo3 Bot commented Jun 27, 2026

Review response (round 3)

Uh oh!

eywalker Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

kurodo3 Bot commented Jun 27, 2026

Review response (round 4)

Comment — add thorough unit tests for `normalize_extension_columns`

Uh oh!

Uh oh!

kurodo3 Bot commented Jun 27, 2026

Review response (round 5)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jun 27, 2026 •

edited

Loading

Review response — Copilot comment on `normalize_extension_columns`

Comment 1 — `combine_chunks()` not zero-copy (comment 3484874937)

Comment 2 — `new_columns: list[pa.Array]` incorrect annotation (comment 3484874949)