Skip to content

fix(hashing): pydantic/dataclass columns as pipeline columns (ITL-432)#185

Merged
eywalker merged 6 commits into
mainfrom
eywalker/itl-432-pydanticdataclass-models-cant-flow-through-pipelines-as
Jun 27, 2026
Merged

fix(hashing): pydantic/dataclass columns as pipeline columns (ITL-432)#185
eywalker merged 6 commits into
mainfrom
eywalker/itl-432-pydanticdataclass-models-cant-flow-through-pipelines-as

Conversation

@kurodo3

@kurodo3 kurodo3 Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Bug A: StarfixArrowHasher._process_table_columns left live pa.ExtensionType columns intact after the SemanticHashingVisitor passthrough, causing ArrowDigester to crash with TypeError: unhashable type (extension types can't be dict keys in starfix's _primitive_data_type_string).
  • Bug B: PydanticLogicalType and DataclassLogicalType both called make_polars_extension_type() without the metadata= argument, so pl.DataFrame(table).to_arrow() raised ValueError when __arrow_ext_deserialize__ received b'' instead of the expected category bytes.

Changes

File Change
src/orcapod/utils/arrow_utils.py New normalize_extension_columns(table) utility — converts top-level extension-typed columns to IPC storage representation (ExtensionArray.storage + ARROW:extension:* field metadata), no Python materialization
src/orcapod/hashing/arrow_hashers.py _process_table_columns calls normalize_extension_columns after the visitor loop so ArrowDigester never sees a live pa.ExtensionType
src/orcapod/extension_types/pydantic_logical_type_factory.py Pass metadata=json.dumps({"category": PYDANTIC_CATEGORY}) to make_polars_extension_type()
src/orcapod/extension_types/dataclass_logical_type_factory.py Same fix with DATACLASS_CATEGORY
tests/test_hashing/test_pydantic_dataclass_hashing.py 10 regression tests: Bug A (no TypeError), Bug B (no ValueError, hash stable through Polars round-trip)
DESIGN_ISSUES.md H5: efficiency follow-up (ITL-433) — to_pylist() roundtrip is wasteful for passthrough extension columns
superpowers/specs/… Revised spec reflecting final design (no semantic handlers; IPC normalization approach)

Test plan

  • uv run pytest tests/test_hashing/test_pydantic_dataclass_hashing.py — 10/10 pass
  • uv run pytest tests/test_hashing/ — 338/338 pass

Follow-up

ITL-433 tracks short-circuiting the to_pylist() roundtrip for extension columns with no registered handler (the efficiency improvement deferred from this PR).

Closes ITL-432

kurodo3 Bot and others added 2 commits June 27, 2026 00:24
… fixes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s Arrow extension types

Bug A: StarfixArrowHasher._process_table_columns left live pa.ExtensionType
columns intact after the SemanticHashingVisitor passthrough.  ArrowDigester
crashed with TypeError (unhashable type) because extension types are used as
dict keys in starfix's _primitive_data_type_string.

Fix: add normalize_extension_columns() to arrow_utils.py — converts any
top-level extension-typed column to IPC storage representation (storage type +
ARROW:extension:* field metadata) using ExtensionArray.storage with no Python
materialization.  _process_table_columns calls it after the visitor loop so
ArrowDigester never sees a live pa.ExtensionType.

Bug B: PydanticLogicalType and DataclassLogicalType both called
make_polars_extension_type() without the metadata= argument.  The Polars
extension type carried empty metadata, so pl.DataFrame(table).to_arrow()
passed b'' to __arrow_ext_deserialize__, which expected the category bytes
and raised ValueError.

Fix: pass metadata=json.dumps({"category": ...}) to make_polars_extension_type()
in both __init__ methods so the Polars type carries the same metadata string
as the Arrow extension type.

Also logs the efficiency follow-up (ITL-433) in DESIGN_ISSUES.md: the
to_pylist() roundtrip is wasteful for passthrough extension columns and
should be short-circuited in a future refactor.

Closes ITL-432

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 27, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes hashing and round-trip stability for Arrow extension-typed columns representing Pydantic models and Python dataclasses, ensuring StarfixArrowHasher can safely hash tables even when extension columns pass through semantic hashing unchanged.

Changes:

  • Add normalize_extension_columns(table) to convert top-level pa.ExtensionType columns into IPC/Parquet-style storage arrays plus ARROW:extension:* field metadata.
  • Update StarfixArrowHasher._process_table_columns to normalize remaining extension columns after the semantic visitor, preventing ArrowDigester crashes on live extension types.
  • Fix Polars extension type synthesis for Pydantic/Dataclass logical types by passing category metadata so pl.DataFrame(...).to_arrow() doesn’t fail and hashes remain stable across a Polars round-trip.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/orcapod/utils/arrow_utils.py Adds normalize_extension_columns() utility for IPC-style extension normalization.
src/orcapod/hashing/arrow_hashers.py Normalizes extension columns after semantic processing to avoid starfix hashing failures.
src/orcapod/extension_types/pydantic_logical_type_factory.py Passes Polars extension metadata to preserve category on round-trip.
src/orcapod/extension_types/dataclass_logical_type_factory.py Same Polars metadata fix for dataclass extension types.
tests/test_hashing/test_pydantic_dataclass_hashing.py Adds regression coverage for hashing + Polars round-trip for both types.
DESIGN_ISSUES.md Documents a follow-up (ITL-433) about avoiding unnecessary Python materialization.
superpowers/specs/2026-06-27-itl-432-pydantic-dataclass-pipeline-columns-design.md Captures the final design rationale and the two addressed bugs.
CLAUDE.md Updates contributor guidance on PR target branch.
.zed/rules Mirrors the PR target branch guidance for Zed tooling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/orcapod/utils/arrow_utils.py Outdated
Comment on lines +329 to +332
# combine_chunks() returns a single ExtensionArray; .storage is a
# zero-copy view of the underlying buffers with the storage type.
storage_arr = table.column(i).combine_chunks().storage
serialized = ext_type.__arrow_ext_serialize__()
- Pre-register _Cfg and _Run with the type converter before creating
  DictSource instances; DictSource uses the default context's converter
  to build Arrow schemas from data_schema, so types must be registered
  first.
- Replace incorrect content_hash() equality assertion in the no-op
  filter test with a row-count and column-presence check; filtered and
  raw streams differ in identity_structure (different producers), so
  their content hashes are intentionally different even with identical
  data.
- Fix src.process() → src.content_hash() in DictSource tests; DictSource
  implements StreamProtocol directly and has no process() method.

All 23 regression tests now pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace combine_chunks().storage with per-chunk .storage iteration
to rebuild a ChunkedArray. combine_chunks() allocates new buffers
when a column has more than one chunk, contradicting the zero-copy
guarantee documented in the function's docstring.

Each ExtensionArray chunk's .storage property is a true zero-copy
view of the underlying Arrow buffers; building a ChunkedArray from
those storage chunks avoids any buffer allocation while preserving
the original chunk structure.

Update the docstring to accurately reflect the chunked approach.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Review response — Copilot comment on normalize_extension_columns

Fixed in commit 5fa019a.

Replaced combine_chunks().storage with per-chunk .storage iteration that rebuilds a ChunkedArray:

col = table.column(i)
storage_arr = pa.chunked_array(
    [chunk.storage for chunk in col.chunks],
    type=ext_type.storage_type,
)

Each ExtensionArray chunk's .storage property is a zero-copy view of the underlying Arrow buffers, so no buffer allocation occurs regardless of how many chunks the column has. The original chunking is also preserved as a side-effect. The docstring has been updated to accurately describe the chunked approach rather than claiming combine_chunks() is zero-copy.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comment thread src/orcapod/utils/arrow_utils.py Outdated
Comment on lines +329 to +332
# combine_chunks() returns a single ExtensionArray; .storage is a
# zero-copy view of the underlying buffers with the storage type.
storage_arr = table.column(i).combine_chunks().storage
serialized = ext_type.__arrow_ext_serialize__()
Comment thread src/orcapod/hashing/arrow_hashers.py Outdated
…processing

In StarfixArrowHasher._process_table_columns, the short-circuit path
appends table.column(i) which is pa.ChunkedArray, not pa.Array — fix
the annotation to list[pa.Array | pa.ChunkedArray].

In normalize_extension_columns, all appended items are pa.ChunkedArray
(both the storage ChunkedArray built from extension chunks and the
passthrough table.column(i) call) — add explicit annotations
list[pa.ChunkedArray] and list[pa.Field] to make the types clear.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Review response (round 2)

Comment 1 — combine_chunks() not zero-copy (comment 3484874937)

Already addressed in commit 5fa019a (round 1 response). The code now converts each chunk's .storage individually and rebuilds a ChunkedArray, so no buffer allocation occurs regardless of how many chunks a column has:

col = table.column(i)
storage_arr = pa.chunked_array(
    [chunk.storage for chunk in col.chunks],
    type=ext_type.storage_type,
)

The docstring was also updated in that same commit to accurately describe the chunked approach.


Comment 2 — new_columns: list[pa.Array] incorrect annotation (comment 3484874949)

Fixed in commit 5b8f6fd.

StarfixArrowHasher._process_table_columns (arrow_hashers.py): the short-circuit path appends table.column(i) which is a pa.ChunkedArray, while the normal path appends pa.array(...) which is a pa.Array. Updated:

# before
new_columns: list[pa.Array] = []
# after
new_columns: list[pa.Array | pa.ChunkedArray] = []

normalize_extension_columns (arrow_utils.py): all appended items are pa.ChunkedArray (both the storage ChunkedArray built from extension chunks and the passthrough table.column(i)). Added explicit annotations:

new_columns: list[pa.ChunkedArray] = []
new_fields: list[pa.Field] = []

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

@kurodo3

kurodo3 Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Review response (round 3)

Copilot raised no new comments this round — all prior review issues have been fully addressed. Summary of changes across all review rounds:

Round Commit Change
1 5fa019af Replaced combine_chunks().storage with per-chunk .storage iteration in normalize_extension_columns; updated docstring
2 5b8f6fd6 Fixed new_columns type annotations: list[pa.Array]list[pa.Array | pa.ChunkedArray] in _process_table_columns; added explicit list[pa.ChunkedArray] + list[pa.Field] in normalize_extension_columns

The PR is ready for human review.

return arrow_type


def normalize_extension_columns(table: "pa.Table") -> "pa.Table":

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add thorough unit tests for normalize_extension_columns

…lumns

14 tests covering every documented property of the function:
- fast-path identity return when no extension columns are present
- storage type substitution for extension columns
- ARROW:extension:name and ARROW:extension:metadata written to field metadata
- __arrow_ext_serialize__ output round-trips correctly through the metadata
- data values preserved after normalization
- non-extension columns pass through unchanged in mixed tables
- column count stability
- schema-level metadata preserved
- pre-existing per-field metadata preserved alongside new ARROW:extension:* keys
- nullable=False and nullable=True both preserved
- multi-chunk column: data values correct after normalization
- multi-chunk column: chunk count preserved (verifies the zero-copy guarantee)
- multiple extension columns of different types all normalized independently

Uses two self-contained test-only pa.ExtensionType subclasses
(_TestIntExt, _TestBinaryExt) registered at module import time to
keep these tests free of the orcapod type-converter machinery.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Review response (round 4)

Comment — add thorough unit tests for `normalize_extension_columns`

Added in commit 687ec63: 14 unit tests in a new `TestNormalizeExtensionColumns` class in `tests/test_utils/test_arrow_utils.py`.

Coverage:

Test What it verifies
`test_no_extension_columns_returns_same_object` Fast-path: identity return when no ext cols present
`test_extension_column_type_becomes_storage_type` Extension type → storage type substitution
`test_extension_name_written_to_field_metadata` `ARROW:extension:name` written correctly
`test_extension_metadata_matches_arrow_ext_serialize` `ARROW:extension:metadata` equals `arrow_ext_serialize()` output
`test_data_values_preserved` Storage values match original extension data
`test_non_extension_columns_pass_through_unchanged` Plain columns in a mixed table unaffected
`test_column_count_unchanged` Table has the same number of columns
`test_schema_level_metadata_preserved` Schema-level metadata carried through
`test_existing_field_metadata_preserved` Pre-existing per-field metadata preserved alongside new `ARROW:extension:*` keys
`test_nullable_false_preserved` `nullable=False` flag preserved
`test_nullable_true_preserved` `nullable=True` flag preserved
`test_multi_chunk_data_values_preserved` Multi-chunk column: all data values correct
`test_multi_chunk_column_chunk_count_preserved` Multi-chunk column: chunk count preserved (directly verifies the zero-copy guarantee)
`test_multiple_extension_columns_all_normalized` Multiple extension columns of different types all independently normalized

Uses two self-contained test-only `pa.ExtensionType` subclasses (`_TestIntExt`, `_TestBinaryExt`) registered at module import time, keeping these tests independent of the orcapod type-converter.

All 55 tests in the file pass (14 new + 41 pre-existing).

@eywalker eywalker merged commit fef6add into main Jun 27, 2026
11 checks passed
@eywalker eywalker deleted the eywalker/itl-432-pydanticdataclass-models-cant-flow-through-pipelines-as branch June 27, 2026 02:01
@kurodo3

kurodo3 Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor Author

Review response (round 5)

The latest review (eywalker, approved) contains no new inline comments — all prior review feedback has been fully addressed. The PR is approved and ready to merge at your discretion.

Summary of all changes made during review:

Commit Change
5fa019af Replace combine_chunks().storage with per-chunk .storage iteration in normalize_extension_columns; update docstring
5b8f6fd6 Fix new_columns type annotations: `list[pa.Array
687ec637 Add 14 unit tests for normalize_extension_columns in TestNormalizeExtensionColumns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants