feat(extension-types): Arrow/Polars extension type semantic type system (PLT-1663)#183

Merged: eywalker merged 206 commits into main from extension-type-system on Jun 25, 2026

kurodo3 Bot (Contributor) commented Jun 25, 2026

Summary

This PR merges the extension-type-system integration branch into main, completing the Arrow/Polars Extension Type Semantic Type System project.

All sub-issue PRs targeted the extension-type-system integration branch and have already been merged there. This is the single gate PR that lands the full system on main.

What's included

New extension type infrastructure (src/orcapod/extension_types/)

  • protocols.py: LogicalType protocol (Arrow I/O contract: extension_name, extension_metadata, storage_type, python_type, python_to_storage, storage_to_python, get_polars_extension_type); replaces the old ExtensionTypeConverter prototype
  • registry.py: LogicalTypeRegistry with per-process cache, factory dispatch, PyArrow + Polars registration, and the peek-schema → register → read pattern
  • schema_walker.py: recursive Arrow schema walker that discovers extension fields in nested/map/list schemas
  • builtin_logical_types.py: LogicalPath, LogicalUPath, LogicalUUID registered under the orcapod.* namespace
  • dataclass_logical_type_factory.py: DataclassLogicalTypeFactory / DataclassLogicalType for automatic dataclass ↔ Arrow struct round-trips
  • pydantic_logical_type_factory.py: PydanticLogicalTypeFactory / PydanticLogicalType (pydantic is now a required dependency)
  • database_hooks.py: ensure_extensions_registered hook for read paths
  • type_utils.py: FQCN import helpers
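
As a rough illustration of the LogicalType contract listed above, the following sketch spells out the seven members as a typing.Protocol and pairs it with a toy UUID implementation. Only the member names come from this PR description. The signatures, the "orcapod.uuid" name, the use of a plain string in place of a real Arrow storage type, and the SketchUUIDType class are all assumptions made for illustration, not the code in protocols.py.

```python
from __future__ import annotations

import uuid
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class LogicalType(Protocol):
    """Sketch of the Arrow I/O contract; only the member names are from the PR."""

    extension_name: str
    extension_metadata: bytes
    storage_type: Any  # would be a pyarrow.DataType in the real system
    python_type: type

    def python_to_storage(self, value: Any) -> Any: ...

    def storage_to_python(self, value: Any) -> Any: ...

    def get_polars_extension_type(self) -> Any: ...


class SketchUUIDType:
    """Hypothetical logical type that stores a UUID as its 16 raw bytes."""

    extension_name = "orcapod.uuid"  # assumed name under the orcapod.* namespace
    extension_metadata = b""
    storage_type = "fixed_size_binary(16)"  # placeholder for a real Arrow type
    python_type = uuid.UUID

    def python_to_storage(self, value: uuid.UUID) -> bytes:
        return value.bytes

    def storage_to_python(self, value: bytes) -> uuid.UUID:
        return uuid.UUID(bytes=value)

    def get_polars_extension_type(self) -> Any:
        # The real method would return a Polars extension type; omitted here.
        return None


logical = SketchUUIDType()
original = uuid.UUID("12345678-1234-5678-1234-567812345678")
round_tripped = logical.storage_to_python(logical.python_to_storage(original))
```

The round trip through python_to_storage and storage_to_python is the property the Parquet and Delta Lake integration tests exercise end to end.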

System integration

  • UniversalTypeConverter: register_python_class, register_storage_type, python_to_storage, storage_to_python; extension-type identity takes priority over shape-based dispatch
  • DataContext / v0.1 context: LogicalTypeRegistry wired in; DataclassLogicalTypeFactory and PydanticLogicalTypeFactory auto-registered
  • FunctionPod.__init__: write-side auto-registration trigger
  • ConnectorArrowDatabase / DeltaTableDatabase: ensure_extensions_registered called on every read
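
The read-side flow (peek the schema, register whatever extension types it names, then read) combined with a per-process cache and factory dispatch might look roughly like the sketch below. SketchRegistry, register_factory, ensure_registered, read_table, and the prefix-based dispatch rule are hypothetical stand-ins. The PR does not show the real LogicalTypeRegistry API, and the real hook works on Arrow schemas rather than a list of names.

```python
from __future__ import annotations

from typing import Callable


class SketchRegistry:
    """Hypothetical per-process registry with prefix-based factory dispatch."""

    def __init__(self) -> None:
        self._factories: dict[str, Callable[[str], object]] = {}
        self._cache: dict[str, object] = {}
        self.factory_calls = 0  # counts how often a factory actually ran

    def register_factory(self, prefix: str, factory: Callable[[str], object]) -> None:
        self._factories[prefix] = factory

    def ensure_registered(self, extension_name: str) -> object:
        # Cache hit: this name was already built earlier in this process.
        if extension_name in self._cache:
            return self._cache[extension_name]
        # Cache miss: dispatch to the first factory whose prefix matches.
        for prefix, factory in self._factories.items():
            if extension_name.startswith(prefix):
                self.factory_calls += 1
                logical_type = factory(extension_name)
                self._cache[extension_name] = logical_type
                return logical_type
        raise KeyError(f"no factory for extension type {extension_name!r}")


def read_table(registry: SketchRegistry, schema_extension_names: list[str], rows: list) -> list:
    """Peek the schema, register what it names, then read."""
    for name in schema_extension_names:  # 1. peek schema
        registry.ensure_registered(name)  # 2. register
    return list(rows)  # 3. read (stand-in for the real reader)


registry = SketchRegistry()
registry.register_factory("dataclass.", lambda name: ("dataclass-type", name))

first_read = read_table(registry, ["dataclass.myapp.Point"], [{"x": 1}])
second_read = read_table(registry, ["dataclass.myapp.Point"], [{"x": 2}])
```

Because the second read hits the cache, the factory runs only once per process even though the hook is called on every read.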

Hard cut (PLT-1660)

  • Deleted SemanticTypeRegistry and the old shape-based struct type system entirely
  • Renamed BaseSemanticHasher → SemanticAwarePythonHasher, TypeHandlerProtocol → PythonTypeHandler, PythonTypeSemanticHasherRegistry → PythonTypeHandlerRegistry
  • SemanticHashingVisitor rewritten for extension-type dispatch (visits extension nodes, not struct shapes)
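
To make the dispatch-order point concrete, here is a minimal sketch of a visitor that checks for an extension type before looking at struct shape. The StructType and ExtensionType dataclasses are toy stand-ins for Arrow types, and SketchVisitor and NaiveVisitor are hypothetical. The sketch only illustrates why an extension type backed by struct storage has to be tested first.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StructType:
    field_names: tuple[str, ...]


@dataclass(frozen=True)
class ExtensionType:
    name: str
    storage: StructType


def storage_of(arrow_type):
    """Look through an extension type to the type that actually stores it."""
    return arrow_type.storage if isinstance(arrow_type, ExtensionType) else arrow_type


class SketchVisitor:
    def visit(self, arrow_type, value):
        # Identity first: an extension type wins even if its storage is a struct.
        if isinstance(arrow_type, ExtensionType):
            return self.visit_extension(arrow_type, value)
        if isinstance(storage_of(arrow_type), StructType):
            return self.visit_struct(arrow_type, value)
        return ("primitive", value)

    def visit_extension(self, arrow_type, value):
        return ("extension", arrow_type.name)

    def visit_struct(self, arrow_type, value):
        return ("struct", storage_of(arrow_type).field_names)


class NaiveVisitor(SketchVisitor):
    """Shape-first ordering, shown only to demonstrate the failure mode."""

    def visit(self, arrow_type, value):
        if isinstance(storage_of(arrow_type), StructType):
            return self.visit_struct(arrow_type, value)
        return super().visit(arrow_type, value)


point_ext = ExtensionType("example.point", StructType(("x", "y")))
identity_first = SketchVisitor().visit(point_ext, {"x": 1, "y": 2})
shape_first = NaiveVisitor().visit(point_ext, {"x": 1, "y": 2})
```

With the shape-first ordering the extension type is swallowed by the struct branch and its name never reaches the hasher.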

Public API additions

  • orcapod.Path, orcapod.UPath, orcapod.UUID — stable top-level aliases
  • load_extension_types() convenience method on UniversalTypeConverter

Code audit

Per PLT-1663 success criteria — no old naming survives in production code:

| Symbol | Status |
| --- | --- |
| ExtensionTypeConverter | Deleted; replaced by LogicalType protocol |
| ExtensionTypeRegistry | Deleted; replaced by LogicalTypeRegistry |
| SemanticTypeRegistry | Deleted; only appears in v0.1.json changelog comment |
| BaseSemanticHasher | Renamed; only appears in v0.1.json changelog comment |
| Shape-based dispatch code | Removed; only explanatory comments remain |

Tests

  • Unit tests for all new modules (registry, schema_walker, builtin types, factories)
  • Per-process cache behaviour tests
  • Schema compatibility tests
  • Parquet and Delta Lake end-to-end round-trip integration tests
  • Write-side registration tests
  • Default context auto-registration tests

CI status

All standard CI checks pass (unit tests on Python 3.11 and 3.12, license check).
The spiral-integration check is expected to pass now that the branch has been rebased
onto main (picking up PLT-1773: pyspiral 0.11.7 → 0.14.9 upgrade).

Related issues

Closes PLT-1663
Closes PLT-1652
Closes PLT-1653
Closes PLT-1654
Closes PLT-1655
Closes PLT-1656
Closes PLT-1657
Closes PLT-1659
Closes PLT-1660
Closes PLT-1661
Closes PLT-1662
Closes PLT-1668
Closes PLT-1670
Closes PLT-1672
Closes PLT-1701
Closes PLT-1705
Closes PLT-1720
Closes PLT-1731

🤖 Generated with Claude Code

kurodo3 Bot and others added 30 commits June 25, 2026 03:45
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lars exc chain (PLT-1653)

- Change `from exc` to `from None` in _register_polars_ext_type to suppress
  internal Polars error details (matches Arrow helper pattern)
- Add docstring to _sanitize function documenting its purpose
- Replace all double-backtick RST notation with single-backtick Google style
  throughout registry.py (docstrings in functions and class methods)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…instance (PLT-1653)

Add public exports for ExtensionTypeRegistry and the module-level instance
extension_type_registry to the extension_types package. This enables users to
access the registry directly from orcapod.extension_types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…etadata collision test

- Add `from __future__ import annotations` to extension_types/__init__.py after module docstring
- Add missing test_register_polars_global_collision_different_metadata_raises test to match PyArrow test coverage

Closes PLT-1653
…larify deserialize semantics, use .storage API (PLT-1653)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ify dedup test (PLT-1654)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kurodo3 Bot and others added 20 commits June 25, 2026 03:45
…Visitor for extension types

- Add visit_extension() to ArrowTypeDataVisitor with passthrough default
- visit() now checks for pa.ExtensionType BEFORE struct check to prevent
  extension types with struct storage being swallowed by visit_struct
- Rewrite SemanticHashingVisitor to use type_converter + python_hasher
  instead of semantic_registry; resolves extension types via the logical
  type registry and produces pa.large_binary() tokens of the form
  <ext_name>::<method>:<digest>
- Update StarfixArrowHasher constructor to accept type_converter instead
  of semantic_registry; python_hasher resolved lazily from context to
  break the circular dependency in the JSON spec
- Update v0.1.json component ordering so type_converter is created before
  arrow_hasher (which now requires it)
- Update versioned_hashers.py, test_starfix_arrow_hasher.py, and
  test_semantic_registry.py to use the new API
- Add tests/test_hashing/test_extension_type_hashing.py with 6 tests
  covering dispatch routing, hash stability, null passthrough, and binary
  encoding format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… import

- Fix test_visit_dispatches_to_visit_extension_for_extension_types to use a real
  file (via tmp_path fixture) and call super() in visit_extension to validate the
  full dispatch chain
- Move deferred 'from typing import Any' to module-level import at top of
  visitors.py and use typing.Any in visit_extension method

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ArrowHasher constructor

- Deleted SemanticArrowHasher class (old struct-based arrow hasher)
- Renamed python_hasher parameter to semantic_hasher (required positional)
- Removed lazy resolution logic (_get_python_hasher) — semantic_hasher is now required
- Removed unused imports: arrow_serialization, arrow_utils, SemanticTypeRegistry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Added semantic_hasher=ctx.semantic_hasher to _make_hasher()
- Moved get_default_context import inside _make_hasher() (no top-level import needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…semantic_registry

- Rewrote v0.1.json: removed semantic_registry and type_handler_registry keys
- Added python_type_semantic_hasher_registry key with all type handlers
- arrow_hasher now wires in both type_converter and semantic_hasher refs
- pa.Table/pa.RecordBatch handlers added back using lazy arrow_hasher resolution
  to break the circular dep (ArrowTableSemanticHasher now accepts optional arg)
- context_schema.json: removed semantic_registry property, renamed
  type_handler_registry -> python_type_semantic_hasher_registry
- versioned_hashers.py: get_versioned_semantic_arrow_hasher() now sources both
  type_converter and semantic_hasher from default context via resolve_context()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-based hashing system

- Deleted src/orcapod/semantic_types/semantic_registry.py
- Deleted src/orcapod/semantic_types/semantic_struct_converters.py
- Removed SemanticTypeRegistry export from semantic_types/__init__.py
- Removed SemanticStructConverterProtocol from protocols/semantic_types_protocols.py
- Deleted tests/test_hashing/test_file_hashing_consistency.py (used SemanticArrowHasher)
- Deleted tests/test_semantic_types/ directory (tested deleted classes)
- Updated docstrings/comments to remove old class name references
- ArrowTableSemanticHasher: made arrow_hasher optional with lazy context resolution
  to break the circular dep (registry -> ArrowTableSemanticHasher -> arrow_hasher -> registry)
- context_schema.json: updated descriptions and examples to use new class names

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
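
The lazy resolution used to break the registry → table handler → arrow_hasher → registry cycle can be sketched as follows. Every class and function here (SketchContext, SketchArrowHasher, SketchArrowTableHandler, hash_table, and the local get_default_context) is a hypothetical stand-in. Only the idea of an optional arrow_hasher argument resolved from the context on first use, and the _get_arrow_hasher name, come from the commit messages.

```python
from __future__ import annotations


class SketchContext:
    """Stand-in for the default data context; holds the shared arrow hasher."""

    def __init__(self) -> None:
        self.arrow_hasher = None


_DEFAULT_CONTEXT = SketchContext()


def get_default_context() -> SketchContext:
    return _DEFAULT_CONTEXT


class SketchArrowTableHandler:
    """Handler whose arrow_hasher dependency is optional and resolved lazily."""

    def __init__(self, arrow_hasher=None) -> None:
        self._arrow_hasher = arrow_hasher

    def _get_arrow_hasher(self):
        # Resolve on first use, after the context has finished wiring itself.
        if self._arrow_hasher is None:
            self._arrow_hasher = get_default_context().arrow_hasher
        if self._arrow_hasher is None:
            raise RuntimeError("default context has no arrow_hasher yet")
        return self._arrow_hasher

    def handle(self, table):
        return self._get_arrow_hasher().hash_table(table)


class SketchArrowHasher:
    """Depends on the handler registry, which closes the cycle."""

    def __init__(self, handler_registry: dict) -> None:
        self.handler_registry = handler_registry

    def hash_table(self, table) -> str:
        return f"table-hash:{len(table)}"


# Construction order: handler (no hasher yet) -> registry -> hasher -> context.
handler = SketchArrowTableHandler()
handler_registry = {list: handler}
arrow_hasher = SketchArrowHasher(handler_registry)
get_default_context().arrow_hasher = arrow_hasher

result = handler.handle([1, 2, 3])
```

Nothing needs the hasher at construction time, so the three objects can be built in a straight line and the dependency is only followed when a table is actually hashed.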
…est, fix stale log message

- get_default_arrow_hasher(): remove broken set_cacher() call and cache_file_hash param;
  StarfixArrowHasher has no set_cacher method. Replaced with a simple delegate to
  get_default_context().arrow_hasher.
- semantic_hasher.py: update stale log message from SemanticHasherProtocol (non-strict)
  to SemanticAwarePythonHasher (non-strict) with more descriptive text.
- test_extension_type_hashing.py: add test_unregistered_python_type_passes_through to
  TestSemanticHashingVisitorExtension covering the branch where extension type is
  recognized but has no semantic hasher registered.

Note: Fix 2 (remove pa.Table/pa.RecordBatch from v0.1.json) was not applied because
Datagram.identity_structure() explicitly depends on ArrowTableSemanticHasher being
registered to hash pa.Table objects (documented in datagram.py docstring). Removing
these entries breaks 1 test (test_merge_join) and the fundamental design. The lazy
context resolution in ArrowTableSemanticHasher._get_arrow_hasher() already handles
the circular dependency concern raised in the review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nnotations, list element type inference, always register ArrowTableSemanticHasher
…return Any

Handlers now return a representative Python structure instead of a
ContentHash. SemanticAwarePythonHasher.hash_object() feeds the result
back into hash_object() for final hashing, treating a returned
ContentHash as a terminal (no re-hashing).

Simple built-in handlers (UUID, Bytes, Function, TypeObject,
SpecialForm, GenericAlias, UnionType) are simplified to return plain
Python values/structures. Semantic handlers that compute content-based
hashes from external data (Path, UPath, ArrowTable) continue to return
ContentHash directly, which short-circuits hashing as before.

Hash values are preserved: the extra hash_object() call is a no-op for
the simple handlers since the structure they return is identical to what
they previously delegated to hash_object() internally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
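
The handler contract described in the commit above (return a representative structure, which hash_object hashes afterwards, or return a ContentHash, which is terminal) can be sketched like this. The ContentHash fields, the handle(obj, hasher) signature, the repr-based fallback hashing, and the Blob class are assumptions for illustration and are not the orcapod implementation.

```python
from __future__ import annotations

import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentHash:
    method: str
    digest: bytes


@dataclass(frozen=True)
class Blob:
    """Hypothetical object whose identity is its external content."""

    data: bytes


class SketchUUIDHandler:
    def handle(self, obj: uuid.UUID, hasher: "SketchHasher"):
        # Simple handler: return a plain value and let hash_object hash it.
        return str(obj)


class SketchBlobHandler:
    def handle(self, obj: Blob, hasher: "SketchHasher"):
        # Content-based handler: return a ContentHash to short-circuit hashing.
        return ContentHash("sha256", hashlib.sha256(obj.data).digest())


class SketchHasher:
    def __init__(self, handlers: dict) -> None:
        self._handlers = handlers

    def hash_object(self, obj) -> ContentHash:
        if isinstance(obj, ContentHash):
            return obj  # terminal: never re-hashed
        handler = self._handlers.get(type(obj))
        if handler is not None:
            # Feed the handler's result back in for final hashing.
            return self.hash_object(handler.handle(obj, self))
        # Stand-in for the real structural hashing of plain Python values.
        return ContentHash("sha256", hashlib.sha256(repr(obj).encode("utf-8")).digest())


hasher = SketchHasher({uuid.UUID: SketchUUIDHandler(), Blob: SketchBlobHandler()})
some_id = uuid.UUID("12345678-1234-5678-1234-567812345678")
uuid_hash = hasher.hash_object(some_id)
blob_hash = hasher.hash_object(Blob(b"payload"))
```

In this sketch the UUID hash equals the hash of its string form, which mirrors the commit's claim that the extra hash_object call does not change hash values for the simple handlers.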
…peHandler, hash() → handle()

The protocol is now called PythonTypeHandler with a handle() method,
more clearly reflecting its role as a type-specific handler that returns
a representative Python structure rather than computing a ContentHash
directly.

All built-in handlers, the registry, the dispatch in
SemanticAwarePythonHasher, and all test helpers are updated accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ticHasherRegistry → PythonTypeHandlerRegistry

Mechanical rename across all source files, JSON configs, and tests:
- PathSemanticHasher → PathHandler, UPathSemanticHasher → UPathHandler,
  UUIDSemanticHasher → UUIDHandler, BytesSemanticHasher → BytesHandler,
  FunctionSemanticHasher → FunctionHandler, TypeObjectSemanticHasher → TypeObjectHandler,
  SpecialFormSemanticHasher → SpecialFormHandler,
  GenericAliasSemanticHasher → GenericAliasHandler,
  UnionTypeSemanticHasher → UnionTypeHandler,
  ArrowTableSemanticHasher → ArrowTableHandler,
  SchemaSemanticHasher → SchemaHandler
- register_builtin_python_type_semantic_hashers → register_builtin_python_type_handlers
- PythonTypeSemanticHasherRegistry → PythonTypeHandlerRegistry
- BuiltinPythonTypeSemanticHasherRegistry → BuiltinPythonTypeHandlerRegistry
- get_default_python_type_semantic_hasher_registry → get_default_python_type_handler_registry
- type_semantic_hasher_registry param/property → type_handler_registry
- JSON config keys and _class values updated accordingly

No logic changes. All 3717 tests pass.
…ol, decouple type annotations

- PythonTypeHandlerRegistry: rename get_semantic_hasher → get_handler,
  get_semantic_hasher_for_type → get_handler_for_type,
  has_semantic_hasher → has_handler; update all call sites
- hashing_protocols: add HandlerRegistryProtocol abstracting over the
  concrete registry; SemanticHasherProtocol.type_handler_registry now
  returns HandlerRegistryProtocol instead of PythonTypeHandlerRegistry;
  PythonTypeHandler.handle() now uses SemanticHasherProtocol instead of
  the concrete SemanticAwarePythonHasher; remove concrete-class imports
  from TYPE_CHECKING block
- versioned_hashers: type type_handler_registry param as
  HandlerRegistryProtocol | None instead of Any | None; drop unused
  Any import
- Update test_hashing.py and test_semantic_hasher.py for renamed methods

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ncrete types

- Rename PythonTypeHandler → PythonTypeHandlerProtocol everywhere: class
  definition in hashing_protocols.py, all type annotations in
  type_handler_registry.py, hashing/__init__.py export, and all docstring
  references across builtin_handlers.py and semantic_hashing/__init__.py
- Rename CallableWithPod → CallableWithPodProtocol in function_pod.py
- SemanticAwarePythonHasher.__init__ now accepts HandlerRegistryProtocol | None
  instead of PythonTypeHandlerRegistry | None; drop concrete-class import
- SemanticAwarePythonHasher.type_handler_registry property now returns
  HandlerRegistryProtocol instead of PythonTypeHandlerRegistry
- ContentIdentifiableMixin now imports and uses SemanticHasherProtocol instead
  of the concrete SemanticAwarePythonHasher for __init__ param and _get_hasher
  return type
- Update strict-mode error messages to say "no implementation of
  PythonTypeHandlerProtocol registered"; update matching test assertions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace all `handle()` hasher params with SemanticHasherProtocol
  (was SemanticAwarePythonHasher) across all 11 builtin handler classes
- Change register_builtin_python_type_handlers() registry param from
  PythonTypeHandlerRegistry to HandlerRegistryProtocol
- Remove concrete-class imports from TYPE_CHECKING block; import
  SemanticHasherProtocol and HandlerRegistryProtocol from protocols module
- Fix content_identifiable_mixin.py docstring example that incorrectly
  showed SemanticHasherProtocol being instantiated; replace with
  SemanticAwarePythonHasher (the concrete class)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add register() and __len__() to HandlerRegistryProtocol so the protocol
  matches every method called on registry inside
  register_builtin_python_type_handlers(); previously HandlerRegistryProtocol
  only declared the lookup side of the interface (get_handler,
  get_handler_for_type, has_handler), leaving register() and len() untyped
- Fix test_uuid_handler.py module docstring: s/hash() method behaviour/
  handle() dispatch via SemanticAwarePythonHasher/ — UUIDHandler implements
  handle(), not hash(), and the tests exercise hash_object() dispatch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
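
For readers unfamiliar with the pattern, a protocol covering both the lookup side (get_handler, get_handler_for_type, has_handler) and the registration side (register, __len__) might be declared roughly as below. The method names are taken from the commit messages. The parameter names, return types, MRO-based lookup, and the DictRegistry class are assumptions for illustration.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class HandlerRegistryProtocol(Protocol):
    """Sketch: lookup and registration sides of the registry interface."""

    def register(self, target_type: type, handler: Any) -> None: ...

    def get_handler(self, obj: Any) -> Any: ...

    def get_handler_for_type(self, target_type: type) -> Any: ...

    def has_handler(self, target_type: type) -> bool: ...

    def __len__(self) -> int: ...


class DictRegistry:
    """Hypothetical concrete registry; satisfies the protocol structurally."""

    def __init__(self) -> None:
        self._handlers: dict[type, Any] = {}

    def register(self, target_type: type, handler: Any) -> None:
        self._handlers[target_type] = handler

    def get_handler(self, obj: Any) -> Any:
        return self.get_handler_for_type(type(obj))

    def get_handler_for_type(self, target_type: type) -> Any:
        # Walk the MRO so a handler for a base class also covers subclasses.
        for klass in target_type.__mro__:
            if klass in self._handlers:
                return self._handlers[klass]
        return None

    def has_handler(self, target_type: type) -> bool:
        return self.get_handler_for_type(target_type) is not None

    def __len__(self) -> int:
        return len(self._handlers)


def register_builtin_handlers(registry: HandlerRegistryProtocol) -> int:
    """Depends only on the protocol, so any conforming registry works."""
    registry.register(int, "int-handler")
    return len(registry)


registry = DictRegistry()
count = register_builtin_handlers(registry)
```

Typing the helper against the protocol rather than the concrete class is what lets register() and len() be checked without importing the registry implementation.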
Update all docstrings and type annotations to consistently use protocol
types instead of concrete implementation types:

- versioned_hashers.py: fix summary ("SemanticAwarePythonHasher" →
  "SemanticHasherProtocol") and type_handler_registry param description
  ("PythonTypeHandlerRegistry" → "HandlerRegistryProtocol") to match
  the hp.HandlerRegistryProtocol annotation already on the parameter
- arrow_hashers.py: change semantic_hasher param annotation and docstring
  from SemanticAwarePythonHasher to SemanticHasherProtocol; update
  TYPE_CHECKING import accordingly
- visitors.py: same for python_hasher param in SemanticHashingVisitor
- defaults.py: update "owned by SemanticAwarePythonHasher" to
  "owned by SemanticHasherProtocol" in get_default_python_type_handler_registry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
def semantic_hasher(file_hasher):
    """BaseSemanticHasher wired with the shared file_hasher via PathContentHandler."""
    registry = TypeHandlerRegistry()
    register_builtin_handlers(registry, file_hasher=file_hasher)

Reviewer (Contributor):

Why was this file entirely removed? This still sounds like a perfectly legitimate and necessary test?

Reply (Contributor Author):

The file was deleted in the PLT-1660 hard cut commit because every class it tested — SemanticArrowHasher, SemanticTypeRegistry, PythonPathStructConverter — was removed as part of that cut. Keeping the file would have left broken imports.

The individual behaviors were covered in the replacement tests:

  • Arrow path same-content/different-content hashing → TestSemanticHashingVisitorExtension in test_extension_type_hashing.py
  • Python path Path hashing → TestSemanticPathHashing in test_semantic_hasher.py

What wasn't ported was the explicit cross-path consistency check (that the Arrow visitor and the direct Python hasher embed the same digest for the same file). It's guaranteed structurally — visit_extension calls hash_object on the recovered Python object, the same method the direct path uses — but the explicit test was missing.

Added TestCrossPathConsistency to test_extension_type_hashing.py in f73dcba with two tests that make this guarantee explicit and regression-proof.

…shing

The PLT-1660 hard cut deleted test_file_hashing_consistency.py because it
depended on SemanticArrowHasher, SemanticTypeRegistry, and
PythonPathStructConverter — all of which were removed. The test intent
(cross-path consistency between the Arrow and Python hashing paths) was
not ported to the new extension type system.

Add TestCrossPathConsistency to test_extension_type_hashing.py:
- test_arrow_and_semantic_hash_same_file_content: verifies that
  SemanticHashingVisitor.visit_extension and semantic_hasher.hash_object
  embed the same prefixed digest for the same file content.
- test_same_content_two_files_cross_path: verifies that two files with
  identical content produce matching hash tokens across both paths.

Both paths call hash_object on the same Python object by construction,
so this is a structural guarantee — but now also an explicit regression test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
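
The shape of the two cross-path checks can be sketched with stand-ins for both hashing paths. Here hash_object and visit_extension are simplified, hypothetical versions of the real methods, the "orcapod.path" extension name is assumed, and the token layout follows the <ext_name>::<method>:<digest> form mentioned in an earlier commit message. The actual tests in test_extension_type_hashing.py are not reproduced.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path

EXT_NAME = "orcapod.path"  # assumed extension name


def hash_object(path: Path) -> tuple[str, str]:
    """Stand-in for the Python path: (method, hex digest) of the file content."""
    return ("sha256", hashlib.sha256(path.read_bytes()).hexdigest())


def visit_extension(ext_name: str, stored_path: str) -> bytes:
    """Stand-in for the Arrow path: recover the Python object, then reuse hash_object."""
    method, digest = hash_object(Path(stored_path))
    return f"{ext_name}::{method}:{digest}".encode("utf-8")


def run_cross_path_checks() -> dict[str, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "a.txt"
        second = Path(tmp) / "b.txt"
        first.write_bytes(b"same content")
        second.write_bytes(b"same content")

        method, digest = hash_object(first)
        token_first = visit_extension(EXT_NAME, str(first))
        token_second = visit_extension(EXT_NAME, str(second))
        expected = f"{EXT_NAME}::{method}:{digest}".encode("utf-8")

        return {
            # Arrow token embeds the digest the Python path computes for the same file.
            "same_file": token_first == expected,
            # Two files with identical content yield matching tokens.
            "two_files": token_first == token_second,
        }


results = run_cross_path_checks()
```

Because the Arrow stand-in delegates to the same hash_object function, both checks hold by construction, which is the structural guarantee the commit turns into an explicit regression test.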
kurodo3 Bot (Contributor Author) commented Jun 25, 2026

Round 1 review addressed.

| Comment | File | Action |
| --- | --- | --- |
| Why was test_file_hashing_consistency.py entirely removed? | tests/test_hashing/test_file_hashing_consistency.py | Added TestCrossPathConsistency to test_extension_type_hashing.py (f73dcba) |

Details: The file was deleted in the PLT-1660 hard cut because it depended entirely on deleted classes (SemanticArrowHasher, SemanticTypeRegistry, PythonPathStructConverter). The individual Arrow-path and Python-path behaviors were covered in test_extension_type_hashing.py and test_semantic_hasher.py respectively, but the explicit cross-path consistency check (Arrow visitor and Python hasher produce the same hash token for the same file) was not ported. Added TestCrossPathConsistency with two tests to make that structural guarantee explicit.
