feat(extension-types): Arrow/Polars extension type semantic type system (PLT-1663) #183
… on external registrations
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lars exc chain (PLT-1653)
- Change `from exc` to `from None` in _register_polars_ext_type to suppress internal Polars error details (matches Arrow helper pattern)
- Add docstring to _sanitize function documenting its purpose
- Replace all double-backtick RST notation with single-backtick Google style throughout registry.py (docstrings in functions and class methods)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
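The difference between `from exc` and `from None` can be illustrated with a minimal sketch. `RegistrationError`, `_backend_register`, and the two `register_*` helpers are hypothetical names invented for this example; the real `_register_polars_ext_type` body is not shown in this PR conversation.

```python
class RegistrationError(Exception):
    """Hypothetical error type; the PR's real exception class is not shown."""


def _backend_register(name: str) -> None:
    # Stand-in for a backend call that fails with internal details.
    raise ValueError(f"internal backend detail for {name!r}")


def register_chained(name: str) -> None:
    try:
        _backend_register(name)
    except ValueError as exc:
        # `from exc` keeps the backend error attached as __cause__.
        raise RegistrationError(f"could not register {name!r}") from exc


def register_suppressed(name: str) -> None:
    try:
        _backend_register(name)
    except ValueError:
        # `from None` clears __cause__ and hides the context in tracebacks.
        raise RegistrationError(f"could not register {name!r}") from None


def capture(fn, name: str) -> RegistrationError:
    """Run fn(name) and return the RegistrationError it raises."""
    try:
        fn(name)
    except RegistrationError as err:
        return err
    raise AssertionError("expected RegistrationError")
```

With `from None`, the raised error has `__cause__` set to `None` and `__suppress_context__` set to `True`, so the rendered traceback no longer shows the backend exception.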
…instance (PLT-1653)
Add public exports for ExtensionTypeRegistry and the module-level instance extension_type_registry to the extension_types package. This enables users to access the registry directly from orcapod.extension_types.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…etadata collision test
- Add `from __future__ import annotations` to extension_types/__init__.py after module docstring
- Add missing test_register_polars_global_collision_different_metadata_raises test to match PyArrow test coverage
Closes PLT-1653
…larify deserialize semantics, use .storage API (PLT-1653)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lker
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…op-level detection (PLT-1654)
…ify dedup test (PLT-1654)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nce check (PLT-1654)
…Visitor for extension types
- Add visit_extension() to ArrowTypeDataVisitor with passthrough default
- visit() now checks for pa.ExtensionType BEFORE struct check to prevent extension types with struct storage being swallowed by visit_struct
- Rewrite SemanticHashingVisitor to use type_converter + python_hasher instead of semantic_registry; resolves extension types via the logical type registry and produces pa.large_binary() tokens of the form <ext_name>::<method>:<digest>
- Update StarfixArrowHasher constructor to accept type_converter instead of semantic_registry; python_hasher resolved lazily from context to break the circular dependency in the JSON spec
- Update v0.1.json component ordering so type_converter is created before arrow_hasher (which now requires it)
- Update versioned_hashers.py, test_starfix_arrow_hasher.py, and test_semantic_registry.py to use the new API
- Add tests/test_hashing/test_extension_type_hashing.py with 6 tests covering dispatch routing, hash stability, null passthrough, and binary encoding format
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
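The dispatch ordering and token format described in this commit can be sketched without PyArrow. Everything below is illustrative: `StructType`, `ExtensionType`, `Visitor`, and `HashingVisitor` are hypothetical stand-ins, and hashing `repr(value)` with SHA-256 is an assumption made for the example, not the project's hashing method.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StructType:
    """Hypothetical stand-in for an Arrow struct type."""
    fields: tuple


@dataclass(frozen=True)
class ExtensionType:
    """Hypothetical stand-in for pa.ExtensionType: a name plus a storage type."""
    name: str
    storage: object


class Visitor:
    def visit(self, dtype, value):
        # The extension check comes first: an extension type whose storage is
        # a struct must not fall through to visit_struct.
        if isinstance(dtype, ExtensionType):
            return self.visit_extension(dtype, value)
        if isinstance(dtype, StructType):
            return self.visit_struct(dtype, value)
        return value

    def visit_extension(self, dtype, value):
        return value  # passthrough default

    def visit_struct(self, dtype, value):
        return value


class HashingVisitor(Visitor):
    def visit_extension(self, dtype, value):
        if value is None:
            return None  # nulls pass through unhashed
        # Assumption: sha256 over repr(value) stands in for the real hasher.
        digest = hashlib.sha256(repr(value).encode()).hexdigest()
        # Token layout from the commit message: <ext_name>::<method>:<digest>
        return f"{dtype.name}::sha256:{digest}".encode()
```

If the struct check ran first in a real Arrow visitor, an extension type with struct storage could be routed to `visit_struct` and never reach the hashing branch, which is the failure mode the commit describes.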
… import
- Fix test_visit_dispatches_to_visit_extension_for_extension_types to use a real file (via tmp_path fixture) and call super() in visit_extension to validate the full dispatch chain
- Move deferred 'from typing import Any' to module-level import at top of visitors.py and use typing.Any in visit_extension method
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ArrowHasher constructor
- Deleted SemanticArrowHasher class (old struct-based arrow hasher)
- Renamed python_hasher parameter to semantic_hasher (required positional)
- Removed lazy resolution logic (_get_python_hasher); semantic_hasher is now required
- Removed unused imports: arrow_serialization, arrow_utils, SemanticTypeRegistry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Added semantic_hasher=ctx.semantic_hasher to _make_hasher()
- Moved get_default_context import inside _make_hasher() (no top-level import needed)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…semantic_registry
- Rewrote v0.1.json: removed semantic_registry and type_handler_registry keys
- Added python_type_semantic_hasher_registry key with all type handlers
- arrow_hasher now wires in both type_converter and semantic_hasher refs
- pa.Table/pa.RecordBatch handlers added back using lazy arrow_hasher resolution to break the circular dep (ArrowTableSemanticHasher now accepts optional arg)
- context_schema.json: removed semantic_registry property, renamed type_handler_registry -> python_type_semantic_hasher_registry
- versioned_hashers.py: get_versioned_semantic_arrow_hasher() now sources both type_converter and semantic_hasher from default context via resolve_context()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-based hashing system
- Deleted src/orcapod/semantic_types/semantic_registry.py
- Deleted src/orcapod/semantic_types/semantic_struct_converters.py
- Removed SemanticTypeRegistry export from semantic_types/__init__.py
- Removed SemanticStructConverterProtocol from protocols/semantic_types_protocols.py
- Deleted tests/test_hashing/test_file_hashing_consistency.py (used SemanticArrowHasher)
- Deleted tests/test_semantic_types/ directory (tested deleted classes)
- Updated docstrings/comments to remove old class name references
- ArrowTableSemanticHasher: made arrow_hasher optional with lazy context resolution to break the circular dep (registry -> ArrowTableSemanticHasher -> arrow_hasher -> registry)
- context_schema.json: updated descriptions and examples to use new class names
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
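The lazy resolution used to break the cycle (registry -> handler -> arrow_hasher -> registry) might look roughly like the sketch below. All class names here (`ArrowHasher`, `TableHandler`, `Registry`, `Context`) and the `resolve` callback are hypothetical simplifications for illustration; the PR's actual `_get_arrow_hasher()` resolves from the default context, whose API is not shown here.

```python
from typing import Callable, Optional


class Registry:
    """Hypothetical minimal handler registry."""

    def __init__(self) -> None:
        self.handlers: dict = {}

    def register(self, target_type: type, handler: object) -> None:
        self.handlers[target_type] = handler


class ArrowHasher:
    """Hypothetical arrow hasher; it needs the registry at construction time."""

    def __init__(self, registry: Registry) -> None:
        self.registry = registry

    def hash_table(self, table: list) -> str:
        return f"table:{len(table)}"  # placeholder, not a real hash


class TableHandler:
    """Handler whose arrow_hasher is optional and resolved on first use."""

    def __init__(
        self,
        arrow_hasher: Optional[ArrowHasher] = None,
        resolve: Optional[Callable[[], ArrowHasher]] = None,
    ) -> None:
        self._arrow_hasher = arrow_hasher
        self._resolve = resolve

    def _get_arrow_hasher(self) -> ArrowHasher:
        if self._arrow_hasher is None:
            if self._resolve is None:
                raise RuntimeError("no arrow hasher available")
            # Deferred lookup: by now the context is fully constructed.
            self._arrow_hasher = self._resolve()
        return self._arrow_hasher

    def handle(self, table: list) -> str:
        return self._get_arrow_hasher().hash_table(table)


class Context:
    """Hypothetical context that wires the components in dependency order."""

    def __init__(self) -> None:
        self.registry = Registry()
        # The handler is registered before arrow_hasher exists.
        self.registry.register(list, TableHandler(resolve=lambda: self.arrow_hasher))
        self.arrow_hasher = ArrowHasher(self.registry)
```

Because the handler only looks up the hasher when `handle()` is first called, the registry can be built before the component that depends on it.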
…est, fix stale log message
- get_default_arrow_hasher(): remove broken set_cacher() call and cache_file_hash param; StarfixArrowHasher has no set_cacher method. Replaced with a simple delegate to get_default_context().arrow_hasher.
- semantic_hasher.py: update stale log message from SemanticHasherProtocol (non-strict) to SemanticAwarePythonHasher (non-strict) with more descriptive text.
- test_extension_type_hashing.py: add test_unregistered_python_type_passes_through to TestSemanticHashingVisitorExtension covering the branch where extension type is recognized but has no semantic hasher registered.
Note: Fix 2 (remove pa.Table/pa.RecordBatch from v0.1.json) was not applied because Datagram.identity_structure() explicitly depends on ArrowTableSemanticHasher being registered to hash pa.Table objects (documented in datagram.py docstring). Removing these entries breaks 1 test (test_merge_join) and the fundamental design. The lazy context resolution in ArrowTableSemanticHasher._get_arrow_hasher() already handles the circular dependency concern raised in the review.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nnotations, list element type inference, always register ArrowTableSemanticHasher
…return Any
Handlers now return a representative Python structure instead of a ContentHash. SemanticAwarePythonHasher.hash_object() feeds the result back into hash_object() for final hashing, treating a returned ContentHash as a terminal (no re-hashing).
Simple built-in handlers (UUID, Bytes, Function, TypeObject, SpecialForm, GenericAlias, UnionType) are simplified to return plain Python values/structures. Semantic handlers that compute content-based hashes from external data (Path, UPath, ArrowTable) continue to return ContentHash directly, which short-circuits hashing as before.
Hash values are preserved: the extra hash_object() call is a no-op for the simple handlers since the structure they return is identical to what they previously delegated to hash_object() internally.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
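A minimal sketch of the handler contract described in this commit follows. `MiniHasher`, `FakeFile`, and both handler functions are hypothetical, and JSON plus SHA-256 is an assumed stand-in for the real serialization and hashing; only the control flow (handler result fed back into `hash_object()`, `ContentHash` treated as terminal) mirrors the commit message.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentHash:
    """Simplified stand-in for the project's ContentHash."""
    method: str
    digest: str


@dataclass(frozen=True)
class FakeFile:
    """Hypothetical object whose identity is its content."""
    content: bytes


class MiniHasher:
    def __init__(self) -> None:
        self._handlers: dict = {}

    def register(self, target_type: type, handler) -> None:
        self._handlers[target_type] = handler

    def hash_object(self, obj) -> ContentHash:
        if isinstance(obj, ContentHash):
            return obj  # terminal: a returned ContentHash is never re-hashed
        handler = self._handlers.get(type(obj))
        if handler is not None:
            # Feed the handler's representative structure back in.
            return self.hash_object(handler(obj, self))
        payload = json.dumps(obj, sort_keys=True).encode()
        return ContentHash("sha256", hashlib.sha256(payload).hexdigest())


def uuid_handler(obj: uuid.UUID, hasher: MiniHasher) -> str:
    # Simple handler: return a plain Python value and let the hasher finish.
    return str(obj)


def file_handler(obj: FakeFile, hasher: MiniHasher) -> ContentHash:
    # Content-based handler: compute the hash directly and short-circuit.
    return ContentHash("sha256", hashlib.sha256(obj.content).hexdigest())
```

In this sketch, hashing a UUID gives the same result as hashing its string form, which is the sense in which the extra `hash_object()` call leaves hash values unchanged for simple handlers.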
…peHandler, hash() → handle()
The protocol is now called PythonTypeHandler with a handle() method, more clearly reflecting its role as a type-specific handler that returns a representative Python structure rather than computing a ContentHash directly. All built-in handlers, the registry, the dispatch in SemanticAwarePythonHasher, and all test helpers are updated accordingly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ticHasherRegistry → PythonTypeHandlerRegistry
Mechanical rename across all source files, JSON configs, and tests:
- PathSemanticHasher → PathHandler, UPathSemanticHasher → UPathHandler, UUIDSemanticHasher → UUIDHandler, BytesSemanticHasher → BytesHandler, FunctionSemanticHasher → FunctionHandler, TypeObjectSemanticHasher → TypeObjectHandler, SpecialFormSemanticHasher → SpecialFormHandler, GenericAliasSemanticHasher → GenericAliasHandler, UnionTypeSemanticHasher → UnionTypeHandler, ArrowTableSemanticHasher → ArrowTableHandler, SchemaSemanticHasher → SchemaHandler
- register_builtin_python_type_semantic_hashers → register_builtin_python_type_handlers
- PythonTypeSemanticHasherRegistry → PythonTypeHandlerRegistry
- BuiltinPythonTypeSemanticHasherRegistry → BuiltinPythonTypeHandlerRegistry
- get_default_python_type_semantic_hasher_registry → get_default_python_type_handler_registry
- type_semantic_hasher_registry param/property → type_handler_registry
- JSON config keys and _class values updated accordingly
No logic changes. All 3717 tests pass.
…thonHasher in comments
…er in v0.1 config
…ol, decouple type annotations
- PythonTypeHandlerRegistry: rename get_semantic_hasher → get_handler, get_semantic_hasher_for_type → get_handler_for_type, has_semantic_hasher → has_handler; update all call sites
- hashing_protocols: add HandlerRegistryProtocol abstracting over the concrete registry; SemanticHasherProtocol.type_handler_registry now returns HandlerRegistryProtocol instead of PythonTypeHandlerRegistry; PythonTypeHandler.handle() now uses SemanticHasherProtocol instead of the concrete SemanticAwarePythonHasher; remove concrete-class imports from TYPE_CHECKING block
- versioned_hashers: type type_handler_registry param as HandlerRegistryProtocol | None instead of Any | None; drop unused Any import
- Update test_hashing.py and test_semantic_hasher.py for renamed methods
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ncrete types
- Rename PythonTypeHandler → PythonTypeHandlerProtocol everywhere: class definition in hashing_protocols.py, all type annotations in type_handler_registry.py, hashing/__init__.py export, and all docstring references across builtin_handlers.py and semantic_hashing/__init__.py
- Rename CallableWithPod → CallableWithPodProtocol in function_pod.py
- SemanticAwarePythonHasher.__init__ now accepts HandlerRegistryProtocol | None instead of PythonTypeHandlerRegistry | None; drop concrete-class import
- SemanticAwarePythonHasher.type_handler_registry property now returns HandlerRegistryProtocol instead of PythonTypeHandlerRegistry
- ContentIdentifiableMixin now imports and uses SemanticHasherProtocol instead of the concrete SemanticAwarePythonHasher for __init__ param and _get_hasher return type
- Update strict-mode error messages to say "no implementation of PythonTypeHandlerProtocol registered"; update matching test assertions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace all `handle()` hasher params with SemanticHasherProtocol (was SemanticAwarePythonHasher) across all 11 builtin handler classes
- Change register_builtin_python_type_handlers() registry param from PythonTypeHandlerRegistry to HandlerRegistryProtocol
- Remove concrete-class imports from TYPE_CHECKING block; import SemanticHasherProtocol and HandlerRegistryProtocol from protocols module
- Fix content_identifiable_mixin.py docstring example that incorrectly showed SemanticHasherProtocol being instantiated; replace with SemanticAwarePythonHasher (the concrete class)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add register() and __len__() to HandlerRegistryProtocol so the protocol matches every method called on registry inside register_builtin_python_type_handlers(); previously HandlerRegistryProtocol only declared the lookup side of the interface (get_handler, get_handler_for_type, has_handler), leaving register() and len() untyped
- Fix test_uuid_handler.py module docstring: s/hash() method behaviour/handle() dispatch via SemanticAwarePythonHasher/; UUIDHandler implements handle(), not hash(), and the tests exercise hash_object() dispatch
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
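As a rough illustration of the protocol surface named in these commits, a structural registry protocol could be sketched as below. Only the method names come from the commit messages; the parameter names, return types, the MRO-based lookup, and the `DictRegistry` class are all assumptions made for the example.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class HandlerRegistryProtocol(Protocol):
    """Sketch: method names from the commits, signatures assumed."""

    def register(self, target_type: type, handler: Any) -> None: ...
    def get_handler(self, obj: Any) -> Optional[Any]: ...
    def get_handler_for_type(self, target_type: type) -> Optional[Any]: ...
    def has_handler(self, target_type: type) -> bool: ...
    def __len__(self) -> int: ...


class DictRegistry:
    """Hypothetical concrete registry that satisfies the protocol structurally."""

    def __init__(self) -> None:
        self._handlers: dict = {}

    def register(self, target_type: type, handler: Any) -> None:
        self._handlers[target_type] = handler

    def get_handler_for_type(self, target_type: type) -> Optional[Any]:
        # Assumption: walk the MRO so subclasses inherit a base-class handler.
        for base in target_type.__mro__:
            if base in self._handlers:
                return self._handlers[base]
        return None

    def get_handler(self, obj: Any) -> Optional[Any]:
        return self.get_handler_for_type(type(obj))

    def has_handler(self, target_type: type) -> bool:
        return self.get_handler_for_type(target_type) is not None

    def __len__(self) -> int:
        return len(self._handlers)
```

Declaring `register()` and `__len__()` on the protocol means a function typed against `HandlerRegistryProtocol` can call them without the type checker falling back to the concrete class.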
Update all docstrings and type annotations to consistently use protocol
types instead of concrete implementation types:
- versioned_hashers.py: fix summary ("SemanticAwarePythonHasher" →
"SemanticHasherProtocol") and type_handler_registry param description
("PythonTypeHandlerRegistry" → "HandlerRegistryProtocol") to match
the hp.HandlerRegistryProtocol annotation already on the parameter
- arrow_hashers.py: change semantic_hasher param annotation and docstring
from SemanticAwarePythonHasher to SemanticHasherProtocol; update
TYPE_CHECKING import accordingly
- visitors.py: same for python_hasher param in SemanticHashingVisitor
- defaults.py: update "owned by SemanticAwarePythonHasher" to
"owned by SemanticHasherProtocol" in get_default_python_type_handler_registry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
def semantic_hasher(file_hasher):
    """BaseSemanticHasher wired with the shared file_hasher via PathContentHandler."""
    registry = TypeHandlerRegistry()
    register_builtin_handlers(registry, file_hasher=file_hasher)
Why was this file entirely removed? This still sounds like a perfectly legitimate and necessary test?
The file was deleted in the PLT-1660 hard cut commit because every class it tested — SemanticArrowHasher, SemanticTypeRegistry, PythonPathStructConverter — was removed as part of that cut. Keeping the file would have left broken imports.
The individual behaviors were covered in the replacement tests:
- Arrow path same-content/different-content hashing → `TestSemanticHashingVisitorExtension` in `test_extension_type_hashing.py`
- Python path Path hashing → `TestSemanticPathHashing` in `test_semantic_hasher.py`
What wasn't ported was the explicit cross-path consistency check (that the Arrow visitor and the direct Python hasher embed the same digest for the same file). It's guaranteed structurally — visit_extension calls hash_object on the recovered Python object, the same method the direct path uses — but the explicit test was missing.
Added TestCrossPathConsistency to test_extension_type_hashing.py in f73dcba with two tests that make this guarantee explicit and regression-proof.
…shing
The PLT-1660 hard cut deleted test_file_hashing_consistency.py because it depended on SemanticArrowHasher, SemanticTypeRegistry, and PythonPathStructConverter, all of which were removed. The test intent (cross-path consistency between the Arrow and Python hashing paths) was not ported to the new extension type system.
Add TestCrossPathConsistency to test_extension_type_hashing.py:
- test_arrow_and_semantic_hash_same_file_content: verifies that SemanticHashingVisitor.visit_extension and semantic_hasher.hash_object embed the same prefixed digest for the same file content.
- test_same_content_two_files_cross_path: verifies that two files with identical content produce matching hash tokens across both paths.
Both paths call hash_object on the same Python object by construction, so this is a structural guarantee, but it is now also an explicit regression test.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
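The shape of such a cross-path check can be sketched with the standard library alone. The functions below are hypothetical stand-ins rather than the project's `SemanticHashingVisitor` or `semantic_hasher`; the extension name `orcapod.path` and the use of SHA-256 are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def hash_object(path: Path) -> str:
    """Stand-in for the direct Python path: content-based, method-prefixed digest."""
    return "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()


def visit_extension(ext_name: str, storage: str) -> bytes:
    """Stand-in for the Arrow path: recover the Python object, then reuse hash_object."""
    recovered = Path(storage)
    return f"{ext_name}::{hash_object(recovered)}".encode()


def check_cross_path(tmp_dir: Path) -> bytes:
    a = tmp_dir / "a.txt"
    b = tmp_dir / "b.txt"
    a.write_bytes(b"same content")
    b.write_bytes(b"same content")
    token_a = visit_extension("orcapod.path", str(a))
    # Same file: the Arrow token embeds the digest the direct path produces.
    assert token_a.decode().endswith(hash_object(a))
    # Two files with identical content: tokens match across paths.
    assert visit_extension("orcapod.path", str(b)) == token_a
    return token_a
```

In a pytest suite, `tmp_dir` would typically come from the `tmp_path` fixture that the earlier test fix in this PR already uses.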
Round 1 review addressed.
Details: The file was deleted in the PLT-1660 hard cut because it depended entirely on deleted classes ( |
Summary
This PR merges the `extension-type-system` integration branch into `main`, completing the Arrow/Polars Extension Type Semantic Type System project.

All sub-issue PRs targeted the `extension-type-system` integration branch and have already been merged there. This is the single gate PR that lands the full system on `main`.

What's included
New extension type infrastructure (`src/orcapod/extension_types/`)
- `protocols.py`: `LogicalType` protocol (Arrow I/O contract: `extension_name`, `extension_metadata`, `storage_type`, `python_type`, `python_to_storage`, `storage_to_python`, `get_polars_extension_type`); replaces the old `ExtensionTypeConverter` prototype
- `registry.py`: `LogicalTypeRegistry` with per-process cache, factory dispatch, PyArrow + Polars registration, and the peek-schema → register → read pattern
- `schema_walker.py`: recursive Arrow schema walker that discovers extension fields in nested/map/list schemas
- `builtin_logical_types.py`: `LogicalPath`, `LogicalUPath`, `LogicalUUID` registered under the `orcapod.*` namespace
- `dataclass_logical_type_factory.py`: `DataclassLogicalTypeFactory` / `DataclassLogicalType` for automatic dataclass ↔ Arrow struct round-trips
- `pydantic_logical_type_factory.py`: `PydanticLogicalTypeFactory` / `PydanticLogicalType` (pydantic is now a required dep)
- `database_hooks.py`: `ensure_extensions_registered` hook for read paths
- `type_utils.py`: FQCN import helpers

System integration
- `UniversalTypeConverter`: `register_python_class`, `register_storage_type`, `python_to_storage`, `storage_to_python`; extension-type identity takes priority over shape-based dispatch
- `DataContext` / v0.1 context: `LogicalTypeRegistry` wired in; `DataclassLogicalTypeFactory` and `PydanticLogicalTypeFactory` auto-registered
- `FunctionPod.__init__`: write-side auto-registration trigger
- `ConnectorArrowDatabase` / `DeltaTableDatabase`: `ensure_extensions_registered` called on every read

Hard cut (PLT-1660)
- Removed `SemanticTypeRegistry` and the old shape-based struct type system entirely
- `BaseSemanticHasher` → `SemanticAwarePythonHasher`, `TypeHandlerProtocol` → `PythonTypeHandler`, `PythonTypeSemanticHasherRegistry` → `PythonTypeHandlerRegistry`
- `SemanticHashingVisitor` rewritten for extension-type dispatch (visits extension nodes, not struct shapes)

Public API additions
- `orcapod.Path`, `orcapod.UPath`, `orcapod.UUID`: stable top-level aliases
- `load_extension_types()` convenience method on `UniversalTypeConverter`

Code audit
Per PLT-1663 success criteria, no old naming survives in production code:
- `ExtensionTypeConverter` → `LogicalType` protocol
- `ExtensionTypeRegistry` → `LogicalTypeRegistry`
- `SemanticTypeRegistry`
- `BaseSemanticHasher`

Tests
CI status
All standard CI checks pass (unit tests on Python 3.11 and 3.12, license check).
The `spiral-integration` check is expected to pass now that the branch has been rebased onto `main` (picking up PLT-1773: pyspiral 0.11.7 → 0.14.9 upgrade).

Related issues
Closes PLT-1663
Closes PLT-1652
Closes PLT-1653
Closes PLT-1654
Closes PLT-1655
Closes PLT-1656
Closes PLT-1657
Closes PLT-1659
Closes PLT-1660
Closes PLT-1661
Closes PLT-1662
Closes PLT-1668
Closes PLT-1670
Closes PLT-1672
Closes PLT-1701
Closes PLT-1705
Closes PLT-1720
Closes PLT-1731
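
For readers unfamiliar with the `LogicalType` contract listed in the summary, the sketch below shows one way such a protocol could be expressed and satisfied. Only the seven member names come from the PR description; every type annotation, the `SketchUUID` class, the `orcapod.uuid` name, and the string used for `storage_type` are assumptions, and no PyArrow or Polars objects are involved.

```python
import uuid
from typing import Any, Protocol


class LogicalType(Protocol):
    """Sketch: member names from the PR summary, types assumed."""

    extension_name: str
    extension_metadata: bytes
    storage_type: Any
    python_type: type

    def python_to_storage(self, value: Any) -> Any: ...
    def storage_to_python(self, storage: Any) -> Any: ...
    def get_polars_extension_type(self) -> Any: ...


class SketchUUID:
    """Hypothetical logical type that stores a UUID as 16 raw bytes."""

    extension_name = "orcapod.uuid"  # assumed name under the orcapod.* namespace
    extension_metadata = b""
    storage_type = "fixed_size_binary(16)"  # placeholder for an Arrow DataType
    python_type = uuid.UUID

    def python_to_storage(self, value: uuid.UUID) -> bytes:
        return value.bytes

    def storage_to_python(self, storage: bytes) -> uuid.UUID:
        return uuid.UUID(bytes=storage)

    def get_polars_extension_type(self) -> None:
        return None  # the Polars side is omitted in this sketch


def round_trip(logical: LogicalType, value: Any) -> Any:
    """Convert to storage and back; a lossless type returns an equal value."""
    return logical.storage_to_python(logical.python_to_storage(value))
```

The round-trip property (`storage_to_python(python_to_storage(x)) == x`) is the natural invariant to test for any logical type written against a contract of this shape.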
🤖 Generated with Claude Code