feat(PLT-1660): hard cut — delete old semantic type system, wire extension type hashing#182
Conversation
Covers: visitor visit_extension dispatch, SemanticHashingVisitor rewrite, StarfixArrowHasher constructor update, renames (BaseSemanticHasher -> SemanticAwarePythonHasher, TypeHandlerRegistry -> PythonTypeHandlerRegistry), v0.1.json / context_schema.json changes, and deletion plan for old SemanticTypeRegistry / SemanticStructConverter files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Folds in: TypeHandlerProtocol.handle()->Any tightened to PythonTypeSemanticHasherProtocol.hash()->ContentHash; all builtin handlers renamed to *SemanticHasher and updated to return ContentHash directly; registry renamed to PythonTypeSemanticHasherRegistry with updated method names. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nd colon namespace Extension name dots replaced with colons (orcapod.path -> orcapod:path); "::" used as separator between type prefix and hash so the boundary is unambiguous (to_prefixed_digest uses only single ":"). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…emanticHasherProtocol, tighten hash() → ContentHash
…gistry, rename methods
…ythonHasher, simplify dispatch Also updates defaults.py: replaces get_default_type_handler_registry with get_default_python_type_semantic_hasher_registry to match the new registry API.
…in mixin and core
…ix hash() return type - Renamed _DummyHandler class to _DummySemanticHasher for clarity - Changed hash() method to return ContentHash via hasher.hash_object() instead of raw string - Updated all 13 usages throughout the test class Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Visitor for extension types - Add visit_extension() to ArrowTypeDataVisitor with passthrough default - visit() now checks for pa.ExtensionType BEFORE struct check to prevent extension types with struct storage being swallowed by visit_struct - Rewrite SemanticHashingVisitor to use type_converter + python_hasher instead of semantic_registry; resolves extension types via the logical type registry and produces pa.large_binary() tokens of the form <ext_name>::<method>:<digest> - Update StarfixArrowHasher constructor to accept type_converter instead of semantic_registry; python_hasher resolved lazily from context to break the circular dependency in the JSON spec - Update v0.1.json component ordering so type_converter is created before arrow_hasher (which now requires it) - Update versioned_hashers.py, test_starfix_arrow_hasher.py, and test_semantic_registry.py to use the new API - Add tests/test_hashing/test_extension_type_hashing.py with 6 tests covering dispatch routing, hash stability, null passthrough, and binary encoding format Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… import - Fix test_visit_dispatches_to_visit_extension_for_extension_types to use a real file (via tmp_path fixture) and call super() in visit_extension to validate the full dispatch chain - Move deferred 'from typing import Any' to module-level import at top of visitors.py and use typing.Any in visit_extension method Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ArrowHasher constructor - Deleted SemanticArrowHasher class (old struct-based arrow hasher) - Renamed python_hasher parameter to semantic_hasher (required positional) - Removed lazy resolution logic (_get_python_hasher) — semantic_hasher is now required - Removed unused imports: arrow_serialization, arrow_utils, SemanticTypeRegistry Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Added semantic_hasher=ctx.semantic_hasher to _make_hasher() - Moved get_default_context import inside _make_hasher() (no top-level import needed) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…semantic_registry - Rewrote v0.1.json: removed semantic_registry and type_handler_registry keys - Added python_type_semantic_hasher_registry key with all type handlers - arrow_hasher now wires in both type_converter and semantic_hasher refs - pa.Table/pa.RecordBatch handlers added back using lazy arrow_hasher resolution to break the circular dep (ArrowTableSemanticHasher now accepts optional arg) - context_schema.json: removed semantic_registry property, renamed type_handler_registry -> python_type_semantic_hasher_registry - versioned_hashers.py: get_versioned_semantic_arrow_hasher() now sources both type_converter and semantic_hasher from default context via resolve_context() Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-based hashing system - Deleted src/orcapod/semantic_types/semantic_registry.py - Deleted src/orcapod/semantic_types/semantic_struct_converters.py - Removed SemanticTypeRegistry export from semantic_types/__init__.py - Removed SemanticStructConverterProtocol from protocols/semantic_types_protocols.py - Deleted tests/test_hashing/test_file_hashing_consistency.py (used SemanticArrowHasher) - Deleted tests/test_semantic_types/ directory (tested deleted classes) - Updated docstrings/comments to remove old class name references - ArrowTableSemanticHasher: made arrow_hasher optional with lazy context resolution to break the circular dep (registry -> ArrowTableSemanticHasher -> arrow_hasher -> registry) - context_schema.json: updated descriptions and examples to use new class names Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…est, fix stale log message - get_default_arrow_hasher(): remove broken set_cacher() call and cache_file_hash param; StarfixArrowHasher has no set_cacher method. Replaced with a simple delegate to get_default_context().arrow_hasher. - semantic_hasher.py: update stale log message from SemanticHasherProtocol (non-strict) to SemanticAwarePythonHasher (non-strict) with more descriptive text. - test_extension_type_hashing.py: add test_unregistered_python_type_passes_through to TestSemanticHashingVisitorExtension covering the branch where extension type is recognized but has no semantic hasher registered. Note: Fix 2 (remove pa.Table/pa.RecordBatch from v0.1.json) was not applied because Datagram.identity_structure() explicitly depends on ArrowTableSemanticHasher being registered to hash pa.Table objects (documented in datagram.py docstring). Removing these entries breaks 1 test (test_merge_join) and the fundamental design. The lazy context resolution in ArrowTableSemanticHasher._get_arrow_hasher() already handles the circular dependency concern raised in the review. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
This PR removes the legacy shape-based “semantic type” system and migrates Arrow semantic hashing to dispatch on pa.ExtensionType, using the extension-type system as the source of truth. It also tightens/renames the Python semantic hashing APIs (handlers → semantic hashers) and updates the default v0.1 data context wiring accordingly.
Changes:
- Deleted the old
SemanticTypeRegistry+ struct-signature semantic converters and their test suite. - Reworked Arrow semantic hashing to hash extension-typed column values via a new
visit_extension()hook, and wiredStarfixArrowHasherto usetype_converter + semantic_hasher. - Renamed/standardized semantic hashing interfaces (
TypeHandler*→PythonTypeSemanticHasher*,handle()→hash() -> ContentHash) and updated contexts/tests.
Reviewed changes
Copilot reviewed 35 out of 35 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_semantic_types/test_uuid_struct_converter.py | Deleted legacy semantic-struct UUID converter tests. |
| tests/test_semantic_types/test_upath_struct_converter.py | Deleted legacy semantic-struct UPath converter tests. |
| tests/test_semantic_types/test_universal_converter.py | Deleted legacy “semantic_types” test coverage (converter now driven by extension types elsewhere). |
| tests/test_semantic_types/test_semantic_struct_converters.py | Deleted tests for legacy SemanticStructConverterBase. |
| tests/test_semantic_types/test_semantic_registry.py | Deleted tests for legacy SemanticTypeRegistry. |
| tests/test_semantic_types/test_schema_arrow_equality.py | Deleted schema/logical-equality tests tied to the removed semantic registry. |
| tests/test_semantic_types/test_pydata_utils.py | Deleted tests for legacy semantic_types pydata utils module usage. |
| tests/test_semantic_types/test_path_struct_converter.py | Deleted legacy semantic-struct Path converter tests. |
| tests/test_hashing/test_uuid_handler.py | Updated UUID hashing tests to use the new semantic hasher API. |
| tests/test_hashing/test_starfix_arrow_hasher.py | Updated StarfixArrowHasher tests to use default context wiring (type_converter + semantic_hasher). |
| tests/test_hashing/test_semantic_hasher.py | Updated comprehensive semantic-hasher tests for renamed protocols/registry and hash() semantics. |
| tests/test_hashing/test_file_hashing_consistency.py | Deleted integration tests that depended on the removed struct-based path hashing. |
| tests/test_hashing/test_extension_type_hashing.py | Added tests validating extension-type dispatch and hash-token encoding/passthrough behavior. |
| superpowers/specs/2026-06-24-plt-1660-hard-cut-extension-type-hashing.md | Added detailed design/spec doc for the migration and API renames. |
| src/orcapod/semantic_types/semantic_struct_converters.py | Deleted legacy struct-based semantic converters. |
| src/orcapod/semantic_types/semantic_registry.py | Deleted legacy semantic registry implementation. |
| src/orcapod/semantic_types/init.py | Removed legacy SemanticTypeRegistry export. |
| src/orcapod/protocols/semantic_types_protocols.py | Removed SemanticStructConverterProtocol (legacy). |
| src/orcapod/protocols/hashing_protocols.py | Replaced TypeHandler protocol with PythonTypeSemanticHasherProtocol.hash() -> ContentHash and renamed registry accessor. |
| src/orcapod/hashing/visitors.py | Added extension-type dispatch and rewrote SemanticHashingVisitor to hash extension-typed values. |
| src/orcapod/hashing/versioned_hashers.py | Updated versioned factories to build StarfixArrowHasher from the default context. |
| src/orcapod/hashing/semantic_hashing/type_handler_registry.py | Renamed registry and methods to semantic-hasher terminology. |
| src/orcapod/hashing/semantic_hashing/semantic_hasher.py | Renamed semantic hasher and simplified dispatch to single-step semantic_hasher.hash(). |
| src/orcapod/hashing/semantic_hashing/content_identifiable_mixin.py | Updated mixin docs/types for renamed semantic hasher class. |
| src/orcapod/hashing/semantic_hashing/builtin_handlers.py | Renamed built-ins to *SemanticHasher and updated to return ContentHash. |
| src/orcapod/hashing/semantic_hashing/init.py | Updated re-exports for renamed semantic hashing classes/functions. |
| src/orcapod/hashing/defaults.py | Renamed default-registry accessor and simplified arrow-hasher getter. |
| src/orcapod/hashing/arrow_hashers.py | Removed legacy hasher and rewired StarfixArrowHasher to semantic-visit extension types. |
| src/orcapod/hashing/init.py | Updated hashing public API exports for renamed types/registry/protocols. |
| src/orcapod/extension_types/registry.py | Removed outdated doc reference to SemanticTypeRegistry. |
| src/orcapod/core/datagrams/datagram.py | Updated docs to reference ArrowTableSemanticHasher. |
| src/orcapod/contexts/data/v0.1.json | Removed semantic_registry; added python_type_semantic_hasher_registry; wired arrow_hasher to type_converter + semantic_hasher. |
| src/orcapod/contexts/data/schemas/context_schema.json | Removed semantic_registry; renamed registry property; updated example wiring. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for item in data: | ||
| current_element_type, processed_item = self.visit(element_type, item) | ||
| processed_elements.append(processed_item) | ||
|
|
||
| # Use the first non-None element to determine new element type | ||
| if new_element_type is None: | ||
| new_element_type = current_element_type | ||
|
|
||
| # If list was empty or all None, keep original element type | ||
| if new_element_type is None: | ||
| new_element_type = element_type |
There was a problem hiding this comment.
Partially fixed in commit d29079c — changed if new_element_type is None: to if new_element_type is None and processed_item is not None: so the element type is inferred from the first non-null processed element rather than the first element regardless of nullness.
The deeper fix (determining the output element type from the Arrow type alone via a type-probe before the data loop, rather than inferring from data values at all) requires a coupled change to visit_extension null handling and has been tracked in PLT-1809 to be addressed when composite extension type support (List/Struct logical types) lands.
| self, struct_type: "pa.StructType", data: dict | None | ||
| ) -> tuple["pa.StructType", dict]: | ||
| """ | ||
| Helper method to recursively process struct fields. | ||
|
|
||
| This is the default behavior for regular (non-semantic) structs. | ||
| """ | ||
| """Recursively process struct fields. Default behavior for regular structs.""" | ||
| if data is None: | ||
| return struct_type, None |
There was a problem hiding this comment.
Fixed in commit d29079c — updated the return annotation from tuple["pa.StructType", dict] to tuple["pa.StructType", dict | None] in both the base class _visit_struct_fields and the SemanticHashingVisitor override. Both already returned (struct_type, None) on null input.
| self, list_type: "pa.ListType", data: list | None | ||
| ) -> tuple["pa.DataType", list]: | ||
| """ | ||
| Helper method to recursively process list elements. | ||
|
|
||
| This is the default behavior for lists. | ||
| """ | ||
| """Recursively process list elements.""" | ||
| if data is None: | ||
| return list_type, None |
There was a problem hiding this comment.
Fixed in commit d29079c — updated the return annotation from tuple["pa.DataType", list] to tuple["pa.DataType", list | None] in the base class _visit_list_elements. The function already returned (list_type, None) when data is None; the annotation now accurately reflects that.
| hash_bytes = ( | ||
| type_name.encode("ascii") | ||
| + b"::" | ||
| + content_hash.to_prefixed_digest() | ||
| ) |
There was a problem hiding this comment.
Fixed in commit d29079c — changed encode("ascii") to encode("utf-8"). UTF-8 is a strict superset of ASCII so this is a no-op for all current built-in extension names, but correctly handles user-defined non-ASCII extension names without raising UnicodeEncodeError. The b"::" separator remains unambiguous in UTF-8 since non-ASCII bytes always have the high bit set and can never produce the two-byte sequence 0x3a 0x3a.
…nnotations, list element type inference, always register ArrowTableSemanticHasher
Review response — commit d29079cThanks for the thorough review. All 5 comments addressed:
All 3717 tests pass. |
…return Any Handlers now return a representative Python structure instead of a ContentHash. SemanticAwarePythonHasher.hash_object() feeds the result back into hash_object() for final hashing, treating a returned ContentHash as a terminal (no re-hashing). Simple built-in handlers (UUID, Bytes, Function, TypeObject, SpecialForm, GenericAlias, UnionType) are simplified to return plain Python values/structures. Semantic handlers that compute content-based hashes from external data (Path, UPath, ArrowTable) continue to return ContentHash directly, which short-circuits hashing as before. Hash values are preserved: the extra hash_object() call is a no-op for the simple handlers since the structure they return is identical to what they previously delegated to hash_object() internally. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…peHandler, hash() → handle() The protocol is now called PythonTypeHandler with a handle() method, more clearly reflecting its role as a type-specific handler that returns a representative Python structure rather than computing a ContentHash directly. All built-in handlers, the registry, the dispatch in SemanticAwarePythonHasher, and all test helpers are updated accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ticHasherRegistry → PythonTypeHandlerRegistry Mechanical rename across all source files, JSON configs, and tests: - PathSemanticHasher → PathHandler, UPathSemanticHasher → UPathHandler, UUIDSemanticHasher → UUIDHandler, BytesSemanticHasher → BytesHandler, FunctionSemanticHasher → FunctionHandler, TypeObjectSemanticHasher → TypeObjectHandler, SpecialFormSemanticHasher → SpecialFormHandler, GenericAliasSemanticHasher → GenericAliasHandler, UnionTypeSemanticHasher → UnionTypeHandler, ArrowTableSemanticHasher → ArrowTableHandler, SchemaSemanticHasher → SchemaHandler - register_builtin_python_type_semantic_hashers → register_builtin_python_type_handlers - PythonTypeSemanticHasherRegistry → PythonTypeHandlerRegistry - BuiltinPythonTypeSemanticHasherRegistry → BuiltinPythonTypeHandlerRegistry - get_default_python_type_semantic_hasher_registry → get_default_python_type_handler_registry - type_semantic_hasher_registry param/property → type_handler_registry - JSON config keys and _class values updated accordingly No logic changes. All 3717 tests pass.
…thonHasher in comments
| "description": "ObjectSpec for the file content hasher (used by PathContentHandler)" | ||
| "description": "ObjectSpec for the file content hasher (used by PathSemanticHasher)" | ||
| }, | ||
| "function_info_extractor": { |
There was a problem hiding this comment.
shouldn't this be changed to "function_semantic_hasher"?
There was a problem hiding this comment.
Fixed in commit 596333b — renamed the top-level named component from function_info_extractor to function_semantic_hasher in both contexts/data/v0.1.json and contexts/data/schemas/context_schema.json. All three {"_ref": "function_info_extractor"} entries in the registry handler configs were updated to {"_ref": "function_semantic_hasher"} accordingly.
| """Hasher for ``uuid.UUID`` objects — hashes the raw 16-byte binary representation.""" | ||
|
|
||
| def hash(self, obj: Any, hasher: "SemanticAwarePythonHasher") -> ContentHash: | ||
| return hasher.hash_object(obj.bytes) |
There was a problem hiding this comment.
for UUID, the bytes content should be directly returned as a ContentHash with kind 'uuid' (or any more specific info if available)
There was a problem hiding this comment.
Deferred — filed PLT-1828 to track this. Changing UUIDHandler to return ContentHash("uuid", obj.bytes) directly would alter recorded hash values and is closely related to the broader bytes/JSON representation strategy being revisited in PLT-1812. The two will be addressed together to avoid a split migration.
| return obj.hex() | ||
| raise TypeError(f"BytesHandler: expected bytes or bytearray, got {type(obj)!r}") | ||
|
|
||
| return hasher.hash_object(obj.hex()) |
There was a problem hiding this comment.
Shouldn't we just hash the bytes directly?
There was a problem hiding this comment.
Deferred — this is the same class of issue tracked in PLT-1812 (JSON-based hashing limitation with bytes representation) and PLT-1828 (UUID). All three — bytes, bytearray, and UUID — will be revisited together as part of the unified object-hashing strategy redesign to avoid a piecemeal hash-value migration.
| [{"_type": "pyarrow.Table"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.ArrowTableHandler", "_config": {"arrow_hasher": {"_ref": "arrow_hasher"}}}], | ||
| [{"_type": "pyarrow.RecordBatch"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.ArrowTableHandler", "_config": {"arrow_hasher": {"_ref": "arrow_hasher"}}}] | ||
| [{"_type": "builtins.bytes"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.BytesHandler", "_config": {}}], | ||
| [{"_type": "builtins.bytearray"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.BytesHandler", "_config": {}}], |
There was a problem hiding this comment.
Let's add a follow up issue to make it such that the same handler can be registerd to multiple target classes and make use of MRO-based matching system already used many other places in the codebase.
There was a problem hiding this comment.
Filed PLT-1827 to track this. No code change in this PR.
| "_class": "orcapod.hashing.arrow_hashers.StarfixArrowHasher", | ||
| "_config": { | ||
| "hasher_id": "arrow_v0.1", | ||
| "type_converter": {"_ref": "type_converter"}, |
There was a problem hiding this comment.
The fact arrow_hasher takes in type_converter and sematnic_hasher as its construtor argument actually makes arrow_hasher and sematic_hashser relationship circular in the default context. This strongly suggests we should unlink the circle by making one of them instantiate WITHOUT the other in the constructor. Rather, it should "optionally" accept the other (e.g. semantic_hasher) when invoking method on the arrow hasher.
There was a problem hiding this comment.
Filed PLT-1826 to track the decoupling. The likely fix is to remove type_converter and semantic_hasher from StarfixArrowHasher.__init__ and accept them only at method-invocation time (e.g. hash_table(table, *, type_converter=None, semantic_hasher=None)), resolving lazily from get_default_context() when omitted. No code change in this PR.
…er in v0.1 config
Review round summaryFixed (commit 596333b):
Deferred with dedicated issues:
|
| The MRO of ``type(obj)`` is walked from most-derived to least-derived | ||
| (i.e. the object's own class first, then its bases). The first | ||
| match found in the registry is returned. | ||
| def get_semantic_hasher(self, obj: Any) -> "PythonTypeHandler | None": |
There was a problem hiding this comment.
rename this back to get_handler
There was a problem hiding this comment.
Done in commit d71bf19 — renamed get_semantic_hasher back to get_handler. Updated all call sites in semantic_hasher.py, visitors.py, and both test files.
|
|
||
| Useful when the caller already has the type and wants to check | ||
| registration without constructing a dummy instance. | ||
| def get_semantic_hasher_for_type( |
There was a problem hiding this comment.
rename back to get_handler_for_type
There was a problem hiding this comment.
Done in commit d71bf19 — renamed get_semantic_hasher_for_type back to get_handler_for_type. Updated all call sites including the internal call inside has_handler and the test files.
| """ | ||
| Return True if a handler is registered for *target_type* or any of | ||
| its MRO ancestors. | ||
| def has_semantic_hasher(self, target_type: type) -> bool: |
There was a problem hiding this comment.
rename back to has_handler
There was a problem hiding this comment.
Done in commit d71bf19 — renamed has_semantic_hasher back to has_handler. Updated all call sites in visitors.py and both test files.
| hasher_id: str = _CURRENT_SEMANTIC_HASHER_ID, | ||
| strict: bool = True, | ||
| type_handler_registry: "hp.TypeHandlerRegistry | None" = None, # type: ignore[name-defined] | ||
| type_handler_registry: "Any | None" = None, |
There was a problem hiding this comment.
why does this have to be "Any"?
There was a problem hiding this comment.
Fixed in commit d71bf19 — the Any was a placeholder while the right abstraction was unclear. Now that HandlerRegistryProtocol is defined in hashing_protocols.py (see comment below), the parameter is typed as hp.HandlerRegistryProtocol | None. The unused Any import was also removed.
| def handle(self, obj: Any, hasher: "SemanticHasherProtocol") -> Any: | ||
| """ | ||
| Convert *obj* into a value that ``hash_object`` can process. | ||
| def handle(self, obj: Any, hasher: "SemanticAwarePythonHasher") -> Any: |
There was a problem hiding this comment.
We should create and use protocol that corresponds to SemanticAwarePythonHasher -- otherwise we are creating unnecessary concrete type dependency here.
There was a problem hiding this comment.
Fixed in commit d71bf19 — PythonTypeHandler.handle() now accepts SemanticHasherProtocol instead of the concrete SemanticAwarePythonHasher. The SemanticAwarePythonHasher import was removed from the TYPE_CHECKING block in hashing_protocols.py entirely.
There was a problem hiding this comment.
Already addressed — PythonTypeHandlerProtocol.handle() in hashing_protocols.py uses hasher: "SemanticHasherProtocol" as the parameter type (line 63 of the protocol definition). The concrete SemanticAwarePythonHasher is never referenced there.
| @property | ||
| def type_handler_registry(self) -> "TypeHandlerRegistry": | ||
| """Return the TypeHandlerRegistry used by this hasher.""" | ||
| def type_handler_registry(self) -> "PythonTypeHandlerRegistry": |
There was a problem hiding this comment.
Should PythonTypeHandlerRegistry here be defined in terms of Protocol that corresponds to HandlerRegistry instead?
There was a problem hiding this comment.
Done in commit d71bf19 — introduced HandlerRegistryProtocol in hashing_protocols.py that exposes get_handler(), get_handler_for_type(), and has_handler(). SemanticHasherProtocol.type_handler_registry now returns HandlerRegistryProtocol instead of the concrete PythonTypeHandlerRegistry. The PythonTypeHandlerRegistry import was removed from the TYPE_CHECKING block in hashing_protocols.py. The versioned_hashers.py Any annotation was also updated to HandlerRegistryProtocol | None as a result.
There was a problem hiding this comment.
Already addressed — SemanticHasherProtocol.type_handler_registry returns HandlerRegistryProtocol (line 123 of hashing_protocols.py). The concrete PythonTypeHandlerRegistry is not referenced in the protocol definition.
…ol, decouple type annotations - PythonTypeHandlerRegistry: rename get_semantic_hasher → get_handler, get_semantic_hasher_for_type → get_handler_for_type, has_semantic_hasher → has_handler; update all call sites - hashing_protocols: add HandlerRegistryProtocol abstracting over the concrete registry; SemanticHasherProtocol.type_handler_registry now returns HandlerRegistryProtocol instead of PythonTypeHandlerRegistry; PythonTypeHandler.handle() now uses SemanticHasherProtocol instead of the concrete SemanticAwarePythonHasher; remove concrete-class imports from TYPE_CHECKING block - versioned_hashers: type type_handler_registry param as HandlerRegistryProtocol | None instead of Any | None; drop unused Any import - Update test_hashing.py and test_semantic_hasher.py for renamed methods Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Round 2 review — all comments addressed in commit d71bf19
All 4269 tests pass (56 skipped, 13 xfailed). |
| the single source of truth for versioned component configuration. | ||
| """ | ||
|
|
||
| def __init__( |
There was a problem hiding this comment.
shouldn't this be SemanticHasherProtocol?
There was a problem hiding this comment.
Fixed — content_identifiable_mixin.py now imports and uses SemanticHasherProtocol (from orcapod.protocols.hashing_protocols) everywhere the concrete SemanticAwarePythonHasher was referenced: the __init__ parameter type, the _get_hasher() return type, and all docstring references.
| self, | ||
| hasher_id: str, | ||
| type_handler_registry: TypeHandlerRegistry | None = None, | ||
| type_handler_registry: PythonTypeHandlerRegistry | None = None, |
There was a problem hiding this comment.
can this not be the protocol?
There was a problem hiding this comment.
Fixed — SemanticAwarePythonHasher.__init__ now accepts type_handler_registry: HandlerRegistryProtocol | None = None. The stored _registry attribute is typed as HandlerRegistryProtocol so the concrete PythonTypeHandlerRegistry (which satisfies the protocol) still works, as does any other conforming registry.
| @property | ||
| def type_handler_registry(self) -> TypeHandlerRegistry: | ||
| """Return the ``TypeHandlerRegistry`` used by this hasher.""" | ||
| def type_handler_registry(self) -> PythonTypeHandlerRegistry: |
There was a problem hiding this comment.
If we are going to take in protocol, this should be protocol being returned
There was a problem hiding this comment.
Fixed — the type_handler_registry property now has return type HandlerRegistryProtocol (imported from orcapod.protocols.hashing_protocols), consistent with the parameter change in __init__.
| from orcapod.protocols.hashing_protocols import ( | ||
| ArrowHasherProtocol, | ||
| TypeHandlerProtocol, | ||
| PythonTypeHandler, |
There was a problem hiding this comment.
Explicitly call this "PythonTypeHandlerProtocol"
There was a problem hiding this comment.
Fixed — renamed PythonTypeHandler → PythonTypeHandlerProtocol in hashing_protocols.py. Updated all references across type_handler_registry.py, semantic_hasher.py, builtin_handlers.py, hashing/__init__.py, and semantic_hashing/__init__.py. As a related sweep, CallableWithPod in function_pod.py was also renamed to CallableWithPodProtocol to satisfy the project-wide 'all Protocol subclasses end with Protocol' convention. The strict-mode error message was also refined to read "no implementation of PythonTypeHandlerProtocol registered".
…ncrete types - Rename PythonTypeHandler → PythonTypeHandlerProtocol everywhere: class definition in hashing_protocols.py, all type annotations in type_handler_registry.py, hashing/__init__.py export, and all docstring references across builtin_handlers.py and semantic_hashing/__init__.py - Rename CallableWithPod → CallableWithPodProtocol in function_pod.py - SemanticAwarePythonHasher.__init__ now accepts HandlerRegistryProtocol | None instead of PythonTypeHandlerRegistry | None; drop concrete-class import - SemanticAwarePythonHasher.type_handler_registry property now returns HandlerRegistryProtocol instead of PythonTypeHandlerRegistry - ContentIdentifiableMixin now imports and uses SemanticHasherProtocol instead of the concrete SemanticAwarePythonHasher for __init__ param and _get_hasher return type - Update strict-mode error messages to say "no implementation of PythonTypeHandlerProtocol registered"; update matching test assertions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Round 3 review — changes madeAll four comments addressed in commit ContentIdentifiableMixin uses SemanticHasherProtocol ( SemanticAwarePythonHasher accepts / returns HandlerRegistryProtocol ( Rename PythonTypeHandler → PythonTypeHandlerProtocol ( Bonus: enforce project-wide Protocol naming convention Refined strict-mode error message All 4269 tests pass, 56 skipped, 13 xfailed. |
| """Hasher for ``types.UnionType`` objects (Python 3.10+ ``X | Y`` syntax).""" | ||
|
|
||
| def handle(self, obj: Any, hasher: "SemanticHasherProtocol") -> Any: | ||
| def handle(self, obj: Any, hasher: "SemanticAwarePythonHasher") -> Any: |
There was a problem hiding this comment.
Shouldn't this also take in a protocol?
There was a problem hiding this comment.
Fixed — UnionTypeHandler.handle() now takes hasher: "SemanticHasherProtocol" (along with all other built-in handler handle() methods — see the "make all handle take in a protocol" comment).
| from orcapod.contexts import get_default_context | ||
| return get_default_context().arrow_hasher # type: ignore[return-value] | ||
|
|
||
| def handle(self, obj: Any, hasher: "SemanticAwarePythonHasher") -> ContentHash: |
There was a problem hiding this comment.
make this take in protocol
There was a problem hiding this comment.
Fixed — ArrowTableHandler.handle() now takes hasher: "SemanticHasherProtocol". Addressed together with all other handle() method signatures.
| """ | ||
|
|
||
| def handle(self, obj: Any, hasher: "SemanticHasherProtocol") -> Any: | ||
| def handle(self, obj: Any, hasher: "SemanticAwarePythonHasher") -> Any: |
There was a problem hiding this comment.
make all handle take in a protocol
There was a problem hiding this comment.
Fixed — all 11 handle() method signatures across every built-in handler class now use hasher: "SemanticHasherProtocol". The concrete SemanticAwarePythonHasher import has been removed from the TYPE_CHECKING block entirely; SemanticHasherProtocol is imported from orcapod.protocols.hashing_protocols instead.
|
|
||
| def register_builtin_handlers( | ||
| registry: "TypeHandlerRegistry", | ||
| def register_builtin_python_type_handlers( |
There was a problem hiding this comment.
this registry should be protocol
There was a problem hiding this comment.
Fixed — register_builtin_python_type_handlers() now takes registry: "HandlerRegistryProtocol" (imported from orcapod.protocols.hashing_protocols). The PythonTypeHandlerRegistry import in the TYPE_CHECKING block has been removed. The docstring was updated to match.
| With an injected hasher (e.g. in tests):: | ||
|
|
||
| hasher = BaseSemanticHasher(hasher_id="test", strict=True) | ||
| hasher = SemanticHasherProtocol(hasher_id="test", strict=True) |
There was a problem hiding this comment.
This makes no sense -- you cannot be instantiating protocol.
There was a problem hiding this comment.
Fixed — the docstring example now shows from orcapod.hashing.semantic_hashing.semantic_hasher import SemanticAwarePythonHasher followed by hasher = SemanticAwarePythonHasher(hasher_id="test", strict=True). SemanticHasherProtocol is a structural Protocol and cannot be instantiated directly.
- Replace all `handle()` hasher params with SemanticHasherProtocol (was SemanticAwarePythonHasher) across all 11 builtin handler classes - Change register_builtin_python_type_handlers() registry param from PythonTypeHandlerRegistry to HandlerRegistryProtocol - Remove concrete-class imports from TYPE_CHECKING block; import SemanticHasherProtocol and HandlerRegistryProtocol from protocols module - Fix content_identifiable_mixin.py docstring example that incorrectly showed SemanticHasherProtocol being instantiated; replace with SemanticAwarePythonHasher (the concrete class) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Round 4 review — changes madeAll five comments addressed in commit All
Docstring example fixed ( Also replied to two comments from review 4566882270 ( All 4269 tests pass, 56 skipped, 13 xfailed. |
| def register_builtin_python_type_handlers( | ||
| registry: "HandlerRegistryProtocol", | ||
| file_hasher: Any = None, | ||
| function_info_extractor: Any = None, | ||
| arrow_hasher: "ArrowHasherProtocol | None" = None, |
There was a problem hiding this comment.
Fixed — added register() and __len__() to HandlerRegistryProtocol in hashing_protocols.py. The protocol now declares the full surface used by register_builtin_python_type_handlers(): lookup (get_handler, get_handler_for_type, has_handler), mutation (register), and size (__len__).
| """Tests for UUIDHandler hash() method behaviour. | ||
|
|
||
| Verifies that UUIDHandler returns the 16-byte binary representation of a | ||
| UUID, consistent with OrcaPod's canonical ``pa.binary(16)`` Arrow storage | ||
| format. | ||
| Verifies that UUIDHandler produces a ContentHash based on the 16-byte | ||
| binary representation of a UUID, consistent with OrcaPod's canonical | ||
| ``pa.binary(16)`` Arrow storage format. |
There was a problem hiding this comment.
Fixed — updated the module docstring from "Tests for UUIDHandler hash() method behaviour" to "Tests for UUIDHandler handle() dispatch via SemanticAwarePythonHasher". UUIDHandler implements handle(), not hash(), and the tests exercise SemanticAwarePythonHasher.hash_object() dispatch.
- Add register() and __len__() to HandlerRegistryProtocol so the protocol matches every method called on registry inside register_builtin_python_type_handlers(); previously HandlerRegistryProtocol only declared the lookup side of the interface (get_handler, get_handler_for_type, has_handler), leaving register() and len() untyped - Fix test_uuid_handler.py module docstring: s/hash() method behaviour/ handle() dispatch via SemanticAwarePythonHasher/ — UUIDHandler implements handle(), not hash(), and the tests exercise hash_object() dispatch Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Round 5 review (Copilot) — changes madeBoth comments addressed in commit
Test module docstring corrected ( All 4269 tests pass, 56 skipped, 13 xfailed. |
| ------- | ||
| SemanticHasherProtocol | ||
| A fully configured SemanticHasherProtocol instance. | ||
| Optional ``PythonTypeHandlerRegistry`` to inject. When None the |
There was a problem hiding this comment.
update this doc string to match the function input parameter types. Make sure the same is done everywhere
There was a problem hiding this comment.
Fixed in commit b63ff2d. Updated all four places where docstrings/annotations referenced concrete type names instead of protocol names:
versioned_hashers.py: summary lineSemanticAwarePythonHasher→SemanticHasherProtocol;type_handler_registryparam descriptionPythonTypeHandlerRegistry→HandlerRegistryProtocolarrow_hashers.py:semantic_hasherparam annotation and docstringSemanticAwarePythonHasher→SemanticHasherProtocol; updatedTYPE_CHECKINGimportvisitors.py: same forpython_hasherparam inSemanticHashingVisitordefaults.py: "owned by the activeSemanticAwarePythonHasher" → "owned by the activeSemanticHasherProtocol"
Update all docstrings and type annotations to consistently use protocol
types instead of concrete implementation types:
- versioned_hashers.py: fix summary ("SemanticAwarePythonHasher" →
"SemanticHasherProtocol") and type_handler_registry param description
("PythonTypeHandlerRegistry" → "HandlerRegistryProtocol") to match
the hp.HandlerRegistryProtocol annotation already on the parameter
- arrow_hashers.py: change semantic_hasher param annotation and docstring
from SemanticAwarePythonHasher to SemanticHasherProtocol; update
TYPE_CHECKING import accordingly
- visitors.py: same for python_hasher param in SemanticHashingVisitor
- defaults.py: update "owned by SemanticAwarePythonHasher" to
"owned by SemanticHasherProtocol" in get_default_python_type_handler_registry
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Round 6 review — response summaryAddressed eywalker's comment about docstring/annotation mismatches with a codebase-wide sweep. Changes made (commit b63ff2d)
The codebase now consistently uses protocol types ( |
Summary
SemanticTypeRegistrysystem (semantic_registry.py,semantic_struct_converters.py,SemanticStructConverterProtocol, alltests/test_semantic_types/files) — ~2,650 lines removedSemanticHashingVisitorto dispatch onpa.ExtensionType(via newvisit_extension()hook) instead of struct-field signature matching; hash tokens encoded as<type_name>::<method>:<digest>bytesTypeHandlerProtocol→PythonTypeSemanticHasherProtocol,TypeHandlerRegistry→PythonTypeSemanticHasherRegistry,BaseSemanticHasher→SemanticAwarePythonHasher,*Handler→*SemanticHasher); deletedSemanticArrowHasher, finalizedStarfixArrowHasheras the sole Arrow hasherv0.1.jsondefault context: removedsemantic_registry, addedpython_type_semantic_hasher_registry, wired bothtype_converterandsemantic_hasherrefs intoarrow_hasherTest plan
git grep SemanticTypeRegistry src/— zero hits (only changelog string)git grep TypeHandlerRegistry src/— zero hitsgit grep BaseSemanticHasher src/— zero hitsget_default_arrow_hasher()confirmed working (brokenset_cachercall fixed)tests/test_hashing/test_extension_type_hashing.py— dispatch order, encoding format, null passthrough, unregistered extension type passthroughLinear
Fixes PLT-1660