feat(PLT-1660): hard cut — delete old semantic type system, wire extension type hashing#182

Merged: eywalker merged 33 commits into extension-type-system from eywalker/plt-1660-hard-cut-delete-old-semantic-type-system-and-wire-in on Jun 25, 2026
Conversation

kurodo3 Bot commented Jun 24, 2026
Contributor

Summary

  • Deleted the shape-based SemanticTypeRegistry system (semantic_registry.py, semantic_struct_converters.py, SemanticStructConverterProtocol, all tests/test_semantic_types/ files) — ~2,650 lines removed
  • Rewired SemanticHashingVisitor to dispatch on pa.ExtensionType (via new visit_extension() hook) instead of struct-field signature matching; hash tokens encoded as <type_name>::<method>:<digest> bytes
  • Renamed all hashing protocol/registry/hasher classes to cleaner names (TypeHandlerProtocol → PythonTypeSemanticHasherProtocol, TypeHandlerRegistry → PythonTypeSemanticHasherRegistry, BaseSemanticHasher → SemanticAwarePythonHasher, *Handler → *SemanticHasher); deleted SemanticArrowHasher, finalized StarfixArrowHasher as the sole Arrow hasher
  • Updated v0.1.json default context: removed semantic_registry, added python_type_semantic_hasher_registry, wired both type_converter and semantic_hasher refs into arrow_hasher

Test plan

  • 3717 tests pass, 56 skipped, 6 xfailed — zero failures
  • git grep SemanticTypeRegistry src/ — zero hits (only changelog string)
  • git grep TypeHandlerRegistry src/ — zero hits
  • git grep BaseSemanticHasher src/ — zero hits
  • get_default_arrow_hasher() confirmed working (broken set_cacher call fixed)
  • New tests: tests/test_hashing/test_extension_type_hashing.py — dispatch order, encoding format, null passthrough, unregistered extension type passthrough

Linear

Fixes PLT-1660

kurodo3 Bot and others added 21 commits June 24, 2026 10:46
Covers: visitor visit_extension dispatch, SemanticHashingVisitor rewrite,
StarfixArrowHasher constructor update, renames (BaseSemanticHasher ->
SemanticAwarePythonHasher, TypeHandlerRegistry -> PythonTypeHandlerRegistry),
v0.1.json / context_schema.json changes, and deletion plan for old
SemanticTypeRegistry / SemanticStructConverter files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Folds in: TypeHandlerProtocol.handle()->Any tightened to
PythonTypeSemanticHasherProtocol.hash()->ContentHash; all builtin
handlers renamed to *SemanticHasher and updated to return ContentHash
directly; registry renamed to PythonTypeSemanticHasherRegistry with
updated method names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nd colon namespace

Extension name dots replaced with colons (orcapod.path -> orcapod:path);
"::" used as separator between type prefix and hash so the boundary is
unambiguous (to_prefixed_digest uses only single ":").

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
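
The naming scheme described in this commit can be sketched roughly as follows. This is an illustration only, not the code in visitors.py: `to_namespace`, `make_token`, and `split_token` are hypothetical helpers, and the hex SHA-256 digest with a `sha256` method label is an assumption standing in for whatever `to_prefixed_digest()` actually returns.

```python
import hashlib
from typing import Tuple


def to_namespace(ext_name: str) -> str:
    """Replace dots with colons, e.g. 'orcapod.path' becomes 'orcapod:path'."""
    return ext_name.replace(".", ":")


def make_token(ext_name: str, payload: bytes) -> bytes:
    """Build '<type_name>::<method>:<digest>' as bytes (hex SHA-256 is an assumption)."""
    prefixed_digest = b"sha256:" + hashlib.sha256(payload).hexdigest().encode("ascii")
    return to_namespace(ext_name).encode("utf-8") + b"::" + prefixed_digest


def split_token(token: bytes) -> Tuple[str, str, str]:
    """Split on the first '::', then on the first ':' of what remains."""
    type_name, _, rest = token.partition(b"::")
    method, _, digest = rest.partition(b":")
    return type_name.decode("utf-8"), method.decode("ascii"), digest.decode("ascii")


token = make_token("orcapod.path", b"file contents")
type_name, method, digest = split_token(token)
```

Because the type name uses only single colons and the prefixed digest uses a single colon, the first `::` in the token marks the boundary between the two parts.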
…emanticHasherProtocol, tighten hash() → ContentHash
…ythonHasher, simplify dispatch

Also updates defaults.py: replaces get_default_type_handler_registry with
get_default_python_type_semantic_hasher_registry to match the new registry API.
…ix hash() return type

- Renamed _DummyHandler class to _DummySemanticHasher for clarity
- Changed hash() method to return ContentHash via hasher.hash_object() instead of raw string
- Updated all 13 usages throughout the test class

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Visitor for extension types

- Add visit_extension() to ArrowTypeDataVisitor with passthrough default
- visit() now checks for pa.ExtensionType BEFORE struct check to prevent
  extension types with struct storage being swallowed by visit_struct
- Rewrite SemanticHashingVisitor to use type_converter + python_hasher
  instead of semantic_registry; resolves extension types via the logical
  type registry and produces pa.large_binary() tokens of the form
  <ext_name>::<method>:<digest>
- Update StarfixArrowHasher constructor to accept type_converter instead
  of semantic_registry; python_hasher resolved lazily from context to
  break the circular dependency in the JSON spec
- Update v0.1.json component ordering so type_converter is created before
  arrow_hasher (which now requires it)
- Update versioned_hashers.py, test_starfix_arrow_hasher.py, and
  test_semantic_registry.py to use the new API
- Add tests/test_hashing/test_extension_type_hashing.py with 6 tests
  covering dispatch routing, hash stability, null passthrough, and binary
  encoding format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… import

- Fix test_visit_dispatches_to_visit_extension_for_extension_types to use a real
  file (via tmp_path fixture) and call super() in visit_extension to validate the
  full dispatch chain
- Move deferred 'from typing import Any' to module-level import at top of
  visitors.py and use typing.Any in visit_extension method

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ArrowHasher constructor

- Deleted SemanticArrowHasher class (old struct-based arrow hasher)
- Renamed python_hasher parameter to semantic_hasher (required positional)
- Removed lazy resolution logic (_get_python_hasher) — semantic_hasher is now required
- Removed unused imports: arrow_serialization, arrow_utils, SemanticTypeRegistry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Added semantic_hasher=ctx.semantic_hasher to _make_hasher()
- Moved get_default_context import inside _make_hasher() (no top-level import needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…semantic_registry

- Rewrote v0.1.json: removed semantic_registry and type_handler_registry keys
- Added python_type_semantic_hasher_registry key with all type handlers
- arrow_hasher now wires in both type_converter and semantic_hasher refs
- pa.Table/pa.RecordBatch handlers added back using lazy arrow_hasher resolution
  to break the circular dep (ArrowTableSemanticHasher now accepts optional arg)
- context_schema.json: removed semantic_registry property, renamed
  type_handler_registry -> python_type_semantic_hasher_registry
- versioned_hashers.py: get_versioned_semantic_arrow_hasher() now sources both
  type_converter and semantic_hasher from default context via resolve_context()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-based hashing system

- Deleted src/orcapod/semantic_types/semantic_registry.py
- Deleted src/orcapod/semantic_types/semantic_struct_converters.py
- Removed SemanticTypeRegistry export from semantic_types/__init__.py
- Removed SemanticStructConverterProtocol from protocols/semantic_types_protocols.py
- Deleted tests/test_hashing/test_file_hashing_consistency.py (used SemanticArrowHasher)
- Deleted tests/test_semantic_types/ directory (tested deleted classes)
- Updated docstrings/comments to remove old class name references
- ArrowTableSemanticHasher: made arrow_hasher optional with lazy context resolution
  to break the circular dep (registry -> ArrowTableSemanticHasher -> arrow_hasher -> registry)
- context_schema.json: updated descriptions and examples to use new class names

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…est, fix stale log message

- get_default_arrow_hasher(): remove broken set_cacher() call and cache_file_hash param;
  StarfixArrowHasher has no set_cacher method. Replaced with a simple delegate to
  get_default_context().arrow_hasher.
- semantic_hasher.py: update stale log message from SemanticHasherProtocol (non-strict)
  to SemanticAwarePythonHasher (non-strict) with more descriptive text.
- test_extension_type_hashing.py: add test_unregistered_python_type_passes_through to
  TestSemanticHashingVisitorExtension covering the branch where extension type is
  recognized but has no semantic hasher registered.

Note: Fix 2 (remove pa.Table/pa.RecordBatch from v0.1.json) was not applied because
Datagram.identity_structure() explicitly depends on ArrowTableSemanticHasher being
registered to hash pa.Table objects (documented in datagram.py docstring). Removing
these entries breaks 1 test (test_merge_join) and the fundamental design. The lazy
context resolution in ArrowTableSemanticHasher._get_arrow_hasher() already handles
the circular dependency concern raised in the review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
codecov Bot commented Jun 24, 2026

Codecov Report

❌ Patch coverage is 90.90909% with 10 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...capod/hashing/semantic_hashing/builtin_handlers.py | 87.50% | 5 Missing ⚠️ |
| src/orcapod/hashing/visitors.py | 88.88% | 3 Missing ⚠️ |
| src/orcapod/hashing/defaults.py | 75.00% | 1 Missing ⚠️ |
| .../hashing/semantic_hashing/type_handler_registry.py | 88.88% | 1 Missing ⚠️ |

Copilot AI left a comment

Contributor

Pull request overview

This PR removes the legacy shape-based “semantic type” system and migrates Arrow semantic hashing to dispatch on pa.ExtensionType, using the extension-type system as the source of truth. It also tightens/renames the Python semantic hashing APIs (handlers → semantic hashers) and updates the default v0.1 data context wiring accordingly.

Changes:

  • Deleted the old SemanticTypeRegistry + struct-signature semantic converters and their test suite.
  • Reworked Arrow semantic hashing to hash extension-typed column values via a new visit_extension() hook, and wired StarfixArrowHasher to use type_converter + semantic_hasher.
  • Renamed/standardized semantic hashing interfaces (TypeHandler* → PythonTypeSemanticHasher*, handle() → hash() -> ContentHash) and updated contexts/tests.

Reviewed changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/test_semantic_types/test_uuid_struct_converter.py Deleted legacy semantic-struct UUID converter tests.
tests/test_semantic_types/test_upath_struct_converter.py Deleted legacy semantic-struct UPath converter tests.
tests/test_semantic_types/test_universal_converter.py Deleted legacy “semantic_types” test coverage (converter now driven by extension types elsewhere).
tests/test_semantic_types/test_semantic_struct_converters.py Deleted tests for legacy SemanticStructConverterBase.
tests/test_semantic_types/test_semantic_registry.py Deleted tests for legacy SemanticTypeRegistry.
tests/test_semantic_types/test_schema_arrow_equality.py Deleted schema/logical-equality tests tied to the removed semantic registry.
tests/test_semantic_types/test_pydata_utils.py Deleted tests for legacy semantic_types pydata utils module usage.
tests/test_semantic_types/test_path_struct_converter.py Deleted legacy semantic-struct Path converter tests.
tests/test_hashing/test_uuid_handler.py Updated UUID hashing tests to use the new semantic hasher API.
tests/test_hashing/test_starfix_arrow_hasher.py Updated StarfixArrowHasher tests to use default context wiring (type_converter + semantic_hasher).
tests/test_hashing/test_semantic_hasher.py Updated comprehensive semantic-hasher tests for renamed protocols/registry and hash() semantics.
tests/test_hashing/test_file_hashing_consistency.py Deleted integration tests that depended on the removed struct-based path hashing.
tests/test_hashing/test_extension_type_hashing.py Added tests validating extension-type dispatch and hash-token encoding/passthrough behavior.
superpowers/specs/2026-06-24-plt-1660-hard-cut-extension-type-hashing.md Added detailed design/spec doc for the migration and API renames.
src/orcapod/semantic_types/semantic_struct_converters.py Deleted legacy struct-based semantic converters.
src/orcapod/semantic_types/semantic_registry.py Deleted legacy semantic registry implementation.
src/orcapod/semantic_types/__init__.py Removed legacy SemanticTypeRegistry export.
src/orcapod/protocols/semantic_types_protocols.py Removed SemanticStructConverterProtocol (legacy).
src/orcapod/protocols/hashing_protocols.py Replaced TypeHandler protocol with PythonTypeSemanticHasherProtocol.hash() -> ContentHash and renamed registry accessor.
src/orcapod/hashing/visitors.py Added extension-type dispatch and rewrote SemanticHashingVisitor to hash extension-typed values.
src/orcapod/hashing/versioned_hashers.py Updated versioned factories to build StarfixArrowHasher from the default context.
src/orcapod/hashing/semantic_hashing/type_handler_registry.py Renamed registry and methods to semantic-hasher terminology.
src/orcapod/hashing/semantic_hashing/semantic_hasher.py Renamed semantic hasher and simplified dispatch to single-step semantic_hasher.hash().
src/orcapod/hashing/semantic_hashing/content_identifiable_mixin.py Updated mixin docs/types for renamed semantic hasher class.
src/orcapod/hashing/semantic_hashing/builtin_handlers.py Renamed built-ins to *SemanticHasher and updated to return ContentHash.
src/orcapod/hashing/semantic_hashing/__init__.py Updated re-exports for renamed semantic hashing classes/functions.
src/orcapod/hashing/defaults.py Renamed default-registry accessor and simplified arrow-hasher getter.
src/orcapod/hashing/arrow_hashers.py Removed legacy hasher and rewired StarfixArrowHasher to semantic-visit extension types.
src/orcapod/hashing/__init__.py Updated hashing public API exports for renamed types/registry/protocols.
src/orcapod/extension_types/registry.py Removed outdated doc reference to SemanticTypeRegistry.
src/orcapod/core/datagrams/datagram.py Updated docs to reference ArrowTableSemanticHasher.
src/orcapod/contexts/data/v0.1.json Removed semantic_registry; added python_type_semantic_hasher_registry; wired arrow_hasher to type_converter + semantic_hasher.
src/orcapod/contexts/data/schemas/context_schema.json Removed semantic_registry; renamed registry property; updated example wiring.

Comment on lines 138 to 145
for item in data:
    current_element_type, processed_item = self.visit(element_type, item)
    processed_elements.append(processed_item)

    # Use the first non-None element to determine new element type
    if new_element_type is None:
        new_element_type = current_element_type

# If list was empty or all None, keep original element type
if new_element_type is None:
    new_element_type = element_type

Contributor Author

Partially fixed in commit d29079c — changed if new_element_type is None: to if new_element_type is None and processed_item is not None: so the element type is inferred from the first non-null processed element rather than the first element regardless of nullness.

The deeper fix (determining the output element type from the Arrow type alone via a type-probe before the data loop, rather than inferring from data values at all) requires a coupled change to visit_extension null handling and has been tracked in PLT-1809 to be addressed when composite extension type support (List/Struct logical types) lands.
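
A minimal sketch of the partial fix described above, assuming type names are plain strings and that a `visit` callable returns a `(type, value)` pair. `visit_list_elements` and `toy_visit` are hypothetical stand-ins, not the methods in visitors.py.

```python
from typing import Any, Callable, List, Optional, Tuple

Visit = Callable[[str, Any], Tuple[str, Any]]


def visit_list_elements(
    element_type: str, data: Optional[List[Any]], visit: Visit
) -> Tuple[str, Optional[List[Any]]]:
    """Process list items, inferring the output type from the first non-null result."""
    if data is None:
        return element_type, None

    new_element_type = None
    processed = []
    for item in data:
        current_type, processed_item = visit(element_type, item)
        processed.append(processed_item)
        # Only a non-null processed item may decide the element type.
        if new_element_type is None and processed_item is not None:
            new_element_type = current_type

    # Empty or all-null list: keep the original element type.
    if new_element_type is None:
        new_element_type = element_type
    return new_element_type, processed


def toy_visit(element_type: str, item: Any) -> Tuple[str, Any]:
    """Hypothetical visitor: nulls pass through, values become binary tokens."""
    if item is None:
        return element_type, None
    return "large_binary", f"token({item})".encode("utf-8")


leading_null = visit_list_elements("path_ext", [None, "a.txt"], toy_visit)
all_null = visit_list_elements("path_ext", [None, None], toy_visit)
```

With the original condition, a leading null would have fixed the element type to the unconverted one even though a later element was converted.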

Comment on lines 110 to 114
    self, struct_type: "pa.StructType", data: dict | None
) -> tuple["pa.StructType", dict]:
-    """
-    Helper method to recursively process struct fields.
-
-    This is the default behavior for regular (non-semantic) structs.
-    """
+    """Recursively process struct fields. Default behavior for regular structs."""
    if data is None:
        return struct_type, None

Contributor Author

Fixed in commit d29079c — updated the return annotation from tuple["pa.StructType", dict] to tuple["pa.StructType", dict | None] in both the base class _visit_struct_fields and the SemanticHashingVisitor override. Both already returned (struct_type, None) on null input.

Comment on lines 128 to 132
    self, list_type: "pa.ListType", data: list | None
) -> tuple["pa.DataType", list]:
-    """
-    Helper method to recursively process list elements.
-
-    This is the default behavior for lists.
-    """
+    """Recursively process list elements."""
    if data is None:
        return list_type, None

Contributor Author

Fixed in commit d29079c — updated the return annotation from tuple["pa.DataType", list] to tuple["pa.DataType", list | None] in the base class _visit_list_elements. The function already returned (list_type, None) when data is None; the annotation now accurately reflects that.

Comment on lines +226 to +230
hash_bytes = (
    type_name.encode("ascii")
    + b"::"
    + content_hash.to_prefixed_digest()
)

Contributor Author

Fixed in commit d29079c — changed encode("ascii") to encode("utf-8"). UTF-8 is a strict superset of ASCII so this is a no-op for all current built-in extension names, but correctly handles user-defined non-ASCII extension names without raising UnicodeEncodeError. The b"::" separator remains unambiguous in UTF-8 since non-ASCII bytes always have the high bit set and can never produce the two-byte sequence 0x3a 0x3a.
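
The UTF-8 property relied on above can be checked with a short script. This is a sketch for illustration: the sample characters and the non-ASCII extension name are made up, and `sha256:abc` is a placeholder rather than real output of `to_prefixed_digest()`.

```python
def utf8_bytes_are_high(ch: str) -> bool:
    """True if every UTF-8 byte of a non-ASCII character has its high bit set."""
    return all(b >= 0x80 for b in ch.encode("utf-8"))


# Two-, three-, and four-byte characters; U+2237 (PROPORTION) even looks like '::'.
samples = ["\u00e9", "\u8def", "\U0001d6fc", "\u2237"]
results = [utf8_bytes_are_high(ch) for ch in samples]

# Hypothetical user-defined extension name containing non-ASCII characters.
name = "orcapod:\u8def\u5f84"
token = name.encode("utf-8") + b"::" + b"sha256:abc"

# The first b"::" is still the boundary, because no non-ASCII byte equals 0x3a.
type_name, _, rest = token.partition(b"::")
```

Encoding the same name with `"ascii"` raises `UnicodeEncodeError`, which is the failure the change avoids.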

Comment thread src/orcapod/hashing/semantic_hashing/builtin_handlers.py Outdated
kurodo3 Bot commented Jun 24, 2026
Contributor Author

Review response — commit d29079c

Thanks for the thorough review. All 5 comments addressed:

| # | Issue | Resolution |
|---|---|---|
| 1 | encode("ascii") can raise UnicodeEncodeError for non-ASCII extension names | Changed to encode("utf-8") in visitors.py |
| 2 | _visit_list_elements return annotation allows None but declared list | Updated to list \| None |
| 3 | _visit_struct_fields return annotation allows None but declared dict | Updated to dict \| None in both base class and SemanticHashingVisitor override |
| 4 | List element type inferred from first element even when None | Fixed: type now inferred from first non-null processed element. The deeper fix (determining output type from the Arrow type alone, before any data iteration) requires a coupled change to visit_extension null handling and has been tracked in PLT-1809 for when composite extension type support (List/Struct logical types) lands |
| 5 | register_builtin_python_type_semantic_hashers skips pa.Table/pa.RecordBatch when arrow_hasher is None | Removed the guard: ArrowTableSemanticHasher is now always registered; lazy resolution via get_default_context() handles the None case at call time rather than construction time |

All 3717 tests pass.

kurodo3 Bot and others added 2 commits June 24, 2026 22:18
…return Any

Handlers now return a representative Python structure instead of a
ContentHash. SemanticAwarePythonHasher.hash_object() feeds the result
back into hash_object() for final hashing, treating a returned
ContentHash as a terminal (no re-hashing).

Simple built-in handlers (UUID, Bytes, Function, TypeObject,
SpecialForm, GenericAlias, UnionType) are simplified to return plain
Python values/structures. Semantic handlers that compute content-based
hashes from external data (Path, UPath, ArrowTable) continue to return
ContentHash directly, which short-circuits hashing as before.

Hash values are preserved: the extra hash_object() call is a no-op for
the simple handlers since the structure they return is identical to what
they previously delegated to hash_object() internally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
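
The dispatch rule in this commit can be sketched as below. All classes here (`ContentHash`, `ToyHasher`, `UUIDHandler`, `PrecomputedHandler`, `Blob`) are simplified stand-ins written for illustration, with assumed fields and a JSON-based fallback; they are not the orcapod implementations.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ContentHash:
    """Stand-in for the project's ContentHash; the fields are assumptions."""
    method: str
    digest: bytes


@dataclass
class Blob:
    """Hypothetical type whose hash comes from external content."""
    data: bytes


class UUIDHandler:
    """Simple handler: returns a plain Python value for the hasher to process."""
    def handle(self, obj: uuid.UUID, hasher: "ToyHasher") -> Any:
        return obj.bytes.hex()


class PrecomputedHandler:
    """Semantic handler: returns a ContentHash, which is treated as terminal."""
    def handle(self, obj: Blob, hasher: "ToyHasher") -> Any:
        return ContentHash("sha256", hashlib.sha256(obj.data).digest())


class ToyHasher:
    def __init__(self) -> None:
        self.handlers = {uuid.UUID: UUIDHandler(), Blob: PrecomputedHandler()}

    def hash_object(self, obj: Any) -> ContentHash:
        if isinstance(obj, ContentHash):
            return obj  # terminal: no re-hashing
        handler = self.handlers.get(type(obj))
        if handler is not None:
            # Feed the handler's result back through hash_object().
            return self.hash_object(handler.handle(obj, self))
        encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
        return ContentHash("sha256", hashlib.sha256(encoded).digest())


hasher = ToyHasher()
u = uuid.UUID("12345678123456781234567812345678")
uuid_hash = hasher.hash_object(u)
blob_hash = hasher.hash_object(Blob(b"payload"))
```

In this sketch, hashing a UUID gives the same result as hashing the structure its handler returns, which mirrors the claim that the extra `hash_object()` call leaves hash values unchanged for the simple handlers.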
…peHandler, hash() → handle()

The protocol is now called PythonTypeHandler with a handle() method,
more clearly reflecting its role as a type-specific handler that returns
a representative Python structure rather than computing a ContentHash
directly.

All built-in handlers, the registry, the dispatch in
SemanticAwarePythonHasher, and all test helpers are updated accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kurodo3 Bot added 2 commits June 24, 2026 23:16
…ticHasherRegistry → PythonTypeHandlerRegistry

Mechanical rename across all source files, JSON configs, and tests:
- PathSemanticHasher → PathHandler, UPathSemanticHasher → UPathHandler,
  UUIDSemanticHasher → UUIDHandler, BytesSemanticHasher → BytesHandler,
  FunctionSemanticHasher → FunctionHandler, TypeObjectSemanticHasher → TypeObjectHandler,
  SpecialFormSemanticHasher → SpecialFormHandler,
  GenericAliasSemanticHasher → GenericAliasHandler,
  UnionTypeSemanticHasher → UnionTypeHandler,
  ArrowTableSemanticHasher → ArrowTableHandler,
  SchemaSemanticHasher → SchemaHandler
- register_builtin_python_type_semantic_hashers → register_builtin_python_type_handlers
- PythonTypeSemanticHasherRegistry → PythonTypeHandlerRegistry
- BuiltinPythonTypeSemanticHasherRegistry → BuiltinPythonTypeHandlerRegistry
- get_default_python_type_semantic_hasher_registry → get_default_python_type_handler_registry
- type_semantic_hasher_registry param/property → type_handler_registry
- JSON config keys and _class values updated accordingly

No logic changes. All 3717 tests pass.
"description": "ObjectSpec for the file content hasher (used by PathContentHandler)"
"description": "ObjectSpec for the file content hasher (used by PathSemanticHasher)"
},
"function_info_extractor": {

Contributor

shouldn't this be changed to "function_semantic_hasher"?

Contributor Author

Fixed in commit 596333b — renamed the top-level named component from function_info_extractor to function_semantic_hasher in both contexts/data/v0.1.json and contexts/data/schemas/context_schema.json. All three {"_ref": "function_info_extractor"} entries in the registry handler configs were updated to {"_ref": "function_semantic_hasher"} accordingly.

"""Hasher for ``uuid.UUID`` objects — hashes the raw 16-byte binary representation."""

def hash(self, obj: Any, hasher: "SemanticAwarePythonHasher") -> ContentHash:
    return hasher.hash_object(obj.bytes)

Contributor

for UUID, the bytes content should be directly returned as a ContentHash with kind 'uuid' (or any more specific info if available)

Contributor Author

Deferred — filed PLT-1828 to track this. Changing UUIDHandler to return ContentHash("uuid", obj.bytes) directly would alter recorded hash values and is closely related to the broader bytes/JSON representation strategy being revisited in PLT-1812. The two will be addressed together to avoid a split migration.

return obj.hex()
raise TypeError(f"BytesHandler: expected bytes or bytearray, got {type(obj)!r}")

return hasher.hash_object(obj.hex())

Contributor

Shouldn't we just hash the bytes directly?

Contributor Author

Deferred — this is the same class of issue tracked in PLT-1812 (JSON-based hashing limitation with bytes representation) and PLT-1828 (UUID). All three — bytes, bytearray, and UUID — will be revisited together as part of the unified object-hashing strategy redesign to avoid a piecemeal hash-value migration.

[{"_type": "pyarrow.Table"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.ArrowTableHandler", "_config": {"arrow_hasher": {"_ref": "arrow_hasher"}}}],
[{"_type": "pyarrow.RecordBatch"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.ArrowTableHandler", "_config": {"arrow_hasher": {"_ref": "arrow_hasher"}}}]
[{"_type": "builtins.bytes"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.BytesHandler", "_config": {}}],
[{"_type": "builtins.bytearray"}, {"_class": "orcapod.hashing.semantic_hashing.builtin_handlers.BytesHandler", "_config": {}}],

Contributor

Let's add a follow-up issue so that the same handler can be registered for multiple target classes and can use the MRO-based matching system already used in many other places in the codebase.

Contributor Author

Filed PLT-1827 to track this. No code change in this PR.

"_class": "orcapod.hashing.arrow_hashers.StarfixArrowHasher",
"_config": {
"hasher_id": "arrow_v0.1",
"type_converter": {"_ref": "type_converter"},

Contributor

Because arrow_hasher takes type_converter and semantic_hasher as constructor arguments, arrow_hasher and semantic_hasher depend on each other circularly in the default context. This strongly suggests breaking the cycle by letting one of them be instantiated WITHOUT the other in its constructor. Instead, the arrow hasher should "optionally" accept the other (e.g. semantic_hasher) when a method is invoked on it.

Contributor Author

Filed PLT-1826 to track the decoupling. The likely fix is to remove type_converter and semantic_hasher from StarfixArrowHasher.__init__ and accept them only at method-invocation time (e.g. hash_table(table, *, type_converter=None, semantic_hasher=None)), resolving lazily from get_default_context() when omitted. No code change in this PR.
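
One possible shape for the decoupling proposed above, sketched under stated assumptions. `ToyContext`, `ToyArrowHasher`, and the local `get_default_context()` are hypothetical stand-ins; the real fix is tracked in PLT-1826 and may look different.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class ToyContext:
    """Stand-in for the default data context."""
    type_converter: Any
    semantic_hasher: Any


_DEFAULT_CONTEXT = ToyContext("default-converter", "default-hasher")


def get_default_context() -> ToyContext:
    """Stand-in for the project's get_default_context()."""
    return _DEFAULT_CONTEXT


class ToyArrowHasher:
    """Built without collaborators, so construction needs nothing circular."""

    def __init__(self, hasher_id: str) -> None:
        self.hasher_id = hasher_id

    def hash_table(
        self,
        table: Any,
        *,
        type_converter: Optional[Any] = None,
        semantic_hasher: Optional[Any] = None,
    ) -> Tuple[str, Any, Any]:
        # Resolve lazily, and only for collaborators the caller did not supply.
        if type_converter is None or semantic_hasher is None:
            ctx = get_default_context()
            if type_converter is None:
                type_converter = ctx.type_converter
            if semantic_hasher is None:
                semantic_hasher = ctx.semantic_hasher
        # A real implementation would hash here; the sketch reports what it resolved.
        return (self.hasher_id, type_converter, semantic_hasher)


arrow_hasher = ToyArrowHasher("arrow_v0.1")
defaults_used = arrow_hasher.hash_table(None)
override_used = arrow_hasher.hash_table(None, semantic_hasher="custom-hasher")
```

The constructor takes only `hasher_id`, so the JSON spec could create the arrow hasher and the semantic hasher in either order.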

kurodo3 Bot commented Jun 25, 2026
Contributor Author

Review round summary

Fixed (commit 596333b):

  • Renamed the named component function_info_extractor → function_semantic_hasher in v0.1.json and context_schema.json, including all three _ref entries that pointed to it.

Deferred with dedicated issues:

  • UUID hashing (UUIDHandler → return ContentHash directly) — filed PLT-1828, linked to PLT-1812. Will be addressed together with the bytes representation strategy to avoid a split hash-value migration.
  • Bytes hashing (BytesHandler → hash raw bytes instead of hex string) — tracked in PLT-1812.
  • Multi-target handler registration (single handler for multiple types + MRO matching) — filed PLT-1827.
  • Circular dependency between StarfixArrowHasher and semantic_hasher/type_converter — filed PLT-1826. Proposed fix is to remove both from the constructor and accept them as optional keyword arguments at hash_table() call time, with lazy fallback to get_default_context().

The MRO of ``type(obj)`` is walked from most-derived to least-derived
(i.e. the object's own class first, then its bases). The first
match found in the registry is returned.
def get_semantic_hasher(self, obj: Any) -> "PythonTypeHandler | None":

Contributor

rename this back to get_handler

Contributor Author

Done in commit d71bf19 — renamed get_semantic_hasher back to get_handler. Updated all call sites in semantic_hasher.py, visitors.py, and both test files.


Useful when the caller already has the type and wants to check
registration without constructing a dummy instance.
def get_semantic_hasher_for_type(

Contributor

rename back to get_handler_for_type

Contributor Author

Done in commit d71bf19 — renamed get_semantic_hasher_for_type back to get_handler_for_type. Updated all call sites including the internal call inside has_handler and the test files.

"""
Return True if a handler is registered for *target_type* or any of
its MRO ancestors.
def has_semantic_hasher(self, target_type: type) -> bool:

Contributor

rename back to has_handler

Contributor Author

Done in commit d71bf19 — renamed has_semantic_hasher back to has_handler. Updated all call sites in visitors.py and both test files.
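
The three registry methods discussed in this thread, and the MRO walk their docstrings describe, can be sketched as follows. `ToyHandlerRegistry` and its `register()` method are hypothetical, and handlers are plain strings to keep the example short; the real PythonTypeHandlerRegistry may differ.

```python
from typing import Any, Dict, Optional


class ToyHandlerRegistry:
    def __init__(self) -> None:
        self._handlers: Dict[type, Any] = {}

    def register(self, target_type: type, handler: Any) -> None:
        self._handlers[target_type] = handler

    def get_handler_for_type(self, target_type: type) -> Optional[Any]:
        # __mro__ lists the class itself first, then its bases, ending at object.
        for cls in target_type.__mro__:
            if cls in self._handlers:
                return self._handlers[cls]
        return None

    def get_handler(self, obj: Any) -> Optional[Any]:
        return self.get_handler_for_type(type(obj))

    def has_handler(self, target_type: type) -> bool:
        return self.get_handler_for_type(target_type) is not None


class Base:
    pass


class Derived(Base):
    pass


registry = ToyHandlerRegistry()
registry.register(Base, "base-handler")
inherited = registry.get_handler(Derived())  # falls back to the Base entry

registry.register(Derived, "derived-handler")
specific = registry.get_handler(Derived())  # most-derived match wins
```

`get_handler_for_type` lets a caller check registration for a type without constructing an instance, as the docstring in the diff notes.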

hasher_id: str = _CURRENT_SEMANTIC_HASHER_ID,
strict: bool = True,
type_handler_registry: "hp.TypeHandlerRegistry | None" = None, # type: ignore[name-defined]
type_handler_registry: "Any | None" = None,

Contributor

why does this have to be "Any"?

Contributor Author

Fixed in commit d71bf19 — the Any was a placeholder while the right abstraction was unclear. Now that HandlerRegistryProtocol is defined in hashing_protocols.py (see comment below), the parameter is typed as hp.HandlerRegistryProtocol | None. The unused Any import was also removed.

def handle(self, obj: Any, hasher: "SemanticHasherProtocol") -> Any:
"""
Convert *obj* into a value that ``hash_object`` can process.
def handle(self, obj: Any, hasher: "SemanticAwarePythonHasher") -> Any:

Contributor

We should create and use protocol that corresponds to SemanticAwarePythonHasher -- otherwise we are creating unnecessary concrete type dependency here.

Contributor Author

Fixed in commit d71bf19: PythonTypeHandler.handle() now accepts SemanticHasherProtocol instead of the concrete SemanticAwarePythonHasher. The SemanticAwarePythonHasher import was removed from the TYPE_CHECKING block in hashing_protocols.py entirely.

Contributor Author

Already addressed — PythonTypeHandlerProtocol.handle() in hashing_protocols.py uses hasher: "SemanticHasherProtocol" as the parameter type (line 63 of the protocol definition). The concrete SemanticAwarePythonHasher is never referenced there.

@property
def type_handler_registry(self) -> "TypeHandlerRegistry":
"""Return the TypeHandlerRegistry used by this hasher."""
def type_handler_registry(self) -> "PythonTypeHandlerRegistry":

Contributor

Should PythonTypeHandlerRegistry here be defined in terms of Protocol that corresponds to HandlerRegistry instead?

Contributor Author

Done in commit d71bf19 — introduced HandlerRegistryProtocol in hashing_protocols.py that exposes get_handler(), get_handler_for_type(), and has_handler(). SemanticHasherProtocol.type_handler_registry now returns HandlerRegistryProtocol instead of the concrete PythonTypeHandlerRegistry. The PythonTypeHandlerRegistry import was removed from the TYPE_CHECKING block in hashing_protocols.py. The versioned_hashers.py Any annotation was also updated to HandlerRegistryProtocol | None as a result.

Contributor Author

Already addressed — SemanticHasherProtocol.type_handler_registry returns HandlerRegistryProtocol (line 123 of hashing_protocols.py). The concrete PythonTypeHandlerRegistry is not referenced in the protocol definition.
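
For readers following along, the lookup side of the registry protocol described in this thread might look roughly like the sketch below. The three method names come from the thread; the parameter names, return types, the MRO-based lookup, and the `DictRegistry` and `lookup_label` helpers are assumptions made for illustration only.

```python
from __future__ import annotations

from typing import Any, Protocol


class HandlerRegistryProtocol(Protocol):
    """Lookup side only; signatures are assumed, not copied from orcapod."""

    def get_handler(self, obj: Any) -> Any | None: ...
    def get_handler_for_type(self, target_type: type) -> Any | None: ...
    def has_handler(self, target_type: type) -> bool: ...


class DictRegistry:
    """Hypothetical registry that satisfies the protocol structurally."""

    def __init__(self) -> None:
        self._handlers: dict[type, Any] = {}

    def add(self, target_type: type, handler: Any) -> None:
        self._handlers[target_type] = handler

    def get_handler_for_type(self, target_type: type) -> Any | None:
        # Assumption: walk the MRO so subclasses inherit a base-class handler.
        for base in target_type.__mro__:
            if base in self._handlers:
                return self._handlers[base]
        return None

    def get_handler(self, obj: Any) -> Any | None:
        return self.get_handler_for_type(type(obj))

    def has_handler(self, target_type: type) -> bool:
        return self.get_handler_for_type(target_type) is not None


def lookup_label(registry: HandlerRegistryProtocol, obj: Any) -> str:
    """Consumer code that depends only on the protocol."""
    handler = registry.get_handler(obj)
    return "unhandled" if handler is None else str(handler)


registry = DictRegistry()
registry.add(int, "int-handler")
```

`lookup_label` never imports `DictRegistry`, so swapping in a different conforming registry requires no change on the consumer side.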

…ol, decouple type annotations

- PythonTypeHandlerRegistry: rename get_semantic_hasher → get_handler,
  get_semantic_hasher_for_type → get_handler_for_type,
  has_semantic_hasher → has_handler; update all call sites
- hashing_protocols: add HandlerRegistryProtocol abstracting over the
  concrete registry; SemanticHasherProtocol.type_handler_registry now
  returns HandlerRegistryProtocol instead of PythonTypeHandlerRegistry;
  PythonTypeHandler.handle() now uses SemanticHasherProtocol instead of
  the concrete SemanticAwarePythonHasher; remove concrete-class imports
  from TYPE_CHECKING block
- versioned_hashers: type type_handler_registry param as
  HandlerRegistryProtocol | None instead of Any | None; drop unused
  Any import
- Update test_hashing.py and test_semantic_hasher.py for renamed methods

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 25, 2026

Contributor Author

Round 2 review — all comments addressed in commit d71bf19

| Location | Comment | Action |
| --- | --- | --- |
| type_handler_registry.py:93 | rename get_semantic_hasher → get_handler | ✅ Done. Updated call sites in semantic_hasher.py, visitors.py, and both test files. |
| type_handler_registry.py:118 | rename get_semantic_hasher_for_type → get_handler_for_type | ✅ Done. Updated internal self-call in has_handler and all test files. |
| type_handler_registry.py:139 | rename has_semantic_hasher → has_handler | ✅ Done. Updated visitors.py and all test files. |
| versioned_hashers.py:52 | why Any? | ✅ Fixed. Now typed as hp.HandlerRegistryProtocol \| None. The unused Any import was also removed. |
| hashing_protocols.py:65 | handle() should use SemanticHasherProtocol | ✅ Done. PythonTypeHandler.handle() now accepts SemanticHasherProtocol. The SemanticAwarePythonHasher import removed from TYPE_CHECKING. |
| hashing_protocols.py:105 | use a protocol for the registry | ✅ Done. Added new HandlerRegistryProtocol to hashing_protocols.py with get_handler(), get_handler_for_type(), has_handler(). SemanticHasherProtocol.type_handler_registry now returns HandlerRegistryProtocol. PythonTypeHandlerRegistry import removed from TYPE_CHECKING. |

All 4269 tests pass (56 skipped, 13 xfailed).

the single source of truth for versioned component configuration.
"""

def __init__(

Contributor

shouldn't this be SemanticHasherProtocol?

Contributor Author

Fixed — content_identifiable_mixin.py now imports and uses SemanticHasherProtocol (from orcapod.protocols.hashing_protocols) everywhere the concrete SemanticAwarePythonHasher was referenced: the __init__ parameter type, the _get_hasher() return type, and all docstring references.

self,
hasher_id: str,
type_handler_registry: TypeHandlerRegistry | None = None,
type_handler_registry: PythonTypeHandlerRegistry | None = None,

Contributor

can this not be the protocol?

Contributor Author

Fixed — SemanticAwarePythonHasher.__init__ now accepts type_handler_registry: HandlerRegistryProtocol | None = None. The stored _registry attribute is typed as HandlerRegistryProtocol so the concrete PythonTypeHandlerRegistry (which satisfies the protocol) still works, as does any other conforming registry.

@property
def type_handler_registry(self) -> TypeHandlerRegistry:
"""Return the ``TypeHandlerRegistry`` used by this hasher."""
def type_handler_registry(self) -> PythonTypeHandlerRegistry:

Contributor

If we are going to take in protocol, this should be protocol being returned

Contributor Author

Fixed — the type_handler_registry property now has return type HandlerRegistryProtocol (imported from orcapod.protocols.hashing_protocols), consistent with the parameter change in __init__.
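
A rough sketch of the shape discussed in the two threads above: a hasher whose constructor accepts the protocol and whose property returns the same protocol type. `SketchHasher` and `EmptyRegistry` are hypothetical, the protocol is reduced to one method, and the fallback used when no registry is injected is an assumption.

```python
from __future__ import annotations

from typing import Protocol


class HandlerRegistryProtocol(Protocol):
    """Reduced to one method for brevity; the real protocol declares more."""

    def has_handler(self, target_type: type) -> bool: ...


class EmptyRegistry:
    """Hypothetical fallback registry that knows no handlers."""

    def has_handler(self, target_type: type) -> bool:
        return False


class SketchHasher:
    """Hypothetical hasher that takes in and hands back the protocol type."""

    def __init__(
        self,
        hasher_id: str,
        type_handler_registry: HandlerRegistryProtocol | None = None,
    ) -> None:
        self.hasher_id = hasher_id
        # Assumption: fall back to some default registry when none is injected.
        self._registry: HandlerRegistryProtocol = (
            type_handler_registry
            if type_handler_registry is not None
            else EmptyRegistry()
        )

    @property
    def type_handler_registry(self) -> HandlerRegistryProtocol:
        return self._registry


injected = EmptyRegistry()
hasher = SketchHasher(hasher_id="test", type_handler_registry=injected)
default_hasher = SketchHasher(hasher_id="default")
```

Keeping the parameter, the stored attribute, and the property on the same protocol type means a type checker never has to narrow back to a concrete registry.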

from orcapod.protocols.hashing_protocols import (
ArrowHasherProtocol,
TypeHandlerProtocol,
PythonTypeHandler,

Contributor

Explicitly call this "PythonTypeHandlerProtocol"

Contributor Author

Fixed: renamed PythonTypeHandler → PythonTypeHandlerProtocol in hashing_protocols.py. Updated all references across type_handler_registry.py, semantic_hasher.py, builtin_handlers.py, hashing/__init__.py, and semantic_hashing/__init__.py. As a related sweep, CallableWithPod in function_pod.py was also renamed to CallableWithPodProtocol to satisfy the project-wide 'all Protocol subclasses end with Protocol' convention. The strict-mode error message was also refined to read "no implementation of PythonTypeHandlerProtocol registered".
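
One way such a naming convention could be checked mechanically is sketched below. This is not something the PR adds; `misnamed_protocols` is a hypothetical helper, and the `__call__` signatures on the two demo classes are placeholders rather than the real definitions in function_pod.py.

```python
from typing import Protocol


def misnamed_protocols(namespace: dict[str, object]) -> list[str]:
    """Return names bound to direct Protocol subclasses that lack the suffix."""
    offenders = []
    for name, value in namespace.items():
        # A class is a protocol definition when Protocol is one of its direct bases.
        is_protocol_class = isinstance(value, type) and Protocol in value.__bases__
        if is_protocol_class and not name.endswith("Protocol"):
            offenders.append(name)
    return sorted(offenders)


class CallableWithPod(Protocol):  # old name, kept here only for the demo
    def __call__(self) -> None: ...


class CallableWithPodProtocol(Protocol):  # name after the rename
    def __call__(self) -> None: ...


offenders = misnamed_protocols(
    {
        "CallableWithPod": CallableWithPod,
        "CallableWithPodProtocol": CallableWithPodProtocol,
    }
)
```

In a real project the namespace would more likely come from `vars(module)` for each module under test.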

…ncrete types

- Rename PythonTypeHandler → PythonTypeHandlerProtocol everywhere: class
  definition in hashing_protocols.py, all type annotations in
  type_handler_registry.py, hashing/__init__.py export, and all docstring
  references across builtin_handlers.py and semantic_hashing/__init__.py
- Rename CallableWithPod → CallableWithPodProtocol in function_pod.py
- SemanticAwarePythonHasher.__init__ now accepts HandlerRegistryProtocol | None
  instead of PythonTypeHandlerRegistry | None; drop concrete-class import
- SemanticAwarePythonHasher.type_handler_registry property now returns
  HandlerRegistryProtocol instead of PythonTypeHandlerRegistry
- ContentIdentifiableMixin now imports and uses SemanticHasherProtocol instead
  of the concrete SemanticAwarePythonHasher for __init__ param and _get_hasher
  return type
- Update strict-mode error messages to say "no implementation of
  PythonTypeHandlerProtocol registered"; update matching test assertions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 25, 2026

Contributor Author

Round 3 review — changes made

All four comments addressed in commit e8129d7.

ContentIdentifiableMixin uses SemanticHasherProtocol (content_identifiable_mixin.py)
Replaced the concrete SemanticAwarePythonHasher import with SemanticHasherProtocol from orcapod.protocols.hashing_protocols. Updated the __init__ parameter type, the _get_hasher() return type annotation, and all docstring references.

SemanticAwarePythonHasher accepts / returns HandlerRegistryProtocol (semantic_hasher.py)
__init__ now takes type_handler_registry: HandlerRegistryProtocol | None = None and the type_handler_registry property now returns HandlerRegistryProtocol. The stored _registry attribute is also typed as HandlerRegistryProtocol, so any conforming registry (including PythonTypeHandlerRegistry) works without change.

Rename PythonTypeHandler → PythonTypeHandlerProtocol (hashing_protocols.py + all callers)
Renamed the protocol class in hashing_protocols.py and updated every reference across type_handler_registry.py, semantic_hasher.py, builtin_handlers.py, hashing/__init__.py, and semantic_hashing/__init__.py.

Bonus: enforce project-wide Protocol naming convention
CallableWithPod in function_pod.py renamed to CallableWithPodProtocol — the one remaining Protocol subclass that didn't end with Protocol.

Refined strict-mode error message
Now reads: "no implementation of PythonTypeHandlerProtocol registered for type …" (was "no PythonTypeHandler registered").

All 4269 tests pass, 56 skipped, 13 xfailed.

"""Hasher for ``types.UnionType`` objects (Python 3.10+ ``X | Y`` syntax)."""

def handle(self, obj: Any, hasher: "SemanticHasherProtocol") -> Any:
def handle(self, obj: Any, hasher: "SemanticAwarePythonHasher") -> Any:

Contributor

Shouldn't this also take in a protocol?

Contributor Author

Fixed — UnionTypeHandler.handle() now takes hasher: "SemanticHasherProtocol" (along with all other built-in handler handle() methods — see the "make all handle take in a protocol" comment).

from orcapod.contexts import get_default_context
return get_default_context().arrow_hasher # type: ignore[return-value]

def handle(self, obj: Any, hasher: "SemanticAwarePythonHasher") -> ContentHash:

Contributor

make this take in protocol

Contributor Author

Fixed — ArrowTableHandler.handle() now takes hasher: "SemanticHasherProtocol". Addressed together with all other handle() method signatures.

"""

def handle(self, obj: Any, hasher: "SemanticHasherProtocol") -> Any:
def handle(self, obj: Any, hasher: "SemanticAwarePythonHasher") -> Any:

Contributor

make all handle take in a protocol

Contributor Author

Fixed — all 11 handle() method signatures across every built-in handler class now use hasher: "SemanticHasherProtocol". The concrete SemanticAwarePythonHasher import has been removed from the TYPE_CHECKING block entirely; SemanticHasherProtocol is imported from orcapod.protocols.hashing_protocols instead.


def register_builtin_handlers(
registry: "TypeHandlerRegistry",
def register_builtin_python_type_handlers(

Contributor

this registry should be protocol

Contributor Author

Fixed — register_builtin_python_type_handlers() now takes registry: "HandlerRegistryProtocol" (imported from orcapod.protocols.hashing_protocols). The PythonTypeHandlerRegistry import in the TYPE_CHECKING block has been removed. The docstring was updated to match.

With an injected hasher (e.g. in tests)::

hasher = BaseSemanticHasher(hasher_id="test", strict=True)
hasher = SemanticHasherProtocol(hasher_id="test", strict=True)

Contributor

This makes no sense -- you cannot be instantiating protocol.

Contributor Author

Fixed — the docstring example now shows from orcapod.hashing.semantic_hashing.semantic_hasher import SemanticAwarePythonHasher followed by hasher = SemanticAwarePythonHasher(hasher_id="test", strict=True). SemanticHasherProtocol is a structural Protocol and cannot be instantiated directly.
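
The reviewer's point can be reproduced with plain `typing.Protocol`. The sketch below uses a simplified stand-in protocol and a hypothetical `ConcreteHasher`; only the fact that calling a Protocol class raises `TypeError` is standard Python behaviour.

```python
from typing import Any, Protocol


class SemanticHasherProtocol(Protocol):
    """Simplified stand-in for the real protocol."""

    def hash_object(self, obj: Any) -> str: ...


class ConcreteHasher:
    """Hypothetical concrete class playing the role of SemanticAwarePythonHasher."""

    def __init__(self, hasher_id: str, strict: bool = False) -> None:
        self.hasher_id = hasher_id
        self.strict = strict

    def hash_object(self, obj: Any) -> str:
        return f"{self.hasher_id}:{obj!r}"


def try_instantiate_protocol() -> str:
    """Attempt what the old docstring showed and report the outcome."""
    try:
        SemanticHasherProtocol(hasher_id="test", strict=True)
    except TypeError as exc:
        return type(exc).__name__
    return "no error"


outcome = try_instantiate_protocol()

# The protocol is still the right annotation; the concrete class is what gets built.
hasher: SemanticHasherProtocol = ConcreteHasher(hasher_id="test", strict=True)
```

So the protocol belongs in annotations, while examples that construct a hasher need the concrete class.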

- Replace all `handle()` hasher params with SemanticHasherProtocol
  (was SemanticAwarePythonHasher) across all 11 builtin handler classes
- Change register_builtin_python_type_handlers() registry param from
  PythonTypeHandlerRegistry to HandlerRegistryProtocol
- Remove concrete-class imports from TYPE_CHECKING block; import
  SemanticHasherProtocol and HandlerRegistryProtocol from protocols module
- Fix content_identifiable_mixin.py docstring example that incorrectly
  showed SemanticHasherProtocol being instantiated; replace with
  SemanticAwarePythonHasher (the concrete class)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 25, 2026

Contributor Author

Round 4 review — changes made

All five comments addressed in commit 7caa8af.

All handle() methods now take SemanticHasherProtocol (builtin_handlers.py)
All 11 built-in handler classes — PathHandler, UPathHandler, UUIDHandler, BytesHandler, FunctionHandler, TypeObjectHandler, SpecialFormHandler, GenericAliasHandler, UnionTypeHandler, ArrowTableHandler, SchemaHandler — now declare hasher: "SemanticHasherProtocol" in their handle() signatures. The concrete SemanticAwarePythonHasher is no longer imported in this file at all.

register_builtin_python_type_handlers() now accepts HandlerRegistryProtocol (builtin_handlers.py)
The registry parameter type was changed from "PythonTypeHandlerRegistry" to "HandlerRegistryProtocol". The TYPE_CHECKING block now imports SemanticHasherProtocol and HandlerRegistryProtocol from orcapod.protocols.hashing_protocols; the old concrete-class imports (SemanticAwarePythonHasher, PythonTypeHandlerRegistry) are removed.

Docstring example fixed (content_identifiable_mixin.py)
The "With an injected hasher" example previously showed SemanticHasherProtocol(hasher_id="test", ...) which is invalid since SemanticHasherProtocol is a structural Protocol and cannot be instantiated. Replaced with SemanticAwarePythonHasher(hasher_id="test", ...).

Also replied to two comments from review 4566882270 (hashing_protocols.py) that had not been replied to — both were already addressed in commit d71bf19.

All 4269 tests pass, 56 skipped, 13 xfailed.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 39 out of 39 changed files in this pull request and generated 2 comments.

Comment on lines +230 to 234
def register_builtin_python_type_handlers(
registry: "HandlerRegistryProtocol",
file_hasher: Any = None,
function_info_extractor: Any = None,
arrow_hasher: "ArrowHasherProtocol | None" = None,

Contributor Author

Fixed — added register() and __len__() to HandlerRegistryProtocol in hashing_protocols.py. The protocol now declares the full surface used by register_builtin_python_type_handlers(): lookup (get_handler, get_handler_for_type, has_handler), mutation (register), and size (__len__).
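
To illustrate why the mutation and size methods belong on the protocol, here is a small sketch. The `register(target_type, handler)` shape follows the description in this PR; `MiniRegistry`, `register_sketch_handlers`, and the string handlers are hypothetical, and the lookup methods are omitted for brevity.

```python
from __future__ import annotations

from typing import Any, Protocol


class HandlerRegistryProtocol(Protocol):
    """Mutation and size only; lookup methods are left out of this sketch."""

    def register(self, target_type: type, handler: Any) -> None: ...
    def __len__(self) -> int: ...


class MiniRegistry:
    """Hypothetical registry backed by a plain dict."""

    def __init__(self) -> None:
        self._handlers: dict[type, Any] = {}

    def register(self, target_type: type, handler: Any) -> None:
        self._handlers[target_type] = handler

    def __len__(self) -> int:
        return len(self._handlers)


def register_sketch_handlers(registry: HandlerRegistryProtocol) -> int:
    """Register two placeholder handlers and report how many entries were added."""
    before = len(registry)  # type-checks only because __len__ is on the protocol
    registry.register(bytes, "bytes-handler")
    registry.register(bytearray, "bytes-handler")
    return len(registry) - before


mini = MiniRegistry()
added = register_sketch_handlers(mini)
```

If `register` or `__len__` were missing from the protocol, a static type checker would reject the calls inside `register_sketch_handlers` even though they succeed at runtime.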

Comment thread tests/test_hashing/test_uuid_handler.py Outdated
Comment on lines +1 to +5
"""Tests for UUIDHandler hash() method behaviour.

Verifies that UUIDHandler returns the 16-byte binary representation of a
UUID, consistent with OrcaPod's canonical ``pa.binary(16)`` Arrow storage
format.
Verifies that UUIDHandler produces a ContentHash based on the 16-byte
binary representation of a UUID, consistent with OrcaPod's canonical
``pa.binary(16)`` Arrow storage format.

Contributor Author

Fixed — updated the module docstring from "Tests for UUIDHandler hash() method behaviour" to "Tests for UUIDHandler handle() dispatch via SemanticAwarePythonHasher". UUIDHandler implements handle(), not hash(), and the tests exercise SemanticAwarePythonHasher.hash_object() dispatch.
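
The 16-byte representation mentioned in the test docstring is what the standard library exposes as `UUID.bytes`. The sketch below shows that property and a digest derived from it; the use of SHA-256 and both helper names are assumptions for illustration, not the hashing scheme orcapod uses.

```python
import hashlib
import uuid


def uuid_to_binary16(value: uuid.UUID) -> bytes:
    """Return the 16-byte big-endian form, the shape a pa.binary(16) column stores."""
    return value.bytes


def sketch_uuid_digest(value: uuid.UUID) -> str:
    """Hypothetical digest over the binary form (SHA-256 chosen only for the demo)."""
    return hashlib.sha256(uuid_to_binary16(value)).hexdigest()


sample = uuid.UUID("12345678-1234-5678-1234-567812345678")
raw = uuid_to_binary16(sample)
round_tripped = uuid.UUID(bytes=raw)
```

Hashing the binary form rather than the string form means that differently formatted spellings of the same UUID yield the same digest.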

- Add register() and __len__() to HandlerRegistryProtocol so the protocol
  matches every method called on registry inside
  register_builtin_python_type_handlers(); previously HandlerRegistryProtocol
  only declared the lookup side of the interface (get_handler,
  get_handler_for_type, has_handler), leaving register() and len() untyped
- Fix test_uuid_handler.py module docstring: s/hash() method behaviour/
  handle() dispatch via SemanticAwarePythonHasher/ — UUIDHandler implements
  handle(), not hash(), and the tests exercise hash_object() dispatch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 25, 2026

Contributor Author

Round 5 review (Copilot) — changes made

Both comments addressed in commit 5c57b71.

HandlerRegistryProtocol now declares the full interface (hashing_protocols.py)
Added register(target_type, handler) and __len__() to HandlerRegistryProtocol. Previously the protocol only covered the lookup side (get_handler, get_handler_for_type, has_handler). Since register_builtin_python_type_handlers() calls both registry.register(...) and len(registry), the protocol must declare these too — otherwise type checkers correctly flag the calls as invalid on the protocol type.

Test module docstring corrected (tests/test_hashing/test_uuid_handler.py)
Changed first line from "Tests for UUIDHandler hash() method behaviour." to "Tests for UUIDHandler handle() dispatch via SemanticAwarePythonHasher.". UUIDHandler implements handle(), not hash(), and the tests exercise SemanticAwarePythonHasher.hash_object() dispatch end-to-end.

All 4269 tests pass, 56 skipped, 13 xfailed.

-------
SemanticHasherProtocol
A fully configured SemanticHasherProtocol instance.
Optional ``PythonTypeHandlerRegistry`` to inject. When None the

Contributor

update this doc string to match the function input parameter types. Make sure the same is done everywhere

Contributor Author

Fixed in commit b63ff2d. Updated all four places where docstrings/annotations referenced concrete type names instead of protocol names:

  • versioned_hashers.py: summary line SemanticAwarePythonHasher → SemanticHasherProtocol; type_handler_registry param description PythonTypeHandlerRegistry → HandlerRegistryProtocol
  • arrow_hashers.py: semantic_hasher param annotation and docstring SemanticAwarePythonHasher → SemanticHasherProtocol; updated TYPE_CHECKING import
  • visitors.py: same for python_hasher param in SemanticHashingVisitor
  • defaults.py: "owned by the active SemanticAwarePythonHasher" → "owned by the active SemanticHasherProtocol"

Update all docstrings and type annotations to consistently use protocol
types instead of concrete implementation types:

- versioned_hashers.py: fix summary ("SemanticAwarePythonHasher" →
  "SemanticHasherProtocol") and type_handler_registry param description
  ("PythonTypeHandlerRegistry" → "HandlerRegistryProtocol") to match
  the hp.HandlerRegistryProtocol annotation already on the parameter
- arrow_hashers.py: change semantic_hasher param annotation and docstring
  from SemanticAwarePythonHasher to SemanticHasherProtocol; update
  TYPE_CHECKING import accordingly
- visitors.py: same for python_hasher param in SemanticHashingVisitor
- defaults.py: update "owned by SemanticAwarePythonHasher" to
  "owned by SemanticHasherProtocol" in get_default_python_type_handler_registry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kurodo3

kurodo3 Bot commented Jun 25, 2026

Contributor Author

Round 6 review — response summary

Addressed eywalker's comment about docstring/annotation mismatches with a codebase-wide sweep.

Changes made (commit b63ff2d)

src/orcapod/hashing/versioned_hashers.py

  • Summary line: "SemanticAwarePythonHasher" → "SemanticHasherProtocol" (matches return type)
  • type_handler_registry param description: "PythonTypeHandlerRegistry" → "HandlerRegistryProtocol" (matches annotation)

src/orcapod/hashing/arrow_hashers.py

  • TYPE_CHECKING import: SemanticAwarePythonHasher → SemanticHasherProtocol
  • semantic_hasher param annotation: updated to "SemanticHasherProtocol"
  • Docstring: "SemanticAwarePythonHasher" → "SemanticHasherProtocol"

src/orcapod/hashing/visitors.py

  • TYPE_CHECKING import: SemanticAwarePythonHasher → SemanticHasherProtocol
  • python_hasher param annotation: updated to "SemanticHasherProtocol"
  • Docstring: updated to reference SemanticHasherProtocol

src/orcapod/hashing/defaults.py

  • "owned by the active SemanticAwarePythonHasher" → "owned by the active SemanticHasherProtocol"

The codebase now consistently uses protocol types (SemanticHasherProtocol, HandlerRegistryProtocol) in all annotations and docstrings, with concrete types (SemanticAwarePythonHasher, PythonTypeHandlerRegistry) appearing only where describing the concrete class itself.

@eywalker eywalker merged commit 2ee52de into extension-type-system Jun 25, 2026
11 checks passed
@eywalker eywalker deleted the eywalker/plt-1660-hard-cut-delete-old-semantic-type-system-and-wire-in branch June 25, 2026 03:34