test(extension-types): end-to-end round-trip integration tests (PLT-1659)#181
Conversation
…to ConnectorArrowDatabase SQL connectors do not preserve ARROW:extension:* field metadata, so writing extension-typed columns via ConnectorArrowDatabase would silently drop the extension type on read, making round-trips impossible. Adds an explicit ValueError guard in add_records() that fires immediately when any non-record-id column carries a pa.ExtensionType, surfacing the problem at write time with a message pointing to PLT-1795. Also adds DESIGN_ISSUES.md entry CA1 documenting the root cause, the interim guard, and the planned full fix (PLT-1795): a companion metadata table that persists extension-name/metadata alongside the SQL schema. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…T-1659) Covers two complementary angles: Arrow-level identity: register_python_class assigns each dataclass a unique extension name derived from its FQCN, so two same-shaped dataclasses produce different extension names. Also verifies idempotency (register twice → same name). Python-type-level compatibility: check_schema_compatibility correctly passes when types match and rejects when two same-shaped-but-different-named dataclasses are compared — the core guarantee that prevents silent data corruption. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sts (PLT-1659) Verifies two cache properties of LogicalTypeRegistry: - After load_extension_types on a Parquet file, the type is present in the fresh converter's registry (cache populated on first read). - reconstruct_from_arrow is called exactly once for the first read and zero additional times on the second read of the same file (registry hit short-circuits factory dispatch). Uses patch.object with autospec=True to correctly handle self binding when spying on an instance method. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation tests (PLT-1659) Adds 13 tests covering the full pipeline: Python object → write → storage → peek-schema → register → read → Python object Parametrised over Parquet and Delta backends (12 tests): - Built-in types: Path/orcapod.path, UPath/orcapod.upath, UUID/orcapod.uuid - Simple dataclass (_PointA): FQCN as extension name, Python object reconstructed - Two same-shaped dataclasses (_PointA vs _PointB): distinct extension names - Nested dataclass (_Outer/_Inner): both types registered transitively after read Delta Polars native-read test (1 test): - Write _PointA to Delta, read via pl.read_delta, assert Polars dtype is an extension type with the correct FQCN. Python object reconstruction via df.to_arrow() is intentionally not tested here — Polars strips __arrow_ext_metadata__ on export, making that path non-functional. The separate parametrised Delta tests cover full Python reconstruction. SQLite excluded: ConnectorArrowDatabase now raises ValueError on extension types (see companion fix in this branch). Delta read uses dt.file_uris() + pyarrow.dataset rather than DeltaTable.to_pyarrow_table(), which normalises large_string → string and breaks the storage-type-strict extension type deserializer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…tead of file_uris workaround DeltaTable.to_pyarrow_dataset(as_large_types=True) preserves large_string / large_binary rather than normalising to string / binary — the same approach used by DeltaTableDatabase._read_delta_table(). Replaces the previous workaround of reading underlying Parquet files directly via dt.file_uris() + pyarrow.dataset, which was correct but unnecessarily bypassed Delta's API. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds end-to-end integration coverage for the new Arrow/Polars extension type system (round-trip pipeline + schema compatibility + registry cache behavior), while also introducing an interim runtime guard to fail fast when writing extension-typed data through SQL-backed connectors that cannot preserve Arrow extension metadata.
Changes:
- Added 3 new integration test modules under
tests/test_extension_types/covering round-trips (Parquet/Delta), schema compatibility checks, and per-process registry cache behavior. - Added a
ValueErrorguard inConnectorArrowDatabase.add_records()to reject extension-typed writes (to surface SQL metadata-loss early). - Documented the SQL connector limitation and planned fix in
DESIGN_ISSUES.md, plus added internal plan/spec documents for PLT-1659.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
tests/test_extension_types/test_roundtrips.py |
New end-to-end round-trip integration tests across Parquet + Delta, including Polars native Delta read assertions. |
tests/test_extension_types/test_schema_compatibility.py |
New integration tests validating Arrow extension-name identity and Python-type compatibility checks. |
tests/test_extension_types/test_cache_behavior.py |
New integration tests asserting registry population and factory call short-circuiting on cache hits. |
src/orcapod/databases/connector_arrow_database.py |
Adds write-time rejection of extension-typed columns for SQL-backed connectors (interim safety guard). |
DESIGN_ISSUES.md |
Adds CA1 design issue documenting SQL metadata loss and the interim guard / planned full fix. |
superpowers/specs/2026-06-23-plt-1659-extension-type-roundtrip-integration-tests-design.md |
Design spec documenting intended integration test coverage and backend approach. |
superpowers/plans/2026-06-23-plt-1659-extension-type-roundtrip-integration-tests.md |
Implementation plan for adding the integration tests and related changes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Reject Arrow extension-typed columns: SQL connectors do not preserve | ||
| # ARROW:extension:* field metadata, so extension types would be silently | ||
| # dropped on read, making round-trips impossible. Use DeltaTableDatabase | ||
| # or write directly to Parquet instead. See PLT-1795 for the planned fix. | ||
| ext_fields = [ | ||
| field.name | ||
| for field in records.schema | ||
| if isinstance(field.type, pa.ExtensionType) | ||
| ] | ||
| if ext_fields: | ||
| ext_info = ", ".join( | ||
| f"{records.schema.field(n).name!r}: {records.schema.field(n).type.extension_name!r}" | ||
| for n in ext_fields | ||
| ) | ||
| raise ValueError( | ||
| f"ConnectorArrowDatabase does not support Arrow extension-typed columns " | ||
| f"({ext_info}). SQL connectors do not preserve ARROW:extension:* field " | ||
| f"metadata, so extension types would be silently dropped on read. " | ||
| f"Use DeltaTableDatabase or write directly to Parquet instead. " | ||
| f"See PLT-1795 for the planned fix." | ||
| ) |
There was a problem hiding this comment.
Fixed. The guard now checks both representations: (1) in-memory extension types via isinstance(field.type, pa.ExtensionType), and (2) metadata-only columns — plain storage type whose field metadata contains b"ARROW:extension:name" (the representation produced when reading Parquet with an unregistered extension type in the current process). New tests in TestExtensionTypeWriteGuard cover both rejection cases and a negative case for plain columns.
| raise ValueError( | ||
| f"ConnectorArrowDatabase does not support Arrow extension-typed columns " | ||
| f"({ext_info}). SQL connectors do not preserve ARROW:extension:* field " | ||
| f"metadata, so extension types would be silently dropped on read. " | ||
| f"Use DeltaTableDatabase or write directly to Parquet instead. " | ||
| f"See PLT-1795 for the planned fix." | ||
| ) |
There was a problem hiding this comment.
Fixed. Added TestExtensionTypeWriteGuard in tests/test_databases/test_connector_arrow_database.py with three tests: test_rejects_in_memory_extension_type_column, test_rejects_metadata_only_extension_column, and test_plain_column_not_rejected. The first test registers a minimal custom pa.ExtensionType for the duration of the test (cleaned up with pa.unregister_extension_type in a finally block).
| Each round-trip test is parameterised over two storage backends: | ||
|
|
||
| - ``parquet``: direct ``pyarrow.parquet`` write/read. | ||
| - ``delta``: ``deltalake.write_deltalake`` / ``DeltaTable.to_pyarrow_table()``. | ||
|
|
There was a problem hiding this comment.
Fixed. Module docstring now reads: DeltaTable.to_pyarrow_dataset(as_large_types=True).to_table().
| ### Built-in types: `Path`, `UPath`, `UUID` | ||
|
|
||
| Round-trip through all three storage backends. Assertions: | ||
| - Python object is faithfully reconstructed after read. | ||
| - Arrow extension names are in the `orcapod.*` namespace (`orcapod.path`, `orcapod.upath`, | ||
| `orcapod.uuid`). |
There was a problem hiding this comment.
Fixed. Updated to: "Round-trip through two storage backends (Parquet and Delta — SQLite excluded, see test_roundtrips.py note)."
| `test_roundtrips.py` parameterises over three storage backends via a `_StorageBackend` | ||
| dataclass with two callables: | ||
|
|
||
| ```python | ||
| @dataclasses.dataclass | ||
| class _StorageBackend: | ||
| name: str | ||
| write: Callable[[pa.Table, Path], None] | ||
| read: Callable[[Path, UniversalTypeConverter], pa.Table] | ||
| ``` | ||
|
|
||
| | `name` | `write` | `read` | | ||
| |---|---|---| | ||
| | `"parquet"` | `pq.write_table(table, path / "data.parquet")` | `converter.load_extension_types(pq.read_table(path / "data.parquet"))` | | ||
| | `"delta"` | `deltalake.write_deltalake(str(path / "delta"), table)` | `converter.load_extension_types(DeltaTable(str(path / "delta")).to_pyarrow_table())` | | ||
| | `"sqlite"` | `ConnectorArrowDatabase(SQLiteConnector(path / "db.sqlite")).add_record(...).flush()` | `ExtensionAwareDatabase(db, converter).get_all_records(...)` → drop `__record_id` column | | ||
|
|
There was a problem hiding this comment.
Fixed. The Backend Parameterisation section now lists two backends only (Parquet and Delta), drops the SQLite row with an explanation referencing DESIGN_ISSUES.md CA1 and PLT-1795, and updates the Delta read description to DeltaTable.to_pyarrow_dataset(as_large_types=True).to_table() with a note explaining why as_large_types=True is required.
| **Goal:** Add three new integration test files covering end-to-end extension type round-trips through Parquet, Delta Lake, schema compatibility, and per-process cache behaviour. | ||
|
|
||
| **Architecture:** Pure test-only change — no source files modified. Three focused test files: `test_roundtrips.py` (write/read through Parquet and Delta backends), `test_schema_compatibility.py` (Arrow-level identity + Python-type-level compatibility), `test_cache_behavior.py` (registry cache populated and skipped on second read). SQLite backend is excluded from value round-trip tests because `SQLiteConnector` does not preserve `ARROW:extension:*` field metadata; that pattern is already covered by `test_extension_aware_database.py`. | ||
|
|
There was a problem hiding this comment.
Fixed. The Architecture section now reads: "Three focused test files plus one source change and one docs update" and describes the ConnectorArrowDatabase.add_records() guard.
| No source files are modified. | ||
|
|
There was a problem hiding this comment.
Fixed. The File Map now includes two additional rows: Modify src/orcapod/databases/connector_arrow_database.py and Modify DESIGN_ISSUES.md, and the "No source files are modified" line has been removed.
| **Interim fix (PLT-1659):** `ConnectorArrowDatabase.add_records()` now raises `ValueError` | ||
| immediately when any non-record-id column carries an Arrow extension type (checked via | ||
| `isinstance(field.type, pa.ExtensionType)`), surfacing the issue at write time rather than | ||
| on a confusing read. |
There was a problem hiding this comment.
Fixed. CA1 now explicitly lists both rejection cases: (1) isinstance(field.type, pa.ExtensionType) for in-memory extension types, and (2) plain storage type whose field metadata contains b"ARROW:extension:name" for the metadata-only representation.
…lumns; add tests Address Copilot review comments on PR #181: - Broaden ConnectorArrowDatabase.add_records() guard to reject both in-memory pa.ExtensionType columns AND metadata-only extension columns (plain storage type with b"ARROW:extension:name" in field metadata, the representation produced when reading Parquet with an unregistered type). Previously only the isinstance(pa.ExtensionType) case was caught. - Add TestExtensionTypeWriteGuard in test_connector_arrow_database.py with three focused tests: rejects in-memory extension type, rejects metadata-only extension column, accepts plain columns without raising. - Fix test_roundtrips.py module docstring: Delta backend uses to_pyarrow_dataset(as_large_types=True).to_table(), not to_pyarrow_table(). - Update DESIGN_ISSUES.md CA1 to describe both rejection cases. - Update plan and spec files to reflect actual scope: plan File Map now lists the connector_arrow_database.py and DESIGN_ISSUES.md changes; Architecture section no longer claims "pure test-only"; spec Backend Parameterisation section corrected to two backends (Parquet + Delta, no SQLite) with the accurate Delta read API. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Review round 1 — changes madeAll 8 Copilot comments addressed in commit 9ccfcaa. Source / test changes
Documentation changes
|
| dataclass extension type) is written via `ConnectorArrowDatabase.add_records()` and then read | ||
| back, the column is returned as the raw storage type (e.g. `large_string`, `large_binary`, | ||
| `struct`) with no extension marker. This makes Parquet/Delta round-trips impossible through | ||
| the SQL backend and causes silent data-type loss. |
There was a problem hiding this comment.
Fixed: changed "makes Parquet/Delta round-trips impossible through the SQL backend" to "makes SQL connector round-trips impossible". The original wording was misleading — Parquet and Delta backends work fine; it is only the SQL connector path that cannot preserve extension-type metadata.
| ### `test_roundtrips.py` | ||
|
|
||
| #### Parameterised over all three backends | ||
|
|
There was a problem hiding this comment.
Fixed: changed "Parameterised over all three backends" to "Parameterised over both backends" in the Test Descriptions section of the spec.
| Each round-trip test is parameterised over two storage backends: | ||
|
|
||
| - ``parquet``: direct ``pyarrow.parquet`` write/read. | ||
| - ``delta``: ``deltalake.write_deltalake`` / ``DeltaTable.to_pyarrow_table()``. | ||
|
|
There was a problem hiding this comment.
Fixed: the module docstring description line for the delta backend in Task 4's code snippet now reads DeltaTable.to_pyarrow_dataset(as_large_types=True).to_table() instead of DeltaTable.to_pyarrow_table().
| def _delta_read(base_path: Path, converter: UniversalTypeConverter) -> pa.Table: | ||
| import deltalake | ||
| dt = deltalake.DeltaTable(str(base_path / "delta")) | ||
| raw = dt.to_pyarrow_table() | ||
| return converter.load_extension_types(raw) |
There was a problem hiding this comment.
Fixed: the _delta_read function body in Task 4's code snippet now uses dt.to_pyarrow_dataset(as_large_types=True).to_table() with an inline comment explaining that as_large_types=True is needed to prevent Delta Lake from normalising large_string/large_binary to string/binary.
| This test reads a Delta table back via `pl.read_delta` (Polars' native Delta reader) rather than `DeltaTable.to_pyarrow_table()`, verifying that extension type metadata survives the Polars path. | ||
|
|
||
| When the write-side converter calls `register_python_class(_PointA)`, it registers `_PointA` in both PyArrow's and Polars' **global** registries (as a side-effect of `registry.register_logical_type`). That global registration persists for the duration of the test process, so `pl.read_delta` can resolve `_PointA`'s extension type when reading the underlying Parquet files. | ||
|
|
||
| - [ ] **Step 1: Append the Delta Polars test to `test_roundtrips.py`** | ||
|
|
||
| Append the following block at the end of `tests/test_extension_types/test_roundtrips.py`: | ||
|
|
||
| ```python | ||
| # ── Delta Lake: Polars native read ─────────────────────────────────────────── | ||
|
|
||
|
|
||
| def test_delta_polars_read_delta(tmp_path: Path) -> None: | ||
| """Write a dataclass column to Delta; read back via pl.read_delta; extension type preserved. | ||
|
|
||
| The write-side converter registers _PointA in both PyArrow's and Polars' | ||
| global registries. pl.read_delta can then decode the column as the correct | ||
| extension type. load_extension_types on the resulting Arrow table registers | ||
| _PointA in the fresh read-side converter and wraps the column. | ||
| """ |
There was a problem hiding this comment.
Fixed: replaced the Task 5 snippet with the actual implementation. The old snippet used load_extension_types(df.to_arrow()) followed by arrow_table_to_python_dicts to attempt a full Python-object round-trip, but pl.DataFrame.to_arrow() exports Polars extension types with empty __arrow_ext_metadata__ bytes, so that path cannot reconstruct Python objects. The real test asserts only that the Polars column dtype is an extension type: col_dtype.is_extension() and col_dtype.ext_name() == fqcn. The docstring now explains this limitation explicitly.
- spec: "all three" → "both" backends in Test Descriptions heading - plan: add autospec=True to patch.object snippet in Task 3 - plan: update _delta_read snippet to use to_pyarrow_dataset(as_large_types=True).to_table() in both the module docstring and the function body (Task 4) - plan: rewrite test_delta_polars_read_delta snippet to reflect actual implementation (Polars dtype check via col_dtype.is_extension() / ext_name() instead of load_extension_types(df.to_arrow()) round-trip, Task 5) - DESIGN_ISSUES.md CA1: "non-record-id column" → "any column" (guard checks all fields) - DESIGN_ISSUES.md CA1: "Parquet/Delta round-trips impossible" → "SQL connector round-trips impossible" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Round 2 review — addressedAll 7 Copilot comments fixed in commit b20c8db (docs-only, no test/source changes).
|
…lumns; add tests Address Copilot review comments on PR #181: - Broaden ConnectorArrowDatabase.add_records() guard to reject both in-memory pa.ExtensionType columns AND metadata-only extension columns (plain storage type with b"ARROW:extension:name" in field metadata, the representation produced when reading Parquet with an unregistered type). Previously only the isinstance(pa.ExtensionType) case was caught. - Add TestExtensionTypeWriteGuard in test_connector_arrow_database.py with three focused tests: rejects in-memory extension type, rejects metadata-only extension column, accepts plain columns without raising. - Fix test_roundtrips.py module docstring: Delta backend uses to_pyarrow_dataset(as_large_types=True).to_table(), not to_pyarrow_table(). - Update DESIGN_ISSUES.md CA1 to describe both rejection cases. - Update plan and spec files to reflect actual scope: plan File Map now lists the connector_arrow_database.py and DESIGN_ISSUES.md changes; Architecture section no longer claims "pure test-only"; spec Backend Parameterisation section corrected to two backends (Parquet + Delta, no SQLite) with the accurate Delta read API. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lumns; add tests Address Copilot review comments on PR #181: - Broaden ConnectorArrowDatabase.add_records() guard to reject both in-memory pa.ExtensionType columns AND metadata-only extension columns (plain storage type with b"ARROW:extension:name" in field metadata, the representation produced when reading Parquet with an unregistered type). Previously only the isinstance(pa.ExtensionType) case was caught. - Add TestExtensionTypeWriteGuard in test_connector_arrow_database.py with three focused tests: rejects in-memory extension type, rejects metadata-only extension column, accepts plain columns without raising. - Fix test_roundtrips.py module docstring: Delta backend uses to_pyarrow_dataset(as_large_types=True).to_table(), not to_pyarrow_table(). - Update DESIGN_ISSUES.md CA1 to describe both rejection cases. - Update plan and spec files to reflect actual scope: plan File Map now lists the connector_arrow_database.py and DESIGN_ISSUES.md changes; Architecture section no longer claims "pure test-only"; spec Backend Parameterisation section corrected to two backends (Parquet + Delta, no SQLite) with the accurate Delta read API. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Python object → write → storage → peek-schema → register → read → Python object)ValueErrorguard inConnectorArrowDatabase.add_records()that fires immediately when extension-typed columns are passed, surfacing the SQLite metadata loss problem at write time (interim fix while PLT-1795 is pending)DESIGN_ISSUES.mdentry CA1 documenting the SQL connector field-metadata limitation and the planned full fixNew test files
test_roundtrips.py— 13 tests parametrised over Parquet and Delta backends:Path/orcapod.path,UPath/orcapod.upath,UUID/orcapod.uuid_PointA): FQCN as extension name, Python objects reconstructed_PointAvs_PointB): distinct extension names confirmed_Outer/_Inner): both types registered transitively on readpl.read_delta→ Polars dtype is the correct extension typetest_schema_compatibility.py— 4 tests:check_schema_compatibilitypasses for same type, rejects for same-shaped-but-different-named typestest_cache_behavior.py— 2 tests:load_extension_typescall on a fresh converterreconstruct_from_arrowis called exactly once; second read is a registry hit (factory skipped)Notable implementation details
_delta_readusesdt.file_uris()+pyarrow.datasetrather thanto_pyarrow_table()— Delta Lake normaliseslarge_string → stringin its schema layer, which breaks the storage-type-strict extension type deserializer; reading underlying Parquet files directly bypasses thispatch.object(..., autospec=True, wraps=...)required for spying on instance methods — withoutautospec=True,selfis not passed through to the wrapped functionDataclassLogicalTypeFactory— no stable FQCN)Deferred
list[MyDataclass]round-trip → PLT-1732 (requiresListLogicalType)ConnectorArrowDatabasenow raisesValueErroron extension-typed writes (interim guard); full fix in PLT-1795Test plan
uv run pytest tests/test_extension_types/ -q→ 241 passed, 1 xfaileduv run pytest tests/ -x -q --ignore=tests/test_semantic_types→ 3714 passed, 56 skipped, 6 xfailed (no regressions)Linear
Fixes PLT-1659
PLT-1795