Merged
31 changes: 31 additions & 0 deletions DESIGN_ISSUES.md
@@ -1086,6 +1086,37 @@ element type. See PLT-1732 for full design.

---

## `src/orcapod/databases/connector_arrow_database.py`

### CA1 — SQL connectors silently lose Arrow extension-type field metadata on round-trip
**Status:** in progress
**Severity:** high
**Issue:** PLT-1795

`SQLiteConnector` (and any `DBConnectorProtocol` implementation that maps Arrow → SQL types)
does not preserve `ARROW:extension:name` / `ARROW:extension:metadata` field metadata. When a
column whose Arrow type is a `pa.ExtensionType` (e.g. `orcapod.path`, `orcapod.uuid`, or any
dataclass extension type) is written via `ConnectorArrowDatabase.add_records()` and then read
back, the column is returned as the raw storage type (e.g. `large_string`, `large_binary`,
`struct`) with no extension marker. Extension-typed columns therefore cannot round-trip
through a SQL connector, and the type information is lost silently.

**Interim fix (PLT-1659):** `ConnectorArrowDatabase.add_records()` now raises `ValueError`
immediately when any column is extension-typed, surfacing the issue at write
time rather than on a confusing read. Two representations are rejected:
- In-memory extension types: `isinstance(field.type, pa.ExtensionType)`.
- Metadata-only columns: plain storage type whose field metadata contains
`b"ARROW:extension:name"` (the representation produced when reading a Parquet/IPC file
with an unregistered extension type).

**Full fix (PLT-1795, target v0.2):** Preserve extension-type metadata in the SQL schema via
a companion metadata table (one row per column: `table_name`, `column_name`,
`extension_name`, `extension_metadata`). On `create_table_if_not_exists`, write rows for any
extension-typed columns; on `iter_batches`, join the metadata table and reconstruct the
`pa.ExtensionType` for affected columns before returning the batch. Once implemented, the
`ValueError` guard in `add_records()` can be lifted.
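
One possible shape for the companion table, sketched with the standard-library `sqlite3` module. This is not the PLT-1795 design: the table name `arrow_extension_columns`, the primary key, the column SQL types, and all function names are assumptions made for illustration. Rebuilding the `pa.ExtensionType` from the looked-up rows is left out.

```python
import sqlite3

# Hypothetical table name; the design note only specifies the four columns.
METADATA_TABLE = "arrow_extension_columns"


def ensure_metadata_table(conn: sqlite3.Connection) -> None:
    """Create the companion table if it does not exist yet."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {METADATA_TABLE} ("
        "table_name TEXT NOT NULL, "
        "column_name TEXT NOT NULL, "
        "extension_name TEXT NOT NULL, "
        "extension_metadata BLOB, "
        "PRIMARY KEY (table_name, column_name))"
    )


def record_extension_columns(
    conn: sqlite3.Connection,
    table_name: str,
    ext_columns: list[tuple[str, str, bytes]],
) -> None:
    """Write one row per extension-typed column of `table_name`."""
    ensure_metadata_table(conn)
    conn.executemany(
        f"INSERT OR REPLACE INTO {METADATA_TABLE} VALUES (?, ?, ?, ?)",
        [(table_name, col, name, meta) for col, name, meta in ext_columns],
    )


def load_extension_columns(
    conn: sqlite3.Connection, table_name: str
) -> dict[str, tuple[str, bytes]]:
    """Map column name to (extension_name, extension_metadata) for one table."""
    ensure_metadata_table(conn)
    rows = conn.execute(
        f"SELECT column_name, extension_name, extension_metadata "
        f"FROM {METADATA_TABLE} WHERE table_name = ?",
        (table_name,),
    ).fetchall()
    return {col: (name, meta) for col, name, meta in rows}


conn = sqlite3.connect(":memory:")
# Write path: would run alongside table creation for extension-typed columns.
record_extension_columns(conn, "results", [("input_path", "orcapod.path", b"{}")])
# Read path: the lookup tells the reader which columns to re-wrap.
lookup = load_extension_columns(conn, "results")
```

The sketch uses a separate lookup rather than a SQL join so that the data query stays unchanged; either approach gives the reader the same per-column information.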

---

## `src/orcapod/semantic_types/universal_converter.py`

### UC1 — `python_type_to_arrow_type` raised on `typing.Any` from empty-container inference
28 changes: 28 additions & 0 deletions src/orcapod/databases/connector_arrow_database.py
@@ -244,6 +244,34 @@ def add_records(
                f"got {rid_type}. Encode the column to bytes before calling add_records()."
            )

        # Reject Arrow extension-typed columns: SQL connectors do not preserve
        # ARROW:extension:* field metadata, so extension types would be silently
        # dropped on read, making round-trips impossible. Use DeltaTableDatabase
        # or write directly to Parquet instead. See PLT-1795 for the planned fix.
        #
        # Two representations are checked:
        #   1. In-memory extension types: isinstance(field.type, pa.ExtensionType).
        #   2. Metadata-only extension columns: a plain Arrow type whose field metadata
        #      contains the b"ARROW:extension:name" key. This arises when reading a
        #      Parquet/IPC file with an unregistered extension type — the array is
        #      decoded as its storage type but the metadata is preserved on the field.
        _EXT_NAME_KEY = b"ARROW:extension:name"
        ext_fields: list[tuple[str, str]] = []
        for field in records.schema:
            if isinstance(field.type, pa.ExtensionType):
                ext_fields.append((field.name, field.type.extension_name))
            elif field.metadata and _EXT_NAME_KEY in field.metadata:
                ext_fields.append((field.name, field.metadata[_EXT_NAME_KEY].decode("utf-8", errors="replace")))
        if ext_fields:
            ext_info = ", ".join(f"{name!r}: {ext_name!r}" for name, ext_name in ext_fields)
            raise ValueError(
                f"ConnectorArrowDatabase does not support Arrow extension-typed columns "
                f"({ext_info}). SQL connectors do not preserve ARROW:extension:* field "
                f"metadata, so extension types would be silently dropped on read. "
                f"Use DeltaTableDatabase or write directly to Parquet instead. "
                f"See PLT-1795 for the planned fix."
            )
Comment on lines +267 to +273

Contributor Author

Fixed. Added `TestExtensionTypeWriteGuard` in `tests/test_databases/test_connector_arrow_database.py` with three tests: `test_rejects_in_memory_extension_type_column`, `test_rejects_metadata_only_extension_column`, and `test_plain_column_not_rejected`. The first test registers a minimal custom `pa.ExtensionType` for the duration of the test (cleaned up with `pa.unregister_extension_type` in a `finally` block).


        records = self._deduplicate_within_table(records)
        record_key = self._get_record_key(record_path)
        input_ids = set(cast(list[bytes], records[self.RECORD_ID_COLUMN].to_pylist()))