AWS: Handle duplicate column names in IcebergToGlueConverter comment map#16853
AWS: Handle duplicate column names in IcebergToGlueConverter comment map#16853Wyatt-Hawes wants to merge 2 commits into
Conversation
| .collect(Collectors.toMap(Column::name, Column::comment)); | ||
| .collect( | ||
| Collectors.toMap( | ||
| Column::name, Column::comment, (existing, duplicate) -> existing)); |
There was a problem hiding this comment.
How about leaving a code comment explaining why existing should win? For me, it's unclear when reading this code.
There was a problem hiding this comment.
I agree that a clarifying comment would be helpful here.
Updated the commit to add a comment clarifying why existing entries should win.
3c44d6b to
b650c50
Compare
| @@ -266,10 +266,14 @@ static void setTableInputInformation( | |||
| Map<String, String> existingColumnMap = null; | |||
There was a problem hiding this comment.
nit: I think we should rename this map to columnNameToComment or existingColumnNameToComment, same goes for the naming of this map in toColumns and addColumnWithDedupe. The current name is confusing
There was a problem hiding this comment.
Agreed. The current name is misleading. Adding a second commit to this PR to resolve this.
| } | ||
|
|
||
| @Test | ||
| public void testSetTableInputInformationWithDuplicateExistingColumnComments() { |
There was a problem hiding this comment.
nit: How about just testDuplicateExistingColumnsKeepFirstComment
…t map
## What changed
`IcebergToGlueConverter.setTableInputInformation(TableInput.Builder, TableMetadata, Table)`
builds an `existingColumnMap` (column name -> comment) from the existing Glue
table's columns to recover comments for fields whose new Iceberg schema has no
`doc()`. The current `Collectors.toMap(Column::name, Column::comment)` throws
`IllegalStateException` on duplicate keys.
This change adds a first-seen-wins merge function so duplicate keys collapse
instead of throwing.
## Why
Duplicate `Column.name` entries can legitimately appear in an existing Glue
table's `Columns[]` after a case-only `RENAME COLUMN` cycle (e.g.
`id -> id__tmp -> Id`):
1. `RENAME COLUMN` only changes the field name; `field.doc()` is unchanged.
2. The converter walks `metadata.schemas()` in `toColumns` and emits
historical names alongside the current name; case-sensitive dedupe lets
`Id` and `id` through as distinct entries with the same comment.
3. Glue normalizes `Column.name` case-insensitively on persist, collapsing
`Id` and `id` into two `name="id"` rows that both carry a non-null comment.
4. Next commit: the stream above hits `Collectors.toMap` -> duplicate-key throw.
5. The throw is swallowed by the outer `catch (RuntimeException e) { LOG.warn(...) }`,
leaving `tableInputBuilder.storageDescriptor` unset.
6. `glue.updateTable()` is then called with `TableInput.storageDescriptor=null`,
producing a destructive write -- every subsequent read sees a null storage
descriptor.
The case-collision alone is not the bug; without comments the
`column.comment() != null` filter strips the colliding entries and the throw
never fires. The failure surfaces only when both colliding columns also have
non-null comments, because that's what the comment-recovery map keys on.
## Fix
First-seen-wins is the right semantic here: `toColumns` writes current-schema
entries before historical ones, so the current entry's comment is preserved on
the read-back. The converter's emit behavior is unchanged -- Glue still
receives duplicate-name `Columns[]` after a case-only rename, so time-travel
and Hive-fallback consumers depending on historical names appearing in
`Columns[]` are unaffected. `addColumnWithDedupe` is also untouched.
A case-insensitive write-side dedupe was considered and rejected -- it would
drop historical names from `Columns[]` and change the converter's documented
projection.
## Properties of this fix
- Resolves the `IllegalStateException` and the resulting null-storage-descriptor
`UpdateTable` calls.
- Heals already-broken tables on the next commit -- no manual dedupe required.
- No change to the converter's emit behavior.
- `addColumnWithDedupe` is untouched.
## Test
Adds `TestIcebergToGlueConverter.testSetTableInputInformationWithDuplicateExistingColumnComments`,
which constructs an `existingTable` with the broken `Columns[]` shape (two
`name="id"` entries, both with comments) and asserts that
`setTableInputInformation` produces a non-null `StorageDescriptor` with the
correct first-seen comment. Pre-fix, the test fails on a null
`storageDescriptor()` (the destructive bug surfaced in-process); post-fix it
passes.
All 13 tests in `TestIcebergToGlueConverter` pass; spotless clean.
… variable contents.
b650c50 to
10df000
Compare
What
IcebergToGlueConverter.setTableInputInformationbuilds aname → commentmap from the existing Glue table viaCollectors.toMap(Column::name, Column::comment), which throwsIllegalStateExceptionon duplicate keys.Intended Fix: Add an existing-entry-wins merge function so duplicates do not throw.
How this happens
The following
RENAME COLUMNcycle (id → id__tmp → Id): (Used SPARK with Glue Data Catalog)Note how the First and Last column names are identical if lowercased
toColumnswalksmetadata.schemas()and emits both historical and current names; case-sensitive dedupe letsIdandidthrough as distinct entriescolumns=[Id, id, id__tmp].Column.nameon persist, collapsing them into twoname="id"rows that both carry a non-null commentcolumns=[id, id, id__tmp].This is expected AWS Behavior.
Collectors.toMap→ duplicate-key throwsIllegalStateException.catch (RuntimeException e), leavingtableInputBuilder.storageDescriptorunset.glue.updateTable()with a nullStorageDescriptorwhich deletes the tables StorageDescriptor.The name collision itself isn't the bug trigger. Without comments, the
column.comment() != nullfilter strips the colliding entries beforetoMapsees them.Fix
First-seen-wins:
toColumnswrites current-schema entries before historical ones, so the current entry's comment is preserved.Glue still receives duplicate-name
Columns[]after a case-only rename, andaddColumnWithDedupeis untouched.Test
Adds
testSetTableInputInformationWithDuplicateExistingColumnComments: builds anexistingTablewith twoname="id"entries (each with a comment) and asserts a non-nullStorageDescriptorwith the first-seen comment.Pre-fix fails on null
storageDescriptor();Post-fix passes.
All 13 tests in
TestIcebergToGlueConverterpassPR description drafted with AI