Skip to content

AWS: Handle duplicate column names in IcebergToGlueConverter comment map#16853

Open
Wyatt-Hawes wants to merge 2 commits into
apache:mainfrom
Wyatt-Hawes:aws-glue-converter-duplicate-column-comments
Open

AWS: Handle duplicate column names in IcebergToGlueConverter comment map#16853
Wyatt-Hawes wants to merge 2 commits into
apache:mainfrom
Wyatt-Hawes:aws-glue-converter-duplicate-column-comments

Conversation

@Wyatt-Hawes

Copy link
Copy Markdown

What

IcebergToGlueConverter.setTableInputInformation builds a name → comment map from the existing Glue table via Collectors.toMap(Column::name, Column::comment), which throws IllegalStateException on duplicate keys.

Intended Fix: Add an existing-entry-wins merge function so duplicates do not throw.

How this happens

The following RENAME COLUMN cycle (id → id__tmp → Id): (Used SPARK with Glue Data Catalog)
Note how the First and Last column names are identical if lowercased

  1. By design, toColumns walks metadata.schemas() and emits both historical and current names; case-sensitive dedupe lets Id and id through as distinct entries columns=[Id, id, id__tmp].
  2. AWS Glue lowercases Column.name on persist, collapsing them into two name="id" rows that both carry a non-null comment columns=[id, id, id__tmp].
    This is expected AWS Behavior.
  3. The next commit hits Collectors.toMap → duplicate-key throws IllegalStateException.
  4. The throw is swallowed by the outer catch (RuntimeException e), leaving tableInputBuilder.storageDescriptor unset.
  5. The engine callsglue.updateTable() with a null StorageDescriptor which deletes the tables StorageDescriptor.

The name collision itself isn't the bug trigger. Without comments, the column.comment() != null filter strips the colliding entries before toMap sees them.

Fix

First-seen-wins: toColumns writes current-schema entries before historical ones, so the current entry's comment is preserved.
Glue still receives duplicate-name Columns[] after a case-only rename, and addColumnWithDedupe is untouched.

Test

Adds testSetTableInputInformationWithDuplicateExistingColumnComments: builds an existingTable with two name="id" entries (each with a comment) and asserts a non-null StorageDescriptor with the first-seen comment.

Pre-fix fails on null storageDescriptor();
Post-fix passes.

All 13 tests in TestIcebergToGlueConverter pass


PR description drafted with AI

@github-actions github-actions Bot added the AWS label Jun 17, 2026
@Wyatt-Hawes Wyatt-Hawes marked this pull request as ready for review June 17, 2026 23:04
.collect(Collectors.toMap(Column::name, Column::comment));
.collect(
Collectors.toMap(
Column::name, Column::comment, (existing, duplicate) -> existing));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about leaving a code comment explaining why existing should win? For me, it's unclear when reading this code.

@Wyatt-Hawes Wyatt-Hawes Jun 18, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that a clarifying comment would be helpful here.

Updated the commit to add a comment clarifying why existing entries should win.

@Wyatt-Hawes Wyatt-Hawes force-pushed the aws-glue-converter-duplicate-column-comments branch from 3c44d6b to b650c50 Compare June 18, 2026 16:59

@CTTY CTTY left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the fix

@@ -266,10 +266,14 @@ static void setTableInputInformation(
Map<String, String> existingColumnMap = null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we should rename this map to columnNameToComment or existingColumnNameToComment, same goes for the naming of this map in toColumns and addColumnWithDedupe. The current name is confusing

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. The current name is misleading. Adding a second commit to this PR to resolve this.

}

@Test
public void testSetTableInputInformationWithDuplicateExistingColumnComments() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: How about just testDuplicateExistingColumnsKeepFirstComment

…t map

## What changed
`IcebergToGlueConverter.setTableInputInformation(TableInput.Builder, TableMetadata, Table)`
builds an `existingColumnMap` (column name -> comment) from the existing Glue
table's columns to recover comments for fields whose new Iceberg schema has no
`doc()`. The current `Collectors.toMap(Column::name, Column::comment)` throws
`IllegalStateException` on duplicate keys.

This change adds a first-seen-wins merge function so duplicate keys collapse
instead of throwing.

## Why
Duplicate `Column.name` entries can legitimately appear in an existing Glue
table's `Columns[]` after a case-only `RENAME COLUMN` cycle (e.g.
`id -> id__tmp -> Id`):

1. `RENAME COLUMN` only changes the field name; `field.doc()` is unchanged.
2. The converter walks `metadata.schemas()` in `toColumns` and emits
   historical names alongside the current name; case-sensitive dedupe lets
   `Id` and `id` through as distinct entries with the same comment.
3. Glue normalizes `Column.name` case-insensitively on persist, collapsing
   `Id` and `id` into two `name="id"` rows that both carry a non-null comment.
4. Next commit: the stream above hits `Collectors.toMap` -> duplicate-key throw.
5. The throw is swallowed by the outer `catch (RuntimeException e) { LOG.warn(...) }`,
   leaving `tableInputBuilder.storageDescriptor` unset.
6. `glue.updateTable()` is then called with `TableInput.storageDescriptor=null`,
   producing a destructive write -- every subsequent read sees a null storage
   descriptor.

The case-collision alone is not the bug; without comments the
`column.comment() != null` filter strips the colliding entries and the throw
never fires. The failure surfaces only when both colliding columns also have
non-null comments, because that's what the comment-recovery map keys on.

## Fix
First-seen-wins is the right semantic here: `toColumns` writes current-schema
entries before historical ones, so the current entry's comment is preserved on
the read-back. The converter's emit behavior is unchanged -- Glue still
receives duplicate-name `Columns[]` after a case-only rename, so time-travel
and Hive-fallback consumers depending on historical names appearing in
`Columns[]` are unaffected. `addColumnWithDedupe` is also untouched.

A case-insensitive write-side dedupe was considered and rejected -- it would
drop historical names from `Columns[]` and change the converter's documented
projection.

## Properties of this fix
- Resolves the `IllegalStateException` and the resulting null-storage-descriptor
  `UpdateTable` calls.
- Heals already-broken tables on the next commit -- no manual dedupe required.
- No change to the converter's emit behavior.
- `addColumnWithDedupe` is untouched.

## Test
Adds `TestIcebergToGlueConverter.testSetTableInputInformationWithDuplicateExistingColumnComments`,
which constructs an `existingTable` with the broken `Columns[]` shape (two
`name="id"` entries, both with comments) and asserts that
`setTableInputInformation` produces a non-null `StorageDescriptor` with the
correct first-seen comment. Pre-fix, the test fails on a null
`storageDescriptor()` (the destructive bug surfaced in-process); post-fix it
passes.

All 13 tests in `TestIcebergToGlueConverter` pass; spotless clean.
@Wyatt-Hawes Wyatt-Hawes force-pushed the aws-glue-converter-duplicate-column-comments branch from b650c50 to 10df000 Compare June 18, 2026 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants