
Wrong metadata json when creating iceberg table from clickhouse directly #1898

@alsugiliazova

I created a simple table using a build from upstream PR #1896:

   ┌─statement──────────────────────────────────────────────────────────────────────────────┐
1. │ CREATE TABLE default.`ns1.table1`                                                     ↴│
   │↳(                                                                                     ↴│
   │↳    `col1` Nullable(Int32)                                                            ↴│
   │↳)                                                                                     ↴│
   │↳ENGINE = Iceberg('http://minio:9000/warehouse/data2/', 'admin', '[HIDDEN]', 'Parquet')↴│
   │↳PARTITION BY col1                                                                     ↴│
   │↳ORDER BY col1                                                                          │
   └────────────────────────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.001 sec. 

I then inserted three values with a single INSERT and read them back:


SELECT *
FROM `ns1.table1`
ORDER BY col1 ASC

Query id: 5f00f445-4e94-4c1c-99b6-6f4867c7ab17

   ┌─col1─┐
1. │    1 │
2. │    2 │
3. │    3 │
   └──────┘

Issues with metadata:

  1. Missing current-snapshot-id
"refs" : {
    "main" : {
        "snapshot-id" : 2164490262916510684,
        "type" : "branch"
    }
},

The generated metadata has refs.main.snapshot-id but no top-level current-snapshot-id. I would expect it to contain:

"current-snapshot-id": 2164490262916510684

The spec describes current-snapshot-id as the table’s current snapshot ID and says it must match the current ID of the main branch in refs.
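
A minimal sketch of how this could be checked automatically, assuming the metadata file has been loaded as a Python dict. The function name `check_current_snapshot_id` is hypothetical and not part of any Iceberg or ClickHouse tooling; the sample input is just the `refs` fragment quoted above.

```python
import json


def check_current_snapshot_id(metadata: dict) -> list:
    """Compare the top-level current-snapshot-id with refs.main.snapshot-id.

    Returns a list of human-readable problems (empty if nothing was found).
    """
    problems = []
    main = metadata.get("refs", {}).get("main")
    if main is None:
        # No main branch recorded, so there is nothing to compare against.
        return problems
    current = metadata.get("current-snapshot-id")
    if current is None:
        problems.append("current-snapshot-id is missing but refs.main exists")
    elif current != main.get("snapshot-id"):
        problems.append(
            f"current-snapshot-id {current} != "
            f"refs.main.snapshot-id {main.get('snapshot-id')}"
        )
    return problems


# The refs fragment from this issue, with nothing else filled in.
fragment = json.loads(
    '{"refs": {"main": {"snapshot-id": 2164490262916510684, "type": "branch"}}}'
)
print(check_current_snapshot_id(fragment))
```

Adding `"current-snapshot-id": 2164490262916510684` to the same dict makes the check return an empty list.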

  2. parent-snapshot-id: -1 should be omitted. The spec says parent-snapshot-id is optional and “omitted for any snapshot with no parent.”
  3. metadata-log probably should not point to the current metadata file
"metadata-log": [
  {
    "metadata-file": "/data2/metadata/v2.metadata.json",
    "timestamp-ms": 1781007525135
  }
]

Iceberg’s metadata-log is meant to track previous metadata files, not the current one. If this JSON file is itself v2.metadata.json, then the log entry should usually point to v1.metadata.json, or be empty/omitted for a first metadata file. The spec says each new metadata file adds the previous metadata file location to the log.
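
The two points above could be checked with something like the following sketch. Both function names are hypothetical. The sample dict is illustrative only: the `metadata-log` entry is copied from this issue, but pairing `parent-snapshot-id: -1` with this particular snapshot ID, and treating `v2.metadata.json` as the file being inspected, are assumptions.

```python
def snapshots_with_sentinel_parent(metadata: dict) -> list:
    """Return IDs of snapshots that write parent-snapshot-id as -1.

    The field is expected to be omitted when a snapshot has no parent.
    """
    return [
        s["snapshot-id"]
        for s in metadata.get("snapshots", [])
        if s.get("parent-snapshot-id") == -1
    ]


def self_referencing_log_entries(metadata: dict, current_file: str) -> list:
    """Return metadata-log entries that point at the file being inspected."""
    return [
        e["metadata-file"]
        for e in metadata.get("metadata-log", [])
        if e.get("metadata-file") == current_file
    ]


# Illustrative input (see the assumptions in the paragraph above).
sample = {
    "snapshots": [
        {"snapshot-id": 2164490262916510684, "parent-snapshot-id": -1}
    ],
    "metadata-log": [
        {
            "metadata-file": "/data2/metadata/v2.metadata.json",
            "timestamp-ms": 1781007525135,
        }
    ],
}

print(snapshots_with_sentinel_parent(sample))
print(self_referencing_log_entries(sample, "/data2/metadata/v2.metadata.json"))
```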

  4. location may be questionable
"location": "/data2/"

This may work in a local filesystem setup, but it depends on the catalog/engine. In many Iceberg setups this would be something like:

"location": "file:/data2"

or

"location": "s3://bucket/path/table"

The spec treats location as the table’s base location; for newer format descriptions, it notes that when present it must be an absolute path.
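
As a rough heuristic (not a rule taken from the spec), a location without a URI scheme can be flagged for review. The helper name `location_has_scheme` is hypothetical; it only uses `urllib.parse.urlparse` from the standard library.

```python
from urllib.parse import urlparse


def location_has_scheme(location: str) -> bool:
    """Return True if the location carries a URI scheme such as s3 or file."""
    return urlparse(location).scheme != ""


# The three locations mentioned in this issue.
for loc in ["/data2/", "file:/data2", "s3://bucket/path/table"]:
    print(loc, location_has_scheme(loc))
```

A scheme-less value such as "/data2/" is not necessarily invalid, but it is the case most likely to be interpreted differently by different catalogs and engines.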

Labels: antalya, bug