feat(writer): embed iceberg.schema in Parquet footer metadata #2724

Open
viirya wants to merge 1 commit into apache:main from viirya:fix/2184-iceberg-schema-footer

viirya (Member) commented Jun 27, 2026

Which issue does this PR close?

Closes #2184.

What changes are included in this PR?

Engines such as Snowflake resolve an Iceberg table's schema from the `iceberg.schema` key in a Parquet file's footer key-value metadata. iceberg-rust didn't write this key, so Parquet files it produced (or files produced by nimtable compaction on top of it) were rejected by those engines.

This PR writes the Iceberg schema as JSON under the `iceberg.schema` footer key when the Parquet writer is initialized, matching iceberg-java. I verified the Java behavior against the source:

  • `parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java` writes `meta("iceberg.schema", SchemaParser.toJson(schema))` unconditionally in `WriteBuilder.build()`.
  • The value is the full Iceberg `Schema` JSON, the same representation that appears in table metadata's `schemas` array. iceberg-rust's `serde_json::to_string(&schema)` produces that same JSON.

Implementation: in `ParquetWriter`, right after the underlying `AsyncArrowWriter` is lazily created, `append_key_value_metadata` is called with `iceberg.schema` → schema JSON. The call is unconditional, matching Java (the schema is always present).
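
A rough, std-only sketch of the flow described above. `FooterWriter` and `LazyWriter` are hypothetical stand-ins, not types from iceberg-rust or the parquet crate; they only model the shape of the change (create the inner writer lazily, then stamp the schema JSON into the footer exactly once). The real code operates on `AsyncArrowWriter` and a serialized `Schema`.

```rust
// Hypothetical stand-in for the underlying writer: the footer is modeled as
// an ordered list of key/value pairs.
#[derive(Debug, Default)]
struct FooterWriter {
    key_value_metadata: Vec<(String, String)>,
}

impl FooterWriter {
    // Models the idea of appending one footer entry; the signature here is
    // an assumption made for this sketch.
    fn append_key_value_metadata(&mut self, key: &str, value: String) {
        self.key_value_metadata.push((key.to_string(), value));
    }
}

const ICEBERG_SCHEMA_KEY: &str = "iceberg.schema";

// Hypothetical wrapper that creates its inner writer on first use.
struct LazyWriter {
    schema_json: String,
    inner: Option<FooterWriter>,
}

impl LazyWriter {
    fn new(schema_json: String) -> Self {
        Self { schema_json, inner: None }
    }

    // Creates the inner writer on the first call and writes the schema key
    // right away; later calls reuse the writer and add nothing.
    fn writer(&mut self) -> &mut FooterWriter {
        if self.inner.is_none() {
            let mut w = FooterWriter::default();
            w.append_key_value_metadata(ICEBERG_SCHEMA_KEY, self.schema_json.clone());
            self.inner = Some(w);
        }
        self.inner.as_mut().expect("inner writer was just initialized")
    }
}
```

Tying the append to the lazy-initialization branch is what keeps the key from being written more than once per file, regardless of how many batches go through the writer.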

Scope

Parquet writer only. iceberg-java writes the same `iceberg.schema` key from its Avro writer as well, but in iceberg-rust the Avro writer produces manifests (metadata), not the data files these engines query, so it's out of scope here. Adding it to the Avro path can be a follow-up if there's a need.

Are these changes tested?

New test `test_parquet_writer_embeds_iceberg_schema_in_footer`: it writes a Parquet file through `ParquetWriter`, reads the footer back, and asserts that the `iceberg.schema` key is present and that its JSON value round-trips to the written `Schema`.
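
The read-side check can be sketched without any dependencies. The `KeyValue` struct below is a local stand-in (the parquet crate's footer entries likewise pair a key with an optional string value), and `footer_value` / `footer_has_schema` are hypothetical helper names. This sketch compares raw JSON strings, which is a weaker check than the actual test's comparison of the deserialized `Schema`.

```rust
// Local stand-in for one footer entry: a key with an optional value.
#[derive(Debug, Clone, PartialEq)]
struct KeyValue {
    key: String,
    value: Option<String>,
}

// Returns the value stored under `key`, or None if the key is absent
// or present without a value.
fn footer_value<'a>(entries: &'a [KeyValue], key: &str) -> Option<&'a str> {
    entries
        .iter()
        .find(|kv| kv.key == key)
        .and_then(|kv| kv.value.as_deref())
}

// True when the footer carries exactly the schema JSON that was written.
fn footer_has_schema(entries: &[KeyValue], expected_json: &str) -> bool {
    footer_value(entries, "iceberg.schema") == Some(expected_json)
}
```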

All `writer::file_writer::parquet_writer` tests pass (no regressions), the full `iceberg` lib suite (1372 tests) passes, and clippy and rustfmt are clean.

Engines such as Snowflake resolve an Iceberg table's schema from the
`iceberg.schema` key in a Parquet file's footer key-value metadata. iceberg-rust
did not write this key, so Parquet files it produced (or files produced by
nimtable compaction on top of it) were rejected by those engines.

Write the Iceberg schema as JSON under the `iceberg.schema` footer key when the
Parquet writer is initialized, matching iceberg-java (`Parquet.java`). The value
is the same schema JSON that appears in table metadata's `schemas`, produced via
`serde_json::to_string`.

Scope is the Parquet writer only. iceberg-java writes the same key from its Avro
writer too, but in iceberg-rust the Avro writer produces manifests (metadata),
not the data files these engines query, so that is left as a follow-up.

Closes apache#2184

Development

Successfully merging this pull request may close these issues.

Add iceberg.schema to footer for engine compatibility
