Skip to content

Spark: Add selective shredded variant extraction Parquet readers#16714

Open
qlong wants to merge 2 commits into
apache:mainfrom
qlong:variant-extraction-parquet-io
Open

Spark: Add selective shredded variant extraction Parquet readers#16714
qlong wants to merge 2 commits into
apache:mainfrom
qlong:variant-extraction-parquet-io

Conversation

@qlong

@qlong qlong commented Jun 8, 2026

Copy link
Copy Markdown

Changes

This PR is part of the work to support variant extraction pushdown, the core change is to introduce new parquet readers to read selected variant paths instead of the whole variant:

  • Add selective Parquet readers (ParquetVariantExtractionReaders, VariantExtractionPathResolver) to read only shredded typed_value columns for requested extraction paths.
  • Add Spark row reader adapter (SparkVariantExtractionReaders, SparkParquetReaders) to materialize extraction slots from the engine read schema instead of full variant blobs.
  • Wire engine read schema from SparkBatch through SparkInputPartition to RowDataReader only (row Parquet path).
  • Update PruneColumnsWithoutReordering so annotated extraction structs map back to Iceberg VARIANT columns in the scan projection.

Issue: #16726

Note for reviewers

End to end testing

Requires #16715 for end-to-end testing. To try the full pushdown + selective read path without merging locally, use this branch:

https://github.com/qlong/iceberg/tree/variant-extraction-integration-test

Test results

Use 1-day Github activities data, ingested as json and shredded variants with 299 shreddred columns.

Baseline: gha-payload-iceberg-20260605 · variant + extraction pushdown ON + selective shredded variant extraction Parquet readers
Compare A: same run with payload stored as string_json
Compare B: gha-payload-iceberg-nopushed-20260605 · variant + pushdown OFF, read whole variant
Median of 3 timed runs per query (Spark Time taken:).

Query Variant + pushdown (s) string_json (s) Δ vs baseline Variant no-pushdown (s) Δ vs baseline
c-q01 2.605 2.373 −8.9% 2.945 +13.0%
c-q04 3.875 6.412 +65.5% 72.082 +1760%
c-q05b 3.506 5.683 +62.1% 39.154 +1017%
c-q06 4.668 6.583 +41.0% 76.935 +1548%
c-q07 4.714 4.490 −4.8% 75.033 +1492%
c-q08 3.701 4.707 +27.2% 87.059 +2252%
c-q09 5.102 6.668 +30.7% 72.560 +1322%
c-q10 4.395 6.568 +49.4% 68.179 +1451%
c-q11 4.495 6.384 +42.0% 67.985 +1413%
c-q12 3.995 4.284 +7.2% 70.509 +1665%
c-q13 3.911 4.060 +3.8% 39.614 +913%
c-q14 4.769 5.450 +14.3% 63.331 +1228%
Total (Σ) 49.74 63.66 +28.0% 735.39 +1379%

Co-authored with Claude Sonnet 4.6

- Add selective Parquet readers (ParquetVariantExtractionReaders,
  VariantExtractionPathResolver) to read only shredded typed_value
  columns for requested extraction paths.
- Add Spark row reader adapter (SparkVariantExtractionReaders,
  SparkParquetReaders) to materialize extraction slots from the engine
  read schema instead of full variant blobs.
- Wire engine read schema from SparkBatch through SparkInputPartition to
  RowDataReader only (row Parquet path).
- Update PruneColumnsWithoutReordering so annotated extraction structs
  map back to Iceberg VARIANT columns in the scan projection.

Issue: apache#16726
@qlong qlong force-pushed the variant-extraction-parquet-io branch from 8fdff7c to 2185f2a Compare June 8, 2026 16:02
@qlong

qlong commented Jun 8, 2026

Copy link
Copy Markdown
Author

@rdblue @steveloughran @nssalian PTAL when you get a chance.

}

@SuppressWarnings("checkstyle:CyclomaticComplexity")
private static Object toSparkValue(VariantValue value, DataType targetType) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something, but this looks like it overlaps with Spark's variant_get casting logic. toSparkValue handles the common cases, but it seems separate from Spark's existing behavior around failOnError, timeZoneId, and some cast edge cases.

Would it make sense to reuse Spark's cast path here if we can bridge from Iceberg's VariantValue to Spark's Variant / VariantVal?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review. I assume you were referring VariantGet.cast in spark. toSparkValue in the connector is  required according to  DSV2 extraction pushdown contract. When Spark pushes extraction down, it delegates extraction and cast to the connector, the engine no longer calls VarianGett.cast one the values returned from connector.  The bridge from iceberg's VariantValue to spark's Variant already exists, it is triggered when extaction pushdown was rejected and connector returns the whole variant. It is expensive for shredded typed value, as they are put back into VariantValue then immediately extraced by Spark again.

The cast logic in Spark is more general than needed here, it handles cross-type coercisons that do not apply to typed_value, with the exception of type narrowing overflow. I added a fix for that.

import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.apache.iceberg.relocated.com.google.common.collect.Streams;

public class PathUtil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken, this utility is only used for variant extraction/shredding paths. Should we rename it to something like VariantPathUtil or VariantPath so it is clear this is not a general Iceberg path utility?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, agree that VariantPathUtil is a better name. I am going to defer the change for now, since this file is copied from #15384 to avoid stacked PRs. There are other in-fligh PRs that also copy this file. I will put up a follow up PR to rename once they are merged.

Comment on lines +107 to +110
if (segment instanceof PathSegment.Name) {
parts.add(((PathSegment.Name) segment).name());
} else if (segment instanceof PathSegment.Index) {
parts.add("[" + ((PathSegment.Index) segment).index() + "]");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this loses some path semantics. PathUtil.parse distinguishes object names from array indexes, but parseObjectPath flattens both into string parts. For example, $[0] means array element 0, while $['[0]'] means an object field whose name is literally [0]; after flattening, both are represented as the same string segment "[0]".

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call out. I removed parseObjectPath, and pass List through the Parquet extraction readers. Added tests

import org.apache.iceberg.relocated.com.google.common.collect.Lists;
import org.apache.iceberg.relocated.com.google.common.collect.Streams;

public class PathUtil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly broader question: would it make sense to use an existing JSONPath parser here and then validate that the parsed path is within Iceberg's supported subset? This is probably the fourth project where I have seen a new VariantPath-style implementation over the past year, so I am a bit worried about adding another one unless we keep the scope very explicit.

For example, we could allow simple/singular paths like field access and array indexes, while rejecting wildcards, recursive descent, slices, and filter expressions for now.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is iceberg has strict dependency hygiene, adding new lib would require review and could add transtive dependences. The parser from #15384 is intentionlly minimal.. We probably should look into dedicated lib if we need to support wildcard, filter expressions.

@qlong qlong force-pushed the variant-extraction-parquet-io branch 2 times, most recently from a9e7ef1 to 4de911e Compare June 12, 2026 19:09
Address review feedback:

- Remove parseObjectPath and related PathUtil string helpers so PathUtil
  stays aligned with the companion pushdown branch.
- Pass List<PathSegment> through the Parquet extraction readers, path
  resolver, and Spark wiring to preserve array index vs object fieldq
  semantics during navigation.
- Return null if the target type is narrower than the value type and
  overflows.
@qlong qlong force-pushed the variant-extraction-parquet-io branch from 4de911e to 4227e75 Compare June 12, 2026 19:24
@qlong qlong requested a review from tmater June 15, 2026 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants