Spark: Add selective shredded variant extraction Parquet readers by qlong · Pull Request #16714 · apache/iceberg

qlong · 2026-06-08T00:15:21Z

Changes

This PR is part of the work to support variant extraction pushdown. The core change introduces new Parquet readers that read only the selected variant paths instead of the whole variant:

Add selective Parquet readers (ParquetVariantExtractionReaders, VariantExtractionPathResolver) to read only shredded typed_value columns for requested extraction paths.
Add Spark row reader adapter (SparkVariantExtractionReaders, SparkParquetReaders) to materialize extraction slots from the engine read schema instead of full variant blobs.
Wire engine read schema from SparkBatch through SparkInputPartition to RowDataReader only (row Parquet path).
Update PruneColumnsWithoutReordering so annotated extraction structs map back to Iceberg VARIANT columns in the scan projection.
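
As a rough illustration of what reading "only shredded typed_value columns" means, the sketch below maps an object-field extraction path to the Parquet leaf column it would need. It assumes the Parquet variant shredding layout, where each shredded object field is a group nested under its parent's typed_value. The class and method names are hypothetical and are not the ones added in this PR. Array indexes are left out, and a real reader also has to consult the sibling value column when a field was not shredded.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of this PR.
final class ShreddedColumnPathSketch {
  private ShreddedColumnPathSketch() {}

  // Builds the dotted Parquet column path of the shredded leaf for an object-field path.
  // Each field adds one ".typed_value.<field>" step; the leaf itself is a typed_value column.
  static String typedValueColumn(String variantColumn, List<String> fieldNames) {
    StringBuilder path = new StringBuilder(variantColumn);
    for (String field : fieldNames) {
      path.append(".typed_value.").append(field);
    }
    return path.append(".typed_value").toString();
  }

  public static void main(String[] args) {
    // "payload" and the field names are made-up examples.
    System.out.println(typedValueColumn("payload", Arrays.asList("actor", "login")));
  }
}
```

Projecting only such leaves is what lets the reader skip the remaining shredded columns and the full variant blob.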

Issue: #16726

Note for reviewers

PathUtil.java is mostly copied from the existing PR Api: Support variant extract and fix manifest bounds byte order #15384; I will rebase once that PR is merged.
To reduce the scope, the new selective readers are only wired in for batch row scans. We can wire them into other readers as a follow-up.
To reduce the scope, only the most commonly used data types are supported for extraction. Extracting arrays and structs / nested structs is not supported. Requesting shredded columns for unsupported types leads to reading the whole variant (extraction pushdown is rejected).
Merge order:
1. Spark: Add selective shredded variant extraction Parquet readers #16714
2. Spark: Implement variant extraction pushdown for shredded VARIANT columns #16715

End to end testing

Requires #16715 for end-to-end testing. To try the full pushdown + selective read path without merging locally, use this branch:

https://github.com/qlong/iceberg/tree/variant-extraction-integration-test

Test results

Uses 1 day of GitHub activity data, ingested both as JSON and as shredded variants with 299 shredded columns.

Baseline: gha-payload-iceberg-20260605 · variant + extraction pushdown ON + selective shredded variant extraction Parquet readers
Compare A: same run with payload stored as string_json
Compare B: gha-payload-iceberg-nopushed-20260605 · variant + pushdown OFF, read whole variant
Median of 3 timed runs per query (Spark Time taken:).

Query	Variant + pushdown (s)	string_json (s)	string_json Δ vs baseline	Variant no-pushdown (s)	no-pushdown Δ vs baseline
c-q01	2.605	2.373	−8.9%	2.945	+13.0%
c-q04	3.875	6.412	+65.5%	72.082	+1760%
c-q05b	3.506	5.683	+62.1%	39.154	+1017%
c-q06	4.668	6.583	+41.0%	76.935	+1548%
c-q07	4.714	4.490	−4.8%	75.033	+1492%
c-q08	3.701	4.707	+27.2%	87.059	+2252%
c-q09	5.102	6.668	+30.7%	72.560	+1322%
c-q10	4.395	6.568	+49.4%	68.179	+1451%
c-q11	4.495	6.384	+42.0%	67.985	+1413%
c-q12	3.995	4.284	+7.2%	70.509	+1665%
c-q13	3.911	4.060	+3.8%	39.614	+913%
c-q14	4.769	5.450	+14.3%	63.331	+1228%
Total (Σ)	49.74	63.66	+28.0%	735.39	+1379%
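
The Δ columns are consistent with (other − baseline) / baseline, so a positive value means the comparison run was slower than the baseline. A small sketch of that arithmetic, including the median-of-three step, is below. The class name is made up, and the three timings passed to median in main are illustrative values, not measurements from this PR.

```java
import java.util.Arrays;

// Illustrative arithmetic only; not part of the benchmark harness.
final class BenchmarkDeltaSketch {
  private BenchmarkDeltaSketch() {}

  // Median of the timed runs for one query.
  static double median(double... runs) {
    double[] sorted = runs.clone();
    Arrays.sort(sorted);
    int mid = sorted.length / 2;
    return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
  }

  // Relative change of a comparison run against the baseline, in percent.
  static double deltaPercent(double baselineSeconds, double otherSeconds) {
    return (otherSeconds - baselineSeconds) / baselineSeconds * 100.0;
  }

  public static void main(String[] args) {
    // Made-up timings to show the median step.
    System.out.println(median(3.9, 3.8, 4.1));
    // c-q04 medians taken from the table: baseline 3.875 s, string_json 6.412 s, no-pushdown 72.082 s.
    System.out.println(deltaPercent(3.875, 6.412));
    System.out.println(deltaPercent(3.875, 72.082));
  }
}
```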

Co-authored with Claude Sonnet 4.6


qlong · 2026-06-08T17:36:15Z

@rdblue @steveloughran @nssalian PTAL when you get a chance.

tmater · 2026-06-10T08:05:11Z

+  }
+
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")
+  private static Object toSparkValue(VariantValue value, DataType targetType) {


I may be missing something, but this looks like it overlaps with Spark's variant_get casting logic. toSparkValue handles the common cases, but it seems separate from Spark's existing behavior around failOnError, timeZoneId, and some cast edge cases.

Would it make sense to reuse Spark's cast path here if we can bridge from Iceberg's VariantValue to Spark's Variant / VariantVal?

Thanks for the review. I assume you are referring to VariantGet.cast in Spark. toSparkValue in the connector is required by the DSV2 extraction pushdown contract. When Spark pushes extraction down, it delegates both extraction and cast to the connector; the engine no longer calls VariantGet.cast on the values returned from the connector. The bridge from Iceberg's VariantValue to Spark's Variant already exists; it is used when extraction pushdown is rejected and the connector returns the whole variant. It is expensive for shredded typed values, as they are put back into a VariantValue and then immediately extracted by Spark again.

The cast logic in Spark is more general than needed here: it handles cross-type coercions that do not apply to typed_value, with the exception of type-narrowing overflow. I added a fix for that.
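
For readers unfamiliar with the narrowing case, here is a minimal sketch of the idea: when the requested type is narrower than the stored value's type and the value does not fit, return null rather than a silently wrapped number. The class and method names are hypothetical and this is not the code in toSparkValue.

```java
// Hypothetical illustration of null-on-overflow narrowing; not the PR's implementation.
final class NarrowingCastSketch {
  private NarrowingCastSketch() {}

  // Narrow a 64-bit value to int, or return null if it is out of range.
  static Integer toIntOrNull(long value) {
    if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
      return null;
    }
    return (int) value;
  }

  // Narrow a 64-bit value to short, or return null if it is out of range.
  static Short toShortOrNull(long value) {
    if (value < Short.MIN_VALUE || value > Short.MAX_VALUE) {
      return null;
    }
    return (short) value;
  }

  public static void main(String[] args) {
    System.out.println(toIntOrNull(42L));      // fits in an int
    System.out.println(toIntOrNull(1L << 40)); // too large for an int
  }
}
```

A plain (int) cast of 1L << 40 would keep only the low 32 bits, which is why the range check comes first.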

tmater · 2026-06-10T08:46:20Z

+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Streams;

 public class PathUtil {


If I am not mistaken, this utility is only used for variant extraction/shredding paths. Should we rename it to something like VariantPathUtil or VariantPath so it is clear this is not a general Iceberg path utility?

You are correct, and I agree that VariantPathUtil is a better name. I am going to defer the change for now, since this file is copied from #15384 to avoid stacked PRs. There are other in-flight PRs that also copy this file. I will put up a follow-up PR to rename it once they are merged.

tmater · 2026-06-10T08:48:32Z

+      if (segment instanceof PathSegment.Name) {
+        parts.add(((PathSegment.Name) segment).name());
+      } else if (segment instanceof PathSegment.Index) {
+        parts.add("[" + ((PathSegment.Index) segment).index() + "]");


I think this loses some path semantics. PathUtil.parse distinguishes object names from array indexes, but parseObjectPath flattens both into string parts. For example, $[0] means array element 0, while $['[0]'] means an object field whose name is literally [0]; after flattening, both are represented as the same string segment "[0]".

Good callout. I removed parseObjectPath and now pass List<PathSegment> through the Parquet extraction readers. Added tests.
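
The collision described in the review can be reproduced with a few lines. The sketch below uses hypothetical stand-ins for PathSegment.Name and PathSegment.Index (the real classes live in the PR and may differ) and shows that flattening to strings makes $[0] and $['[0]'] indistinguishable, while the typed segments stay distinct.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for illustration; not the PathSegment types from the PR.
final class PathSegmentSketch {
  private PathSegmentSketch() {}

  interface Segment {}

  // Object field access, e.g. $.name or $['name'].
  static final class Name implements Segment {
    final String name;

    Name(String name) {
      this.name = name;
    }
  }

  // Array element access, e.g. $[0].
  static final class Index implements Segment {
    final int index;

    Index(int index) {
      this.index = index;
    }
  }

  // Lossy flattening in the style of the removed helper: both kinds become plain strings.
  static List<String> flatten(List<Segment> path) {
    return path.stream()
        .map(s -> s instanceof Name ? ((Name) s).name : "[" + ((Index) s).index + "]")
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Segment> arrayElement = Arrays.asList(new Index(0));    // $[0]
    List<Segment> oddFieldName = Arrays.asList(new Name("[0]")); // $['[0]']
    // The flattened forms compare equal, so the distinction is lost.
    System.out.println(flatten(arrayElement).equals(flatten(oddFieldName)));
    // The typed segments still differ by class.
    System.out.println(arrayElement.get(0).getClass() == oddFieldName.get(0).getClass());
  }
}
```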

tmater · 2026-06-10T09:08:29Z

+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Streams;

 public class PathUtil {


Slightly broader question: would it make sense to use an existing JSONPath parser here and then validate that the parsed path is within Iceberg's supported subset? This is probably the fourth project where I have seen a new VariantPath-style implementation over the past year, so I am a bit worried about adding another one unless we keep the scope very explicit.

For example, we could allow simple/singular paths like field access and array indexes, while rejecting wildcards, recursive descent, slices, and filter expressions for now.

My understanding is that Iceberg has strict dependency hygiene; adding a new library would require review and could add transitive dependencies. The parser from #15384 is intentionally minimal. We should probably look into a dedicated library if we need to support wildcards or filter expressions.
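
To make the "supported subset" in this thread concrete, here is a minimal sketch of a check that accepts only field access and array indexes and rejects wildcards, recursive descent, slices, and filter expressions. It is neither the parser from #15384 nor a JSONPath library; the class name and the exact grammar (identifier-style dotted names, non-negative indexes, single-quoted bracket names without embedded quotes) are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical subset check for illustration; the grammar here is an assumption.
final class SimplePathSubsetSketch {
  private SimplePathSubsetSketch() {}

  // "$" followed by any number of: .name | [digits] | ['quoted name']
  private static final Pattern SIMPLE_PATH =
      Pattern.compile("\\$(?:\\.[A-Za-z_][A-Za-z0-9_]*|\\[\\d+\\]|\\['[^']*'\\])*");

  // True only when the whole path is made of field accesses and array indexes.
  static boolean isSupported(String path) {
    return SIMPLE_PATH.matcher(path).matches();
  }

  public static void main(String[] args) {
    String[] candidates = {"$.a.b[0]", "$['[0]']", "$.a[*]", "$..a", "$.a[1:2]", "$.a[?(@.b)]"};
    for (String path : candidates) {
      System.out.println(path + " -> " + isSupported(path));
    }
  }
}
```

Anything the check rejects would fall back to reading the whole variant, which keeps the scope explicit without adding a dependency.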

Address review feedback:
- Remove parseObjectPath and related PathUtil string helpers so PathUtil stays aligned with the companion pushdown branch.
- Pass List<PathSegment> through the Parquet extraction readers, path resolver, and Spark wiring to preserve array index vs object field semantics during navigation.
- Return null if the target type is narrower than the value type and the value overflows.

github-actions Bot added API spark parquet labels Jun 8, 2026

qlong mentioned this pull request Jun 8, 2026

Spark: Implement variant extraction pushdown for shredded VARIANT columns #16715

Open

qlong force-pushed the variant-extraction-parquet-io branch 4 times, most recently from dde0bbe to 8fdff7c Compare June 8, 2026 01:19

qlong force-pushed the variant-extraction-parquet-io branch from 8fdff7c to 2185f2a Compare June 8, 2026 16:02

tmater reviewed Jun 10, 2026

qlong force-pushed the variant-extraction-parquet-io branch 2 times, most recently from a9e7ef1 to 4de911e Compare June 12, 2026 19:09

qlong force-pushed the variant-extraction-parquet-io branch from 4de911e to 4227e75 Compare June 12, 2026 19:24

qlong requested a review from tmater June 15, 2026 19:08

qlong wants to merge 2 commits into apache:main from qlong:variant-extraction-parquet-io
