[SPARK-57531][SQL] Reuse OrcTail in non-vectorized ORC reader path #56591

Open
cxzl25 wants to merge 1 commit into apache:master from cxzl25:SPARK-57531

@cxzl25 cxzl25 commented Jun 18, 2026

Contributor

What changes were proposed in this pull request?

This PR makes OrcPartitionReaderFactory.buildReader mirror what buildColumnarReader has done since SPARK-44556: it captures the readerOptions returned by createORCReader and passes readerOptions.getOrcTail when constructing the per-split record reader, so the footer is not re-read.

Why are the changes needed?

OrcPartitionReaderFactory.buildReader (the non-vectorized, row-based read path) previously called OrcInputFormat.createRecordReader(fileSplit, taskAttemptContext), which internally calls OrcFile.createReader without an OrcTail and therefore re-parses the file footer from storage for every split.
The footer has already been read once by createORCReader, so without OrcTail reuse the non-vectorized path pays this cost a second time when it opens the data reader for each split. The vectorized path has avoided it since SPARK-44556.
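
A minimal sketch of the idea, for illustration only. It uses hypothetical stand-in classes (`FakeTail`, `FakeOrcFile`, `FakeReader`, `TailReuseDemo`) rather than the real ORC or Spark APIs, and simply counts how often a footer would be parsed with and without handing the already-parsed tail to the second reader:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a parsed ORC footer; NOT the real OrcTail class.
final class FakeTail {
    final long rowCount;
    FakeTail(long rowCount) { this.rowCount = rowCount; }
}

// Hypothetical reader that just holds on to the tail it was opened with.
final class FakeReader {
    final FakeTail tail;
    FakeReader(FakeTail tail) { this.tail = tail; }
}

// Hypothetical file handle that counts footer parses (each one models a storage read).
final class FakeOrcFile {
    final AtomicInteger footerReads = new AtomicInteger();

    private FakeTail parseFooter() {
        footerReads.incrementAndGet();
        return new FakeTail(1000L);
    }

    // Opens a reader; the footer is parsed only when no cached tail is supplied.
    FakeReader open(FakeTail cachedTail) {
        FakeTail tail = (cachedTail != null) ? cachedTail : parseFooter();
        return new FakeReader(tail);
    }
}

final class TailReuseDemo {
    // Old behaviour (modelled): the first reader and the per-split data reader
    // each parse the footer themselves.
    static int readsWithoutReuse() {
        FakeOrcFile file = new FakeOrcFile();
        file.open(null); // first open, e.g. to resolve the schema
        file.open(null); // per-split data reader parses the footer again
        return file.footerReads.get();
    }

    // New behaviour (modelled): the tail from the first open is passed along.
    static int readsWithReuse() {
        FakeOrcFile file = new FakeOrcFile();
        FakeReader first = file.open(null);
        file.open(first.tail); // reuses the parsed footer, no extra read
        return file.footerReads.get();
    }

    public static void main(String[] args) {
        System.out.println("footer reads without reuse: " + readsWithoutReuse());
        System.out.println("footer reads with reuse: " + readsWithReuse());
    }
}
```

In this toy model the reuse path parses the footer once per split instead of twice; the actual saving in Spark depends on the storage layer and the footer size.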

Does this PR introduce any user-facing change?

No

How was this patch tested?

GitHub Actions (GHA).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

@dongjoon-hyun dongjoon-hyun left a comment

Member

Thank you. Could you also update the DataSource v1 path, OrcFileFormat.scala?
