[SPARK-57536][SS] Use maxOption instead of sorted.lastOption in HDFSMetadataLog #56596

Open
sarutak wants to merge 1 commit into apache:master from sarutak:use-maxOption-in-HDFSMetadataLog

@sarutak sarutak commented Jun 18, 2026

Member

What changes were proposed in this pull request?

This PR replaces listBatches.sorted.lastOption with listBatches.maxOption in HDFSMetadataLog.getLatestBatchId() and HDFSMetadataLog.getLatest().

Why are the changes needed?

The intent of the code is to find the maximum batch ID. sorted.lastOption sorts the entire array in O(n log n) to retrieve only the maximum element, while maxOption achieves the same result in O(n). These methods are called on every micro-batch in Structured Streaming, so avoiding unnecessary sorting reduces overhead for long-running streaming jobs with many batches.
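The equivalence claimed above can be sketched in isolation. This is a minimal illustration, not the actual `HDFSMetadataLog` code: the object name, method names, and sample IDs are hypothetical, and the array stands in for whatever `listBatches` returns. It assumes Scala 2.13 or later, where `maxOption` is available.

```scala
// Hypothetical stand-in for the batch-ID lookup; not the Spark source.
object LatestBatchIdSketch {
  // Before: sorts a copy of the whole array (O(n log n)), then takes the last element.
  def latestViaSort(batchIds: Array[Long]): Option[Long] =
    batchIds.sorted.lastOption

  // After: a single linear pass (O(n)) with no intermediate sorted copy.
  def latestViaMax(batchIds: Array[Long]): Option[Long] =
    batchIds.maxOption

  def main(args: Array[String]): Unit = {
    val ids = Array(3L, 0L, 7L, 1L) // made-up batch IDs, deliberately unordered
    println(latestViaSort(ids))
    println(latestViaMax(ids))
    // Both forms return None for an empty array instead of throwing.
    println(latestViaMax(Array.empty[Long]))
  }
}
```

Both helpers return an `Option[Long]`, so callers that already handle the empty case with `sorted.lastOption` need no change.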

Does this PR introduce any user-facing change?

No.

How was this patch tested?

GA (GitHub Actions).

Was this patch authored or co-authored using generative AI tooling?

Kiro CLI / Claude

…etadataLog

### What changes were proposed in this pull request?

Replace `listBatches.sorted.lastOption` with `listBatches.maxOption` in `HDFSMetadataLog.getLatestBatchId()` and `HDFSMetadataLog.getLatest()`.

### Why are the changes needed?

The intent of the code is to find the maximum batch ID. `sorted.lastOption` sorts the entire array in O(n log n) to retrieve only the maximum element, while `maxOption` achieves the same result in O(n). These methods are called on every micro-batch in Structured Streaming, so avoiding unnecessary sorting reduces overhead for long-running streaming jobs with many batches.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

Yes.

@dongjoon-hyun dongjoon-hyun left a comment

Member

+1, LGTM. Good catch. In general, the two are not equivalent: among elements that tie for the maximum, sorted.lastOption returns the last one and maxOption returns the first. However, in this context of batch IDs, the change is correct. Thank you, @sarutak.
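The first-max versus last-max distinction only becomes visible when elements compare as equal but are otherwise distinguishable. A small sketch, assuming Scala 2.13 and using a hypothetical `Entry` type and ordering that do not appear in the PR:

```scala
// Hypothetical type: two entries can share an id yet differ in tag.
final case class Entry(id: Long, tag: String)

object TieBehaviorSketch {
  // Compare by id only, so entries with equal ids tie under this ordering.
  implicit val byId: Ordering[Entry] = Ordering.by[Entry, Long](_.id)

  val entries: Seq[Entry] = Seq(Entry(3L, "a"), Entry(1L, "b"), Entry(3L, "c"))

  // sorted is stable, so among tied maxima the one that appeared last ends up last.
  val viaSort: Option[Entry] = entries.sorted.lastOption

  // maxOption keeps the element it already holds when the next one ties with it.
  val viaMax: Option[Entry] = entries.maxOption

  def main(args: Array[String]): Unit = {
    println(viaSort.map(_.tag))
    println(viaMax.map(_.tag))
  }
}
```

With plain `Long` batch IDs, two equal values are indistinguishable, so whichever tied element is picked makes no observable difference.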

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-57536][SS] Use maxOption instead of sorted.lastOption in HDFSMetadataLog [SPARK-57536][SS] Use maxOption instead of sorted.lastOption in HDFSMetadataLog Jun 18, 2026
2 participants