feat: Add native collect_list aggregate support #4727
Draft
peterxcli wants to merge 5 commits into
hsiang-c
reviewed
Jun 25, 2026
 * binary). However, the native Comet aggregate produces the actual state type (e.g.,
 * ArrayType(elementType) for CollectSet). This method corrects the output schema to match the
 * native state types so the shuffle exchange schema is consistent with the actual data.
 * For intermediate aggregates containing TypedImperativeAggregate functions (like CollectSet or
Contributor
👍 Could we add a few SQL file tests? See https://datafusion.apache.org/comet/contributor-guide/sql-file-tests.html#sql-file-tests. Thank you!
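The schema correction described in the doc comment under review can be illustrated with a toy model. The sketch below is plain Python with hypothetical stand-in classes (`BinaryType`, `IntegerType`, `ArrayType`, `Attribute`) and a hypothetical `native_element_types` map; none of these are Spark's or Comet's real APIs, and leaving non-intermediate stages untouched is an assumption made for illustration.

```python
from dataclasses import dataclass, replace

# Toy stand-ins for Spark SQL data types (hypothetical, not the real classes).
@dataclass(frozen=True)
class BinaryType:
    pass

@dataclass(frozen=True)
class IntegerType:
    pass

@dataclass(frozen=True)
class ArrayType:
    element_type: object
    contains_null: bool = True

@dataclass(frozen=True)
class Attribute:
    name: str
    data_type: object

# Aggregate modes whose output is still an intermediate buffer.
INTERMEDIATE_MODES = {"Partial", "PartialMerge"}

def adjust_output_for_native_state(output, modes, native_element_types):
    """Replace declared BinaryType buffers with the native array state type.

    output: attributes as declared for the stage (buffers typed as binary)
    modes: aggregate modes present in the stage
    native_element_types: hypothetical map of buffer name -> element type
    """
    # Assumption: only pure or mixed Partial/PartialMerge stages are adjusted.
    if not modes or not set(modes) <= INTERMEDIATE_MODES:
        return list(output)
    adjusted = []
    for attr in output:
        elem = native_element_types.get(attr.name)
        if elem is not None and isinstance(attr.data_type, BinaryType):
            # The native state is a list of elements, not a serialized blob.
            attr = replace(attr, data_type=ArrayType(elem, contains_null=True))
        adjusted.append(attr)
    return adjusted

declared = [Attribute("k", IntegerType()), Attribute("buf", BinaryType())]
mixed = adjust_output_for_native_state(
    declared, ["Partial", "PartialMerge"], {"buf": IntegerType()}
)
```

In this model the grouping key `k` keeps its type while `buf` becomes an array of integers, so the schema that a downstream exchange sees matches the data the native aggregate actually produces.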
…o multi-stage-distinct-combined-collect-list-collect-set
Which issue does this PR close?
Closes #4724.
Rationale for this change
collect_list/array_agg and collect_set use Spark TypedImperativeAggregate buffers that Spark declares as serialized BinaryType, while Comet’s native implementations keep the real aggregate state as an Arrow/Spark ArrayType.
The existing schema adjustment only handled simple two-stage Partial -> Final collect aggregates. Spark’s distinct-aggregate rewrite can introduce multi-stage plans with PartialMerge stages.
Without correcting the intermediate buffer schema for these stages, a fully-native pipeline can fail when native list state is treated as Spark binary state. This change makes the native array state round-trip through Partial, PartialMerge, and mixed {Partial, PartialMerge} stages so collect_list/collect_set can run fully native in distinct-combined aggregate plans.
What changes are included in this PR?
Adds native collect_list/array_agg aggregate support:
- CollectList to the aggregate expression proto.
- CollectList.
- CollectList -> CometCollectList.
- datafusion_spark::function::aggregate::collect::SparkCollectList.
Extends collect aggregate native-state schema adjustment:
- CometObjectHashAggregateExec.adjustOutputForNativeState to handle both CollectList and CollectSet.
- Partial, PartialMerge, and mixed {Partial, PartialMerge} stages, not only pure Partial.
- BinaryType buffer attributes to Comet’s native ArrayType(elementType, containsNull = true) state type.
Adds regression coverage for fully-native distinct-combined collect aggregates.
Updates expression support docs and aggregate audit notes for collect_list/array_agg.
How are these changes tested?
Ran: