fix: gate non-default collations for Spark 4 datetime expressions#4693
fix: gate non-default collations for Spark 4 datetime expressions#46930lai0 wants to merge 6 commits into
Conversation
| " `TimestampNTZType` is not supported because Comet incorrectly applies timezone" + | ||
| " conversion to TimestampNTZ values.") | ||
|
|
||
| override def getIncompatibleReasons(): Seq[String] = Seq(collationReason) |
There was a problem hiding this comment.
minor nit: this incompatibility will get documented for all Spark versions even though it is specific to Spark 4.x
I wonder if we can shim getIncompatibleReasons so it only applies for 4.x?
There was a problem hiding this comment.
Sure, makes sense! I will move it into CometTypeShim so the reason appears only on 4.x, just as hasNonDefaultStringCollation is already shimmed.
I'll push an update shortly. Thanks for review
| isUtc || CometConf.isExprAllowIncompat(getExprConfigName(expr)) | ||
| } | ||
| val canUseNative = nativeFormat.isDefined && | ||
| !expr.children.exists(c => hasNonDefaultStringCollation(c.dataType)) && { |
There was a problem hiding this comment.
This seems incorrect, or is at least inconsistent with other date/time functions. It seems there is no way to enable the native incompatible expression now due to the hasNonDefaultStringCollation check introduced here?
PR looks good other than this issue.
There was a problem hiding this comment.
Thanks for catching this!!
I missed it. The collation check in canUseNative was incorrectly blocking the native path even when DateFormatClass.allowIncompatible=true.
I removed the guard so that the non-default collation case is now handled by getSupportLevel() as Incompatible. When allowIncompatible is false, the request still routes through the dispatcher/fallback, but when it is true, the native path can be used.
I have also added regression coverage in CometCollationSuite for Spark 4.0 and 4.1 to handle cases involving date_format with a collated literal format and DateFormatClass.allowIncompatible=true.
Resolve CometDateFormat conflict by combining collation gating with the native opt-in and codegen-dispatch routing from main.
| "date_format does not support non-UTF8_BINARY collations") | ||
| } | ||
|
|
||
| test("date_format uses native path with collated format when allowIncompatible is enabled") { |
There was a problem hiding this comment.
minor nit: this test is duplicated between 4.1 and 4.2 shims and could move to the 4.1+ shim, which was recently introduced. We can do this later.
Which issue does this PR close?
Closes #4646.
Rationale for this change
Spark 4.0 adds
StringTypeWithCollationsupport to several datetime expression string inputs. Comet previously did not distinguish non-default collations for these inputs, so affected expressions could be planned as compatible even though native execution does not honor collation semantics.What changes are included in this PR?
convert_timezone, date_format, date_trunc, from_unixtime, make_timestamp, next_day, to_unix_timestamp, trunc, and unix_timestamp.LocalTableScanmasking the expression fallback.How are these changes tested?
./mvnw compile test-compile -Pspark-4.0 -DskipTests./mvnw test -Pspark-4.0 -Dtest=none -Dsuites=org.apache.spark.sql.CometCollationSuite./mvnw compile test-compile -Pspark-4.1 -DskipTests./mvnw test -Pspark-4.1 -Dtest=none -Dsuites=org.apache.spark.sql.CometCollationSuite