[SPARK-57517][SQL] Fix schema_of_json to return proper error on non-string literal input by jubins · Pull Request #56582 · apache/spark

jubins · 2026-06-18T00:02:58Z

What is the purpose of the change

Fixes SPARK-57517 — schema_of_json throws a ClassCastException during analysis when called with a non-string literal (e.g., SELECT schema_of_json(42)), instead of surfacing a clean DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE error.

The root cause is in SchemaOfJson.checkInputDataTypes(): it references a lazy val json = child.eval().asInstanceOf[UTF8String] before verifying that the child's type is StringType. For an integer literal, the asInstanceOf[UTF8String] cast throws ClassCastException at analysis time rather than producing a user-facing error.

The companion functions schema_of_csv and schema_of_xml were fixed for the same issue in SPARK-52234, but schema_of_json was missed. This PR applies the same fix: restructuring checkInputDataTypes to check !foldable → eval() == null → dataType != StringType in safe order, and removing the unsafe lazy val entirely.
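
For readers who want to see why the ordering matters, the check chain described above can be modelled without any Spark dependency. This is a minimal sketch, not the actual SchemaOfJson code: Expr, CheckResult, Mismatch, and checkInput are hypothetical stand-ins, and the NON_FOLDABLE_INPUT and UNEXPECTED_NULL labels are assumed names for the two pre-existing branches (only UNEXPECTED_INPUT_TYPE is named in this PR).

```scala
object SchemaOfJsonCheckSketch {
  // Simplified stand-ins for Catalyst types; these are NOT Spark's real classes.
  sealed trait DataType
  case object StringType extends DataType
  case object IntegerType extends DataType

  // Hypothetical expression model: `foldable` means the value is known at analysis time.
  final case class Expr(dataType: DataType, foldable: Boolean, value: Any)

  sealed trait CheckResult
  case object Success extends CheckResult
  final case class Mismatch(errorSubClass: String) extends CheckResult

  // The value is only compared against null and never cast, so a non-string
  // literal reaches the type branch instead of raising ClassCastException.
  def checkInput(child: Expr): CheckResult =
    if (!child.foldable) Mismatch("NON_FOLDABLE_INPUT")       // assumed label
    else if (child.value == null) Mismatch("UNEXPECTED_NULL") // assumed label
    else if (child.dataType != StringType) Mismatch("UNEXPECTED_INPUT_TYPE")
    else Success

  def main(args: Array[String]): Unit = {
    println(checkInput(Expr(IntegerType, foldable = true, value = 42)))
    println(checkInput(Expr(StringType, foldable = true, value = "{\"a\": 1}")))
  }
}
```

In this model an integer literal falls through the first two branches and is rejected by the type comparison, which corresponds to the new UNEXPECTED_INPUT_TYPE branch.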

Brief change log

SchemaOfJson.checkInputDataTypes(): removed the lazy val json that performed an unsafe asInstanceOf[UTF8String] cast; restructured the condition chain to check for non-foldable input, null input, and wrong type (adding a new UNEXPECTED_INPUT_TYPE branch) before delegating to super.checkInputDataTypes()
Added select schema_of_json(42) to json-functions.sql input
Added corresponding DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE expected entries to analyzer-results/json-functions.sql.out and results/json-functions.sql.out

Verifying this change

This change is covered by golden file SQL query tests in SQLQueryTestSuite:

select schema_of_json(42) — verifies that a non-string integer literal produces DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE at analysis time (previously threw ClassCastException)
Existing tests for schema_of_json(null) and schema_of_json(nonFoldableColumn) continue to pass, confirming the null and non-foldable branches are unaffected

Does this pull request potentially affect one of the following parts

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no — SchemaOfJson is an internal catalyst expression
The serializers: no
The runtime per-record code paths (performance sensitive): no — only affects the analysis-time type check path
Anything that affects deployment or recovery: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no

If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?

Yes — Claude Code was used as a pair-programming assistant. All code was written, understood, and verified by the author.

Generated-by: Claude Opus 4.8

MaxGekk · 2026-06-18T09:36:07Z

LGTM overall — the fix is a faithful port of the SPARK-52234 change already in schema_of_csv / schema_of_xml (the condition chain is character-identical to both siblings, removing the unsafe asInstanceOf[UTF8String] cast), and the added schema_of_json(42) test mirrors the csv/xml suites.

One thing to fix before merge: the golden files need to be regenerated. CI (sql - extended tests) fails on select schema_of_json(42) because the hardcoded queryContext offset is off by one. The files have "stopIndex" : 24, but the analyzer emits 25: the closing parenthesis of schema_of_json(42) is at column 25, and the fragment string already spans 8→25, so 24 is internally inconsistent. This is the usual symptom of hand-editing a golden file instead of regenerating it.

Please regenerate rather than hand-editing:

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z json-functions.sql"

Then double-check the diff — only the two stopIndex values (24 → 25) in results/json-functions.sql.out and analyzer-results/json-functions.sql.out should change.
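
The offset arithmetic behind the 24 vs. 25 discrepancy can be checked by hand. The sketch below assumes, as the column numbers in this review imply, that startIndex and stopIndex are 1-based and inclusive; QueryContextOffsets and span are hypothetical helper names, not part of Spark.

```scala
object QueryContextOffsets {
  // Returns the 1-based, inclusive (startIndex, stopIndex) of `fragment`
  // within `sql`, or None when the fragment does not occur.
  def span(sql: String, fragment: String): Option[(Int, Int)] = {
    val zeroBased = sql.indexOf(fragment)
    if (zeroBased < 0) None
    else Some((zeroBased + 1, zeroBased + fragment.length))
  }

  def main(args: Array[String]): Unit = {
    // The fragment starts at zero-based offset 7 and is 18 characters long,
    // so the inclusive 1-based span is 8 to 25.
    println(span("select schema_of_json(42)", "schema_of_json(42)"))
  }
}
```

Under that assumption a stopIndex of 24 would point at the digit 2 rather than the closing parenthesis, which is why the hand-edited value disagrees with the fragment.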

jubins · 2026-06-18T17:42:08Z

Thanks for the review! Fixed: I updated stopIndex from 24 to 25 in both results/json-functions.sql.out and analyzer-results/json-functions.sql.out. That was the only diff.

[SPARK-57517][SQL] Fix schema_of_json to return proper error on non-string literal input (621823f)

jubins wants to merge 1 commit into apache:master from jubins:j-SPARK-57517-fix-class-cast-exception
