Skip to content

[SPARK-57517][SQL] Fix schema_of_json to return proper error on non-string literal input#56582

Open
jubins wants to merge 1 commit into
apache:masterfrom
jubins:j-SPARK-57517-fix-class-cast-exception
Open

[SPARK-57517][SQL] Fix schema_of_json to return proper error on non-string literal input#56582
jubins wants to merge 1 commit into
apache:masterfrom
jubins:j-SPARK-57517-fix-class-cast-exception

Conversation

@jubins

@jubins jubins commented Jun 18, 2026

Copy link
Copy Markdown

What is the purpose of the change

Fixes SPARK-57517schema_of_json throws a ClassCastException during analysis when called with a non-string literal (e.g., SELECT schema_of_json(42)), instead of surfacing a clean DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE error.

The root cause is in SchemaOfJson.checkInputDataTypes(): it references a lazy val json = child.eval().asInstanceOf[UTF8String] before verifying that the child's type is StringType. For an integer literal, the asInstanceOf[UTF8String] cast throws ClassCastException at analysis time rather than producing a user-facing error.

The companion functions schema_of_csv and schema_of_xml were fixed for the same issue in SPARK-52234, but schema_of_json was missed. This PR applies the same fix: restructuring checkInputDataTypes to check !foldableeval() == nulldataType != StringType in safe order, and removing the unsafe lazy val entirely.

Brief change log

  • SchemaOfJson.checkInputDataTypes(): removed the lazy val json that performed an unsafe asInstanceOf[UTF8String] cast; restructured the condition chain to check for non-foldable input, null input, and wrong type (adding a new UNEXPECTED_INPUT_TYPE branch) before delegating to super.checkInputDataTypes()
  • Added select schema_of_json(42) to json-functions.sql input
  • Added corresponding DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE expected entries to analyzer-results/json-functions.sql.out and results/json-functions.sql.out

Verifying this change

This change is covered by golden file SQL query tests in SQLQueryTestSuite:

  • select schema_of_json(42) — verifies that a non-string integer literal produces DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE at analysis time (previously threw ClassCastException)
  • Existing tests for schema_of_json(null) and schema_of_json(nonFoldableColumn) continue to pass, confirming the null and non-foldable branches are unaffected

Does this pull request potentially affect one of the following parts

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no — SchemaOfJson is an internal catalyst expression
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no — only affects the analysis-time type check path
  • Anything that affects deployment or recovery: no
  • The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no

If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?

  • Yes — Claude Code was used as a pair-programming assistant. All code was written, understood, and verified by the author.

Generated-by: Claude Opus 4.8

@MaxGekk

MaxGekk commented Jun 18, 2026

Copy link
Copy Markdown
Member

LGTM overall — the fix is a faithful port of the SPARK-52234 change already in schema_of_csv / schema_of_xml (the condition chain is character-identical to both siblings, removing the unsafe asInstanceOf[UTF8String] cast), and the added schema_of_json(42) test mirrors the csv/xml suites.

One thing to fix before merge: the golden files need to be regenerated. CI (sql - extended tests) fails on select schema_of_json(42) because the hardcoded queryContext offset is off by one — the files have "stopIndex" : 24, but the analyzer emits 25 (the closing ) of schema_of_json(42) is at column 25, and the fragment string already spans 8→25, so 24 is internally inconsistent). This is the usual symptom of hand-editing a golden file instead of regenerating it.

Please regenerate rather than hand-editing:

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z json-functions.sql"

Then double-check the diff — only the two stopIndex values (24 → 25) in results/json-functions.sql.out and analyzer-results/json-functions.sql.out should change.

@jubins

jubins commented Jun 18, 2026

Copy link
Copy Markdown
Author

LGTM overall — the fix is a faithful port of the SPARK-52234 change already in schema_of_csv / schema_of_xml (the condition chain is character-identical to both siblings, removing the unsafe asInstanceOf[UTF8String] cast), and the added schema_of_json(42) test mirrors the csv/xml suites.

One thing to fix before merge: the golden files need to be regenerated. CI (sql - extended tests) fails on select schema_of_json(42) because the hardcoded queryContext offset is off by one — the files have "stopIndex" : 24, but the analyzer emits 25 (the closing ) of schema_of_json(42) is at column 25, and the fragment string already spans 8→25, so 24 is internally inconsistent). This is the usual symptom of hand-editing a golden file instead of regenerating it.

Please regenerate rather than hand-editing:

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z json-functions.sql"

Then double-check the diff — only the two stopIndex values (24 → 25) in results/json-functions.sql.out and analyzer-results/json-functions.sql.out should change.

Thanks for the review! Fixed, updated stopIndex from 24 to 25 in both results/json-functions.sql.out and analyzer-results/json-functions.sql.out. That was the only diff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants