[SPARK-57515][SQL] Surface MALFORMED_CSV_RECORD instead of ArrayIndexOutOfBoundsException when CSV header exceeds maxColumns by jubins · Pull Request #56581 · apache/spark

jubins · 2026-06-17T22:58:43Z

What is the purpose of the change?

Fixes SPARK-57515. When a CSV file is read with header=true and the header line has more columns than maxColumns
(default 20480, user-configurable via .option("maxColumns", N)), Spark crashes with an internal
java.lang.ArrayIndexOutOfBoundsException instead of a clean MALFORMED_CSV_RECORD error.

SPARK-57195 (merged 2026-06-14) fixed the same ArrayIndexOutOfBoundsException for data rows and
explicitly called out the remaining gap: "Header rows are out of scope from this PR. A header over
maxColumns still surfaces the raw AIOOBE (CSVHeaderChecker), a pre-existing gap." This PR
closes that gap.

The bug affects all three CSV read paths handled by CSVHeaderChecker:

Non-multiLine file read: tokenizer.parseLine(header) was called directly, bypassing the AIOOBE guard that UnivocityParser.parseLine provides.
MultiLine file read: tokenizer.parseNext() during header consumption was unguarded.
Dataset[String] csv(): a fresh CsvParser was created and parser.parseLine(line) was called directly.

Brief change log

CSVHeaderChecker.checkHeaderColumnNames(line: String): replaced parser.parseLine(line) with UnivocityParser.parseLine(parser, line) to reuse the existing safe wrapper from SPARK-57195.
CSVHeaderChecker.checkHeaderColumnNames(tokenizer): wrapped tokenizer.parseNext() in a try/catch that translates ArrayIndexOutOfBoundsException (bare or wrapped in TextParsingException) into MALFORMED_CSV_RECORD.
CSVHeaderChecker.checkHeaderColumnNames(lines, tokenizer): wrapped tokenizer.parseLine(header) in the same try/catch.
Added a private helper, malformedCsvHeaderRecord (mirrors UnivocityParser.malformedCsvRecord), with the same bounded-record truncation to MAX_ERROR_CONTENT_LENGTH.
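
As a rough illustration of the translation described in the change log, the sketch below wraps a parse call so that an ArrayIndexOutOfBoundsException, bare or wrapped, becomes a bounded malformed-record error. It is not the PR's code: MalformedCsvRecord, WrappedParsingFailure, HeaderGuard, and the limit of 8 are hypothetical stand-ins for SparkRuntimeException, univocity's TextParsingException, the new helper, and MAX_ERROR_CONTENT_LENGTH, chosen so the example runs without Spark on the classpath.

```scala
// Stand-ins so the sketch compiles without Spark or univocity on the classpath.
// Hypothetical: the real types are SparkRuntimeException and univocity's TextParsingException.
final class MalformedCsvRecord(val badRecord: String, cause: Throwable)
  extends RuntimeException(s"MALFORMED_CSV_RECORD: $badRecord", cause)

final class WrappedParsingFailure(cause: Throwable)
  extends RuntimeException("parsing failed", cause)

object HeaderGuard {
  // Hypothetical limit; the PR does not state the real MAX_ERROR_CONTENT_LENGTH value.
  val MaxErrorContentLength = 8

  // Bound the echoed record so a huge header cannot bloat the error message.
  private def bounded(record: String): String = record.take(MaxErrorContentLength)

  // Run `parse`; translate an AIOOBE (bare or wrapped) and let everything else propagate.
  def guarded[T](badRecord: => String)(parse: => T): T =
    try parse
    catch {
      case e: ArrayIndexOutOfBoundsException =>
        throw new MalformedCsvRecord(bounded(badRecord), e)
      case e: WrappedParsingFailure if e.getCause.isInstanceOf[ArrayIndexOutOfBoundsException] =>
        throw new MalformedCsvRecord(bounded(badRecord), e)
    }
}
```

A call such as HeaderGuard.guarded(headerLine) { parseHeader(headerLine) }, where parseHeader is any parse function, would then surface MalformedCsvRecord for an over-wide header while leaving unrelated parser failures untouched.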

Verifying this change

Three new tests added to CSVSuite, one per affected path:

"SPARK-57515: non-multiLine CSV read with header exceeding maxColumns surfaces MALFORMED_CSV_RECORD": writes a 3-column CSV with maxColumns=2 and asserts MALFORMED_CSV_RECORD instead of AIOOBE.
"SPARK-57515: multiLine CSV read with header exceeding maxColumns surfaces MALFORMED_CSV_RECORD": same with multiLine=true.
"SPARK-57515: Dataset[String] CSV read with header exceeding maxColumns surfaces MALFORMED_CSV_RECORD": uses the spark.createDataset path and asserts that the header line appears in the error message.

Does this PR potentially affect one of the following areas?

Dependencies: no
Public API: no — CSVHeaderChecker is internal
Serializers: no
Runtime per-record code paths (performance): no; only the header-parsing path, which runs once per file
Deployment or recovery: no
S3 connector: no

Documentation

This PR does not introduce a new feature. No documentation changes needed.

Was generative AI tooling used to co-author this PR?

Yes — Claude Code was used as a pair-programming assistant. All code was written, understood, and
verified by the author.

Generated-by: Claude Opus 4.8


MaxGekk

0 blocking, 2 non-blocking, 0 nits.
A clean, correct fix that faithfully mirrors the SPARK-57195 wrapper; both findings are optional polish.

Design / architecture (1)

CSVHeaderChecker.scala:175: malformedCsvHeaderRecord duplicates UnivocityParser.malformedCsvRecord verbatim (same package) — see inline

Suggestions (1)

CSVSuite.scala:3609: non-multiLine test under-asserts badRecord (.* where the header is deterministic) — see inline

Verification

Traced the univocity exception state-space across all three header-parse sites against the established UnivocityParser wrapper: TextParsingException whose cause is ArrayIndexOutOfBoundsException, and a bare ArrayIndexOutOfBoundsException, both translate to MALFORMED_CSV_RECORD identically to the data-row path; a TextParsingException with a null or non-AIOOBE cause still propagates unchanged (null-safe — getCause.isInstanceOf is false on null); badRecord semantics match the analogue per path. The three checkHeaderColumnNames overloads are the only header-parse entry points (callers in DataFrameReader and UnivocityParser), so the fix is complete.
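
The null-safety point above rests on a general Scala property: isInstanceOf evaluates to false for a null reference instead of throwing. A minimal sketch, using plain RuntimeException as a stand-in for TextParsingException (the object and value names are illustrative only):

```scala
object NullSafeCauseCheck {
  // Same shape as the guard discussed above: test the cause, not the exception itself.
  def causedByAioobe(e: Throwable): Boolean =
    e.getCause.isInstanceOf[ArrayIndexOutOfBoundsException]

  // getCause returns null here, and null.isInstanceOf[...] is false rather than an NPE.
  val withNullCause = new RuntimeException("no cause")

  // The case that should be translated into MALFORMED_CSV_RECORD.
  val withAioobeCause = new RuntimeException("wrapped", new ArrayIndexOutOfBoundsException("2"))

  // Any other cause should keep propagating unchanged.
  val withOtherCause = new RuntimeException("wrapped", new IllegalStateException("other"))
}
```

Only withAioobeCause satisfies the check, so a guard written this way needs no separate null test on the cause.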

MaxGekk · 2026-06-18T09:53:03Z

  }
+
+  // scalastyle:off line.size.limit
+  private def malformedCsvHeaderRecord(cause: Throwable, badRecord: String): SparkRuntimeException = {


malformedCsvHeaderRecord is a verbatim copy of UnivocityParser.malformedCsvRecord (UnivocityParser.scala:657-667) — same error class, same MAX_ERROR_CONTENT_LENGTH truncation, same SparkRuntimeException shape. Both classes are in the same package (org.apache.spark.sql.catalyst.csv), and the analogue is only object-private. Consider making it private[csv] and calling it from the three catch sites here, dropping this copy. That also removes a future-drift risk: a later change to MALFORMED_CSV_RECORD construction applied to one copy would silently diverge header vs. row malformed-record messages.

Optional deeper cleanup: both tokenizer callers pass a CsvParser (UnivocityParser.scala:600, 693), so narrowing the two private[csv] overloads' parameter type from AbstractParser[CsvParserSettings] to CsvParser would let the non-multiLine path reuse UnivocityParser.parseLine wholesale and drop its try/catch too.

MaxGekk · 2026-06-18T09:53:03Z

+        parameters = Map("badRecord" -> ".*"),
+        matchPVals = true)


This path passes the header verbatim as badRecord (CSVHeaderChecker.scala:164), so the exact value is deterministic and assertable here — as the Dataset[String] test below already does. .* passes regardless of the actual content, so it wouldn't catch a regression that put wrong/empty content in the error. (The multiLine test's .* is justified — its badRecord comes from getParsedContent, which isn't deterministic.)

Suggested change

-        parameters = Map("badRecord" -> ".*"),
-        matchPVals = true)
+        parameters = Map("badRecord" -> "a,b,c"),
+        matchPVals = false)
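
To make the under-assertion concrete, here is a small sketch of the two comparison styles. It assumes that matchPVals = true compares each expected parameter value as a regular expression against the actual one, as the comment above implies; the helper names are hypothetical and are not part of Spark's test utilities.

```scala
object BadRecordAssertion {
  // Pattern-style comparison: the expected value is treated as a regular expression.
  def matchesAsPattern(expected: String, actual: String): Boolean =
    actual.matches(expected)

  // Exact comparison: the expected value must equal the actual value.
  def matchesExactly(expected: String, actual: String): Boolean =
    expected == actual
}
```

With the pattern ".*", matchesAsPattern accepts both "a,b,c" and the empty string, so a regression that dropped the header from the error would go unnoticed. matchesExactly("a,b,c", actual) accepts only the real header.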

Commit 8f017c6: [SPARK-57515][SQL] Surface MALFORMED_CSV_RECORD instead of ArrayIndexOutOfBoundsException when CSV header exceeds maxColumns

jubins wants to merge 1 commit into apache:master from jubins:j-SPARK-57515-csv-header-maxcolumn
