
[SPARK-57515][SQL] Surface MALFORMED_CSV_RECORD instead of ArrayIndexOutOfBoundsException when CSV header exceeds maxColumns #56581

Open
jubins wants to merge 1 commit into apache:master from jubins:j-SPARK-57515-csv-header-maxcolumn


@jubins jubins commented Jun 17, 2026


What is the purpose of the change?

Fixes SPARK-57515. When reading a CSV file with header=true and the header line has more columns than maxColumns
(default 20480, user-configurable via .option("maxColumns", N)), Spark crashes with an internal
java.lang.ArrayIndexOutOfBoundsException instead of a clean MALFORMED_CSV_RECORD error.

SPARK-57195 (merged 2026-06-14) fixed the same ArrayIndexOutOfBoundsException for data rows and
explicitly called out the remaining gap: "Header rows are out of scope from this PR. A header over
maxColumns still surfaces the raw AIOOBE (CSVHeaderChecker), a pre-existing gap." This PR closes
that gap.

The bug affects all three CSV read paths handled by CSVHeaderChecker:

  • Non-multiLine file read: tokenizer.parseLine(header) was called directly, bypassing the
    AIOOBE guard that UnivocityParser.parseLine provides.
  • MultiLine file read: tokenizer.parseNext() during header consumption was unguarded.
  • Dataset[String] csv(): a fresh CsvParser was created and parser.parseLine(line) was
    called directly.

Brief change log

  • CSVHeaderChecker.checkHeaderColumnNames(line: String): replaced parser.parseLine(line) with
    UnivocityParser.parseLine(parser, line) to reuse the existing safe wrapper from SPARK-57195.
  • CSVHeaderChecker.checkHeaderColumnNames(tokenizer): wrapped tokenizer.parseNext() in a
    try/catch that translates ArrayIndexOutOfBoundsException (bare or wrapped in
    TextParsingException) into MALFORMED_CSV_RECORD.
  • CSVHeaderChecker.checkHeaderColumnNames(lines, tokenizer): wrapped tokenizer.parseLine(header)
    in the same try/catch.
  • Added private helper malformedCsvHeaderRecord (mirrors UnivocityParser.malformedCsvRecord)
    with the same bounded-record truncation to MAX_ERROR_CONTENT_LENGTH.
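
The translation described in the change log can be sketched in plain Scala, without Spark or univocity on the classpath. Everything here is a stand-in for illustration: `WrappedParseException` plays the role of univocity's TextParsingException, `MalformedRecordException` the role of the SparkRuntimeException carrying MALFORMED_CSV_RECORD, and the value of `MaxErrorContentLength` is an arbitrary assumption rather than Spark's actual constant.

```scala
// Stand-ins for illustration only; none of these are Spark or univocity classes.
final class WrappedParseException(msg: String, cause: Throwable)
  extends RuntimeException(msg, cause)

final class MalformedRecordException(val badRecord: String, cause: Throwable)
  extends RuntimeException(s"MALFORMED_CSV_RECORD: $badRecord", cause)

object HeaderGuardSketch {
  // Assumed bound; the real MAX_ERROR_CONTENT_LENGTH value is not stated in this PR.
  val MaxErrorContentLength = 1000

  // Builds the user-facing error, truncating the record so the message stays bounded.
  def malformedHeader(cause: Throwable, badRecord: String): MalformedRecordException =
    new MalformedRecordException(badRecord.take(MaxErrorContentLength), cause)

  // Runs `parse`, translating an AIOOBE (bare or wrapped) into the malformed-record error.
  // Any other exception, including a wrapper with a different cause, propagates unchanged.
  def guarded[T](badRecord: => String)(parse: => T): T =
    try parse
    catch {
      case e: ArrayIndexOutOfBoundsException =>
        throw malformedHeader(e, badRecord)
      case e: WrappedParseException
          if e.getCause.isInstanceOf[ArrayIndexOutOfBoundsException] =>
        throw malformedHeader(e, badRecord)
    }

  // Toy tokenizer: writes each column into a fixed-size buffer, mimicking a maxColumns limit.
  def parseLine(line: String, maxColumns: Int): Array[String] = {
    val out = new Array[String](maxColumns)
    line.split(",", -1).zipWithIndex.foreach { case (tok, i) => out(i) = tok }
    out
  }
}
```

With this sketch, `HeaderGuardSketch.guarded("a,b,c")(HeaderGuardSketch.parseLine("a,b,c", 2))` throws `MalformedRecordException` rather than the raw `ArrayIndexOutOfBoundsException`, while a two-column header parses normally.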

Verifying this change

Three new tests added to CSVSuite, one per affected path:

  • SPARK-57515: non-multiLine CSV read with header exceeding maxColumns surfaces MALFORMED_CSV_RECORD:
    writes a 3-column CSV with maxColumns=2 and asserts MALFORMED_CSV_RECORD instead of AIOOBE.
  • SPARK-57515: multiLine CSV read with header exceeding maxColumns surfaces MALFORMED_CSV_RECORD:
    same, with multiLine=true.
  • SPARK-57515: Dataset[String] CSV read with header exceeding maxColumns surfaces MALFORMED_CSV_RECORD:
    uses the spark.createDataset path and asserts that the header line appears in the error message.

Does this PR potentially affect one of the following areas?

  • Dependencies: no
  • Public API: no — CSVHeaderChecker is internal
  • Serializers: no
  • Runtime per-record code paths (performance): no — only the header-parsing path, which runs once
    per file
  • Deployment or recovery: no
  • S3 connector: no

Documentation

This PR does not introduce a new feature. No documentation changes needed.

Was generative AI tooling used to co-author this PR?

  • Yes — Claude Code was used as a pair-programming assistant. All code was written, understood, and
    verified by the author.

Generated-by: Claude Opus 4.8

…OutOfBoundsException when CSV header exceeds maxColumns

@MaxGekk MaxGekk left a comment


0 blocking, 2 non-blocking, 0 nits.
A clean, correct fix that faithfully mirrors the SPARK-57195 wrapper; both findings are optional polish.

Design / architecture (1)

  • CSVHeaderChecker.scala:175: malformedCsvHeaderRecord duplicates UnivocityParser.malformedCsvRecord verbatim (same package) — see inline

Suggestions (1)

  • CSVSuite.scala:3609: non-multiLine test under-asserts badRecord (.* where the header is deterministic) — see inline

Verification

Traced the univocity exception state-space across all three header-parse sites against the established UnivocityParser wrapper: TextParsingException whose cause is ArrayIndexOutOfBoundsException, and a bare ArrayIndexOutOfBoundsException, both translate to MALFORMED_CSV_RECORD identically to the data-row path; a TextParsingException with a null or non-AIOOBE cause still propagates unchanged (null-safe — getCause.isInstanceOf is false on null); badRecord semantics match the analogue per path. The three checkHeaderColumnNames overloads are the only header-parse entry points (callers in DataFrameReader and UnivocityParser), so the fix is complete.
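
The null-safety point in the verification note rests on a Scala language guarantee: `isInstanceOf` applied to a null reference evaluates to false instead of throwing. A minimal sketch of that check follows; `CauseCheckSketch` and `causedByAioobe` are hypothetical names, not code from this PR.

```scala
object CauseCheckSketch {
  // True only when the throwable's cause is an ArrayIndexOutOfBoundsException.
  // When getCause returns null, isInstanceOf yields false, so no null check is needed.
  def causedByAioobe(t: Throwable): Boolean =
    t.getCause.isInstanceOf[ArrayIndexOutOfBoundsException]
}
```

An exception with no cause and one with an unrelated cause both return false here, which corresponds to the "still propagates unchanged" cases described above.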

}

// scalastyle:off line.size.limit
private def malformedCsvHeaderRecord(cause: Throwable, badRecord: String): SparkRuntimeException = {


malformedCsvHeaderRecord is a verbatim copy of UnivocityParser.malformedCsvRecord (UnivocityParser.scala:657-667) — same error class, same MAX_ERROR_CONTENT_LENGTH truncation, same SparkRuntimeException shape. Both classes are in the same package (org.apache.spark.sql.catalyst.csv), and the analogue is only object-private. Consider making it private[csv] and calling it from the three catch sites here, dropping this copy. That also removes a future-drift risk: a later change to MALFORMED_CSV_RECORD construction applied to one copy would silently diverge header vs. row malformed-record messages.

Optional deeper cleanup: both tokenizer callers pass a CsvParser (UnivocityParser.scala:600, 693), so narrowing the two private[csv] overloads' parameter type from AbstractParser[CsvParserSettings] to CsvParser would let the non-multiLine path reuse UnivocityParser.parseLine wholesale and drop its try/catch too.
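
The visibility change suggested here can be illustrated with a small sketch. A plain object named `csv` stands in for the org.apache.spark.sql.catalyst.csv package, and the object and method names are hypothetical; the real helper builds a SparkRuntimeException, which is replaced by IllegalArgumentException to keep the example dependency-free.

```scala
// `csv` is an ordinary object standing in for the real package.
object csv {
  object RowParserSketch {
    // Visible anywhere inside `csv`, hidden from code outside it.
    private[csv] def malformedRecord(badRecord: String): IllegalArgumentException =
      new IllegalArgumentException(s"MALFORMED_CSV_RECORD: $badRecord")
  }

  object HeaderCheckerSketch {
    // Reuses the row parser's helper instead of keeping a private copy,
    // so header and row messages cannot drift apart.
    def headerError(header: String): IllegalArgumentException =
      RowParserSketch.malformedRecord(header)
  }
}
```

Because there is a single construction site, any later change to the error message applies to both the header path and the row path.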

Comment on lines +3609 to +3610
parameters = Map("badRecord" -> ".*"),
matchPVals = true)


This path passes the header verbatim as badRecord (CSVHeaderChecker.scala:164), so the exact value is deterministic and assertable here — as the Dataset[String] test below already does. .* passes regardless of the actual content, so it wouldn't catch a regression that put wrong/empty content in the error. (The multiLine test's .* is justified — its badRecord comes from getParsedContent, which isn't deterministic.)

Suggested change
- parameters = Map("badRecord" -> ".*"),
- matchPVals = true)
+ parameters = Map("badRecord" -> "a,b,c"),
+ matchPVals = false)
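
Why `.*` under-asserts can be seen with plain string matching. The sketch below is only an analogy for pattern-based versus literal comparison of a parameter value; it is not the test helper's implementation, and `BadRecordAssertionSketch` is a hypothetical name.

```scala
object BadRecordAssertionSketch {
  // Pattern check: the expected value is treated as a regular expression.
  def matchesPattern(actual: String, pattern: String): Boolean =
    actual.matches(pattern)

  // Literal check: the expected value must equal the actual value exactly.
  def matchesExactly(actual: String, expected: String): Boolean =
    actual == expected
}
```

With the pattern `.*`, both an empty `badRecord` and an unrelated single-line value pass `matchesPattern`, whereas `matchesExactly` accepts only `"a,b,c"`. A regression that put wrong or empty content in the error would therefore be caught only by the literal form.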
