[SPARK-57515][SQL] Surface MALFORMED_CSV_RECORD instead of ArrayIndexOutOfBoundsException when CSV header exceeds maxColumns #56581
MaxGekk
left a comment
0 blocking, 2 non-blocking, 0 nits.
A clean, correct fix that faithfully mirrors the SPARK-57195 wrapper; both findings are optional polish.
Design / architecture (1)
- CSVHeaderChecker.scala:175: malformedCsvHeaderRecord duplicates UnivocityParser.malformedCsvRecord verbatim (same package); see inline
Suggestions (1)
- CSVSuite.scala:3609: non-multiLine test under-asserts badRecord (.* where the header is deterministic); see inline
Verification
Traced the univocity exception state-space across all three header-parse sites against the established UnivocityParser wrapper: TextParsingException whose cause is ArrayIndexOutOfBoundsException, and a bare ArrayIndexOutOfBoundsException, both translate to MALFORMED_CSV_RECORD identically to the data-row path; a TextParsingException with a null or non-AIOOBE cause still propagates unchanged (null-safe — getCause.isInstanceOf is false on null); badRecord semantics match the analogue per path. The three checkHeaderColumnNames overloads are the only header-parse entry points (callers in DataFrameReader and UnivocityParser), so the fix is complete.
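The translation rule traced above can be sketched in isolation. This is a minimal illustration, not the PR's code: TextParsingException and MalformedCsvRecord below are hypothetical stand-ins for univocity's exception and for the SparkRuntimeException carrying MALFORMED_CSV_RECORD, and HeaderParseGuard.guarded is an invented name.

```scala
// Hypothetical stand-in for com.univocity.parsers.common.TextParsingException.
final class TextParsingException(msg: String, cause: Throwable)
  extends RuntimeException(msg, cause)

// Hypothetical stand-in for the SparkRuntimeException raised with MALFORMED_CSV_RECORD.
final class MalformedCsvRecord(val badRecord: String, cause: Throwable)
  extends RuntimeException(s"[MALFORMED_CSV_RECORD] Malformed CSV record: $badRecord", cause)

object HeaderParseGuard {
  // Runs `parse`, translating only the two AIOOBE shapes; anything else propagates unchanged.
  def guarded[T](badRecord: => String)(parse: => T): T = {
    try parse
    catch {
      // Shape 1: a bare ArrayIndexOutOfBoundsException.
      case e: ArrayIndexOutOfBoundsException =>
        throw new MalformedCsvRecord(badRecord, e)
      // Shape 2: a TextParsingException caused by an AIOOBE.
      // isInstanceOf on a null cause evaluates to false, so no null check is needed.
      case e: TextParsingException if e.getCause.isInstanceOf[ArrayIndexOutOfBoundsException] =>
        throw new MalformedCsvRecord(badRecord, e)
    }
  }
}
```

A TextParsingException with a null or unrelated cause matches neither case, so it is rethrown as-is, which is the null-safety property noted in the verification.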
  }

  // scalastyle:off line.size.limit
  private def malformedCsvHeaderRecord(cause: Throwable, badRecord: String): SparkRuntimeException = {
malformedCsvHeaderRecord is a verbatim copy of UnivocityParser.malformedCsvRecord (UnivocityParser.scala:657-667) — same error class, same MAX_ERROR_CONTENT_LENGTH truncation, same SparkRuntimeException shape. Both classes are in the same package (org.apache.spark.sql.catalyst.csv), and the analogue is only object-private. Consider making it private[csv] and calling it from the three catch sites here, dropping this copy. That also removes a future-drift risk: a later change to MALFORMED_CSV_RECORD construction applied to one copy would silently diverge header vs. row malformed-record messages.
Optional deeper cleanup: both tokenizer callers pass a CsvParser (UnivocityParser.scala:600, 693), so narrowing the two private[csv] overloads' parameter type from AbstractParser[CsvParserSettings] to CsvParser would let the non-multiLine path reuse UnivocityParser.parseLine wholesale and drop its try/catch too.
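The drift risk can be illustrated with a small sketch in which both paths build the error through one helper. Everything here is assumed for illustration: MalformedCsvRecordError stands in for SparkRuntimeException, SharedCsvErrors, HeaderPath and RowPath are invented names, and the bound of 16 is arbitrary, since the real MAX_ERROR_CONTENT_LENGTH value and exact truncation shape are not shown in this PR.

```scala
// Hypothetical stand-in for SparkRuntimeException with error class MALFORMED_CSV_RECORD.
final class MalformedCsvRecordError(val badRecord: String, cause: Throwable)
  extends RuntimeException(s"[MALFORMED_CSV_RECORD] $badRecord", cause)

object SharedCsvErrors {
  // Assumed bound, for illustration only.
  val MaxErrorContentLength: Int = 16

  // Single construction site: truncation and message shape live in one place.
  def malformedCsvRecord(cause: Throwable, badRecord: String): MalformedCsvRecordError =
    new MalformedCsvRecordError(badRecord.take(MaxErrorContentLength), cause)
}

object HeaderPath {
  // The header path delegates instead of keeping its own copy of the helper.
  def fail(cause: Throwable, header: String): Nothing =
    throw SharedCsvErrors.malformedCsvRecord(cause, header)
}

object RowPath {
  // The data-row path uses the same helper, so the two messages cannot diverge.
  def fail(cause: Throwable, row: String): Nothing =
    throw SharedCsvErrors.malformedCsvRecord(cause, row)
}
```

With a single helper, a later change to the truncation or the message format reaches header and row errors at the same time.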
    parameters = Map("badRecord" -> ".*"),
    matchPVals = true)
This path passes the header verbatim as badRecord (CSVHeaderChecker.scala:164), so the exact value is deterministic and assertable here — as the Dataset[String] test below already does. .* passes regardless of the actual content, so it wouldn't catch a regression that put wrong/empty content in the error. (The multiLine test's .* is justified — its badRecord comes from getParsedContent, which isn't deterministic.)
Suggested change:
- parameters = Map("badRecord" -> ".*"),
- matchPVals = true)
+ parameters = Map("badRecord" -> "a,b,c"),
+ matchPVals = false)
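Why the .* pattern under-asserts can be shown with a small stand-alone sketch. ParamCheck.paramMatches is a hypothetical simplification of the parameter comparison, assuming that matchPVals = true treats the expected value as a regex and matchPVals = false compares it exactly; it is not the test framework's actual code.

```scala
object ParamCheck {
  // Hypothetical simplification: regex match when matchPVals is true, exact equality otherwise.
  def paramMatches(expected: String, actual: String, matchPVals: Boolean): Boolean =
    if (matchPVals) actual.matches(expected) else actual == expected
}

object ParamCheckDemo {
  // ".*" accepts an empty badRecord, so a regression that drops the content goes unnoticed.
  val regexAcceptsEmpty: Boolean = ParamCheck.paramMatches(".*", "", matchPVals = true)
  // ".*" also accepts the wrong content.
  val regexAcceptsWrong: Boolean = ParamCheck.paramMatches(".*", "x,y", matchPVals = true)
  // The exact comparison rejects both and accepts only the real header.
  val exactRejectsEmpty: Boolean = !ParamCheck.paramMatches("a,b,c", "", matchPVals = false)
  val exactAcceptsHeader: Boolean = ParamCheck.paramMatches("a,b,c", "a,b,c", matchPVals = false)
}
```

All four values in ParamCheckDemo evaluate to true, which is the asymmetry the suggestion relies on: only the exact comparison fails when the error content is wrong.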
What is the purpose of the change?
Fixes SPARK-57515. When a CSV file is read with header=true and the header line has more columns than maxColumns (default 20480, user-configurable via .option("maxColumns", N)), Spark crashes with an internal java.lang.ArrayIndexOutOfBoundsException instead of a clean MALFORMED_CSV_RECORD error.
SPARK-57195 (merged 2026-06-14) fixed the same ArrayIndexOutOfBoundsException for data rows and explicitly called out the remaining gap: "Header rows are out of scope from this PR. A header over maxColumns still surfaces the raw AIOOBE (CSVHeaderChecker), a pre-existing gap." This PR closes that gap.
The bug affects all three CSV read paths handled by CSVHeaderChecker:
- Non-multiLine file read: tokenizer.parseLine(header) was called directly, bypassing the AIOOBE guard that UnivocityParser.parseLine provides.
- multiLine file read: tokenizer.parseNext() during header consumption was unguarded.
- Dataset[String] csv(): a fresh CsvParser was created and parser.parseLine(line) was called directly.
Brief change log
- CSVHeaderChecker.checkHeaderColumnNames(line: String): replaced parser.parseLine(line) with UnivocityParser.parseLine(parser, line) to reuse the existing safe wrapper from SPARK-57195.
- CSVHeaderChecker.checkHeaderColumnNames(tokenizer): wrapped tokenizer.parseNext() in a try/catch that translates ArrayIndexOutOfBoundsException (bare or wrapped in TextParsingException) into MALFORMED_CSV_RECORD.
- CSVHeaderChecker.checkHeaderColumnNames(lines, tokenizer): wrapped tokenizer.parseLine(header) in the same try/catch.
- Added malformedCsvHeaderRecord (mirrors UnivocityParser.malformedCsvRecord) with the same bounded-record truncation to MAX_ERROR_CONTENT_LENGTH.
Verifying this change
Three new tests added to CSVSuite, one per affected path:
- MALFORMED_CSV_RECORD: writes a 3-column CSV with maxColumns=2, asserts MALFORMED_CSV_RECORD instead of AIOOBE.
- MALFORMED_CSV_RECORD: same with multiLine=true.
- MALFORMED_CSV_RECORD: uses spark.createDataset path, asserts the header line appears in the error message.
Does this PR potentially affect one of the following areas?
- CSVHeaderChecker is internal
- per file
Documentation
This PR does not introduce a new feature. No documentation changes needed.
Was generative AI tooling used to co-author this PR?
verified by the author.
Generated-by: Claude Opus 4.8