Clear delimiter buffer before each peek in isDelimiter #611

Merged
garydgregory merged 2 commits into apache:master from rootvector2:delimiter-buffer-clear
Jun 17, 2026

@rootvector2

Contributor

isDelimiter peeks the next characters into the reused delimiterBuf look-ahead buffer without clearing it first. A non-matching peek earlier in the same token therefore leaves stale chars in the trailing positions, and a truncated multi-character delimiter at EOF can false-match against them.

For example, with delimiter [|], the input x[a][| is split into x[a] and an empty field: the earlier [a] peek left ] in the buffer, and the trailing [| (only two of the three delimiter chars are present before EOF) matches against that stale ].

CSV-324 cleared delimiterBuf once per nextToken, but isDelimiter peeks repeatedly within a token, so the reset has to happen before each peek, the same way isEscapeDelimiter already does it. The once-per-token clear is then redundant and is dropped.

Found while auditing the multi-character delimiter look-ahead after CSV-324.
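
The stale-buffer failure described above can be modeled in isolation. The following is a minimal sketch, not the actual Commons CSV Lexer code: the class StaleLookAheadDemo, its methods, and the clearFirst flag are hypothetical names, and a plain String stands in for the buffered reader. It assumes that a short peek at EOF overwrites only the leading slots of the look-ahead buffer.

```java
import java.util.Arrays;

public class StaleLookAheadDemo {

    /**
     * Copies up to buf.length chars of input, starting at pos, into buf.
     * On a short read near EOF the trailing slots of buf are left untouched.
     */
    static int peek(String input, int pos, char[] buf) {
        int n = Math.min(buf.length, Math.max(0, input.length() - pos));
        input.getChars(pos, pos + n, buf, 0);
        return n;
    }

    /**
     * Reports whether the delimiter starts at input[pos], using the reused
     * look-ahead buffer. clearFirst models the fix: reset before each peek.
     */
    static boolean isDelimiter(String input, int pos, char[] delimiter,
                               char[] buf, boolean clearFirst) {
        if (input.charAt(pos) != delimiter[0]) {
            return false;
        }
        if (clearFirst) {
            Arrays.fill(buf, '\0');
        }
        peek(input, pos + 1, buf);
        for (int i = 0; i < buf.length; i++) {
            if (buf[i] != delimiter[i + 1]) {
                return false;
            }
        }
        return true;
    }

    /** Counts delimiter matches while scanning input with one shared buffer. */
    static int countDelimiters(String input, String delimiter, boolean clearFirst) {
        char[] delim = delimiter.toCharArray();
        char[] buf = new char[delim.length - 1];
        int count = 0;
        for (int pos = 0; pos < input.length(); pos++) {
            if (isDelimiter(input, pos, delim, buf, clearFirst)) {
                count++;
                pos += delim.length - 1; // skip the rest of the matched delimiter
            }
        }
        return count;
    }
}
```

Under this model, countDelimiters("x[a][|", "[|]", false) reports one delimiter (the false match against the stale ]), while passing true for clearFirst reports none. A complete delimiter, as in x[|]y, is found either way.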

  • Read the contribution guidelines for this project.
  • Read the ASF Generative Tooling Guidance if you use Artificial Intelligence (AI).
  • I used AI to create any part of, or all of, this pull request. Which AI tool was used to create this pull request, and to what extent did it contribute?
  • Run a successful build using the default Maven goal with mvn; that's mvn on the command line by itself.
  • Write unit tests that match behavioral changes, where the tests fail if the changes to the runtime are not applied. This may not always be possible, but it is a best practice.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Each commit in the pull request should have a meaningful subject line and body.

@garydgregory garydgregory changed the title clear delimiter buffer before each peek in isDelimiter Clear delimiter buffer before each peek in isDelimiter Jun 17, 2026

@garydgregory garydgregory left a comment

Member

Hello @rootvector2

Thank you for your PR.

Would you please add a test that shows why this is an issue only using the public parsing API?

@rootvector2

Contributor Author

Added testPartialMultiCharacterDelimiterAtEOFAfterMismatch in CSVParserTest that drives this through the public parsing API. Parsing x[a][| with delimiter [|]: without the fix it splits into x[a] and an empty field, with the fix it stays a single field x[a][|. The test fails on master and passes with the patch.

void testPartialMultiCharacterDelimiterAtEOFAfterMismatch() throws IOException {
    final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter("[|]").get();
    // The "[a]" peek leaves ']' in the look-ahead buffer; the trailing "[|" must not match "[|]".
    try (CSVParser parser = format.parse(new StringReader("x[a][|"))) {

Member

@rootvector2

Thank you for your update.

In the future, please make magic strings like "x[a][|" a local variable such that you never have to check that the assertion below is checking for a possibly different value.
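
The suggestion above might look like the following minimal sketch. It is self-contained and does not use Commons CSV: LocalInputDemo and splitFields are hypothetical names, and a literal String.split stands in for the parser purely to show the input being named once and reused in the assertion.

```java
import java.util.regex.Pattern;

public class LocalInputDemo {

    /** Hypothetical stand-in for the parser: splits on a literal delimiter. */
    static String[] splitFields(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        final String delimiter = "[|]";
        // Named once, so the check below cannot drift from the parsed value.
        final String input = "x[a][|";
        final String[] fields = splitFields(input, delimiter);
        if (fields.length != 1 || !input.equals(fields[0])) {
            throw new AssertionError("expected a single field equal to the input");
        }
    }
}
```

Because the expected value is the same local that was parsed, a later change to the input string cannot leave the assertion checking a different literal.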

void testPartialMultiCharacterDelimiterAtEOFAfterMismatch() throws IOException {
    final CSVFormat format = CSVFormat.DEFAULT.builder().setDelimiter("[|]").get();
    // The "[a]" peek leaves ']' in the look-ahead buffer; the trailing "[|" must not match "[|]".
    try (Lexer lexer = createLexer("x[a][|", format)) {

Member

@rootvector2

Thank you for your update.

In the future, please make magic strings like "x[a][|" a local variable such that you never have to check that the assertion below is checking for a possibly different value.

@garydgregory garydgregory merged commit f685de6 into apache:master Jun 17, 2026
16 checks passed
garydgregory added a commit that referenced this pull request Jun 17, 2026
@garydgregory

Member

@rootvector2
Please see my comments for future PRs. Merged and thank you! 🚀

@rootvector2

Contributor Author

makes sense, will pull magic strings like that into locals going forward. thanks for the merge.
