OPENNLP-1846 - Recognize all entity types in NameFinderDL, not only p… #1086

Open
krickert wants to merge 1 commit into apache:main from ai-pipestream:OPENNLP-1846

Conversation

krickert (Contributor) commented Jun 14, 2026

…erson

NameFinderDL only decoded B-PER/I-PER and ignored every other label the model emitted, and it put the matched text in Span.getType() instead of the entity label. Decode the BIO sequence generically:

  • any B- begins a span whose type is the label minus the B- prefix (B-ORG -> ORG), and the span extends while the following labels are I- (findSpanEnd generalized from I-PER to I-);
  • Span.getType() now reports the entity label (PER, ORG, LOC, ...) rather than the covered text.

The ids2Labels map now fully drives recognition for any BIO-tagged token classification model. The B_PER/I_PER constants are retained for reference.
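
As a rough illustration of the generic BIO decoding described above, here is a minimal, self-contained sketch. It is not the code from this PR: the class `BioDecodeSketch`, the record `TokenSpan`, and the method `decode` are hypothetical names, and the check that a `B-` label carries a non-empty type is an added defensive assumption.

```java
import java.util.ArrayList;
import java.util.List;

class BioDecodeSketch {

  // Hypothetical stand-in for a typed span over token indices [start, end).
  record TokenSpan(int start, int end, String type) {}

  // Any "B-<TYPE>" opens a span whose type is the label minus the "B-" prefix.
  // The span then extends while the following labels start with "I-".
  static List<TokenSpan> decode(String[] labels) {
    List<TokenSpan> spans = new ArrayList<>();
    int i = 0;
    while (i < labels.length) {
      String label = labels[i];
      // Assumption: a bare "B-" with no type is ignored rather than decoded.
      if (label.startsWith("B-") && label.length() > 2) {
        String type = label.substring(2);
        int end = i + 1;
        while (end < labels.length && labels[end].startsWith("I-")) {
          end++;
        }
        spans.add(new TokenSpan(i, end, type));
        i = end;
      } else {
        i++;
      }
    }
    return spans;
  }

  public static void main(String[] args) {
    String[] labels = {"B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "B-ORG"};
    for (TokenSpan span : decode(labels)) {
      System.out.println(span);
    }
  }
}
```

The sketch works on token indices only; mapping those indices back to character offsets in the source text is a separate step that it does not attempt.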

Update NameFinderDLEval to assert span types (PER/LOC) and covered text, including the additional location entity that the person-only decoder previously dropped ("United States" in the George Washington sentence).

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

Copilot AI left a comment


Pull request overview

This PR fixes NameFinderDL’s BIO decoding so it recognizes all entity types emitted by a BIO-tagged token-classification ONNX model (not just PER), and ensures Span.getType() contains the entity label (e.g., PER, LOC) rather than the matched text. It also updates the DL eval tests to assert entity types and to cover the previously dropped location entity.

Changes:

  • Generalize BIO span decoding in NameFinderDL from B-PER/I-PER to B-<TYPE>/I-<TYPE> and set Span type to <TYPE>.
  • Update NameFinderDLEval assertions to validate span types and covered text, including an added LOC entity case.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java | Generalizes BIO decoding and changes Span type to be the entity label. |
| opennlp-eval-tests/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java | Updates tests to assert entity labels in Span.getType() and checks an additional LOC span. |

Comment thread opennlp-eval-tests/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java Outdated
@krickert (Contributor, Author)

I'll wait for #1085 and #1084 before I move this out of draft.

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

…decoding

NameFinderDL only decoded B-PER/I-PER and put the matched text in
Span.getType() instead of the entity label. Decode the BIO sequence
generically and harden it:

- Any B-<TYPE> begins a span whose type is the label minus the B- prefix
  (B-ORG -> ORG), extending while the following labels are I-<same type>.
  Span.getType() now reports the entity label (PER, ORG, LOC, ...) and
  ids2Labels fully drives recognition for any BIO-tagged model.
- isBeginLabel() requires a non-empty type after "B-", so a malformed "B-"
  label no longer starts an empty-type span. An argmax index with no entry
  in ids2Labels fails loudly instead of being silently skipped.
- Span.getProb() is now a numerically stable softmax over the token's label
  scores (bounded to [0,1]) instead of the raw max logit; handles +Inf,
  all-(-Inf) and NaN edge cases.
- find() inference is fail-loud and consistent with the sibling
  DocumentCategorizerDL: failures surface as IllegalStateException (cause
  preserved) and an unexpected/empty model-output shape is its own loud
  failure, rather than a bare RuntimeException or raw ClassCastException.
- Floor the character-search cursor at each sentence's start (via
  sentPosDetect) and thread it forward across that sentence's chunks, so a
  repeated entity surface form is located at its own occurrence instead of
  being re-matched against an earlier one -- which previously emitted
  duplicate or mis-located spans for multi-sentence/multi-chunk input.
- Span text reconstruction matches the source with flexible whitespace
  (\s*), so entities whose wordpiece tokenization splits internal
  punctuation or "&" apart (U.S.A, AT&T) are still located instead of
  silently dropped.
- Remove the now-unused SpanEnd record.
- Extract decodeSpans()/predictLabel()/findEntityEnd()/buildSpanText() and
  expose labelProbability()/maxIndex() for unit testing without an ONNX
  model; add NameFinderDLTest coverage for entity types, bounded and
  edge-case probabilities, malformed begin labels, wordpiece
  reconstruction, internal-punctuation and case-insensitive matching,
  missing labels, and cursor-threaded span location.
- Reconcile the OPENNLP-1844 concurrency/snapshot eval tests with the new
  all-types output (the George-Washington input now yields PER + LOC) and
  assert span types and covered text.
krickert marked this pull request as ready for review on June 16, 2026, 21:05