OPENNLP-1846 - Recognize all entity types in NameFinderDL, not only p…#1086
Open
krickert wants to merge 1 commit into
Pull request overview
This PR fixes NameFinderDL’s BIO decoding so it recognizes all entity types emitted by a BIO-tagged token-classification ONNX model (not just PER), and ensures Span.getType() contains the entity label (e.g., PER, LOC) rather than the matched text. It also updates the DL eval tests to assert entity types and to cover the previously dropped location entity.
Changes:
- Generalize BIO span decoding in NameFinderDL from B-PER/I-PER to B-<TYPE>/I-<TYPE> and set the Span type to <TYPE>.
- Update NameFinderDLEval assertions to validate span types and covered text, including an added LOC entity case.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java | Generalizes BIO decoding and changes Span type to be the entity label. |
| opennlp-eval-tests/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java | Updates tests to assert entity labels in Span.getType() and checks an additional LOC span. |
…decoding

NameFinderDL only decoded B-PER/I-PER and put the matched text in Span.getType() instead of the entity label. Decode the BIO sequence generically and harden it:
- Any B-<TYPE> begins a span whose type is the label minus the B- prefix (B-ORG -> ORG), extending while the following labels are I-<same type>. Span.getType() now reports the entity label (PER, ORG, LOC, ...) and ids2Labels fully drives recognition for any BIO-tagged model.
- isBeginLabel() requires a non-empty type after "B-", so a malformed "B-" label no longer starts an empty-type span. An argmax index with no entry in ids2Labels fails loudly instead of being silently skipped.
- Span.getProb() is now a numerically stable softmax over the token's label scores (bounded to [0,1]) instead of the raw max logit; it handles the +Inf, all-(-Inf) and NaN edge cases.
- find() inference is fail-loud and consistent with the sibling DocumentCategorizerDL: failures surface as IllegalStateException (cause preserved), and an unexpected or empty model-output shape is its own loud failure, rather than a bare RuntimeException or raw ClassCastException.
- Floor the character-search cursor at each sentence's start (via sentPosDetect) and thread it forward across that sentence's chunks, so a repeated entity surface form is located at its own occurrence instead of being re-matched against an earlier one. The old behavior emitted duplicate or mis-located spans for multi-sentence/multi-chunk input.
- Span text reconstruction matches the source with flexible whitespace (\s*), so entities whose wordpiece tokenization splits internal punctuation or "&" apart (U.S.A, AT&T) are still located instead of silently dropped.
- Remove the now-unused SpanEnd record.
- Extract decodeSpans()/predictLabel()/findEntityEnd()/buildSpanText() and expose labelProbability()/maxIndex() for unit testing without an ONNX model; add NameFinderDLTest coverage for entity types, bounded and edge-case probabilities, malformed begin labels, wordpiece reconstruction, internal-punctuation and case-insensitive matching, missing labels, and cursor-threaded span location.
- Reconcile the OPENNLP-1844 concurrency/snapshot eval tests with the new all-types output (the George-Washington input now yields PER + LOC) and assert span types and covered text.
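The bounded-probability item above can be illustrated with a small, standalone sketch. It is not the code from this PR: the class name, the method name softmaxAt, and the edge-case policy (NaN or all-(-Inf) scores yield 0.0; +Inf scores share the probability mass equally) are assumptions chosen only to show how subtracting the maximum score keeps the softmax numerically stable and inside [0,1].

```java
final class LabelProbabilitySketch {

  /**
   * Hypothetical helper: softmax probability of scores[index].
   * Edge-case policy below is an assumption, not taken from the PR.
   */
  static double softmaxAt(float[] scores, int index) {
    if (scores == null || index < 0 || index >= scores.length) {
      throw new IllegalArgumentException("index out of range for scores");
    }
    double max = Double.NEGATIVE_INFINITY;
    int positiveInfinities = 0;
    for (float s : scores) {
      if (Float.isNaN(s)) {
        return 0.0; // assumed policy: NaN carries no usable confidence
      }
      if (s == Float.POSITIVE_INFINITY) {
        positiveInfinities++;
      }
      if (s > max) {
        max = s;
      }
    }
    if (positiveInfinities > 0) {
      // assumed policy: +Inf scores split the whole mass between them
      return scores[index] == Float.POSITIVE_INFINITY ? 1.0 / positiveInfinities : 0.0;
    }
    if (max == Double.NEGATIVE_INFINITY) {
      return 0.0; // assumed policy: every score is -Inf
    }
    // Subtracting the max keeps every exponent <= 0, so exp() cannot overflow.
    double sum = 0.0;
    for (float s : scores) {
      sum += Math.exp(s - max);
    }
    return Math.exp(scores[index] - max) / sum;
  }
}
```

Because the largest score always contributes exp(0) = 1 to the sum, the denominator is at least 1 and the result cannot exceed 1, which is the property a raw max logit does not have.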
…erson
NameFinderDL only decoded B-PER/I-PER and ignored every other label the model emitted, and it put the matched text in Span.getType() instead of the entity label. Decode the BIO sequence generically:
The ids2Labels map now fully drives recognition for any BIO-tagged token classification model. The B_PER/I_PER constants are retained for reference.
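The generic decoding described above can be sketched over a plain list of per-token labels. This is a minimal illustration under stated assumptions, not the PR's implementation: the class BioDecodeSketch, the TypedSpan holder, and the decode method are hypothetical names, and the spans here are token-index ranges rather than the character offsets that NameFinderDL reports through Span.

```java
import java.util.ArrayList;
import java.util.List;

final class BioDecodeSketch {

  /** Hypothetical holder: token range [start, end) plus the entity type. */
  static final class TypedSpan {
    final int start;
    final int end;
    final String type;

    TypedSpan(int start, int end, String type) {
      this.start = start;
      this.end = end;
      this.type = type;
    }
  }

  /** A begin label needs a non-empty type after "B-"; a bare "B-" is rejected. */
  static boolean isBeginLabel(String label) {
    return label != null && label.startsWith("B-") && label.length() > 2;
  }

  /** Decodes B-<TYPE> followed by zero or more I-<same TYPE> labels into typed spans. */
  static List<TypedSpan> decode(List<String> labels) {
    List<TypedSpan> spans = new ArrayList<>();
    int i = 0;
    while (i < labels.size()) {
      String label = labels.get(i);
      if (!isBeginLabel(label)) {
        i++; // "O", orphan "I-..." and malformed labels never start a span
        continue;
      }
      String type = label.substring(2);  // B-ORG -> ORG
      String inside = "I-" + type;
      int end = i + 1;
      while (end < labels.size() && inside.equals(labels.get(end))) {
        end++;                           // extend only over I-<same type>
      }
      spans.add(new TypedSpan(i, end, type));
      i = end;
    }
    return spans;
  }
}
```

For the labels B-PER, I-PER, O, O, O, O, B-LOC, I-LOC this sketch returns a PER span over tokens 0 to 2 and a LOC span over tokens 6 to 8, which mirrors the PER + LOC result the PR expects for the George Washington sentence.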
Update NameFinderDLEval to assert span types (PER/LOC) and covered text, including the additional location entity that the person-only decoder previously dropped ("United States" in the George Washington sentence).
Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.