A clinical NLP system that classifies whether a medical condition in clinical text is ongoing, resolved, negated, or ambiguous — at both the phrase level and the full note level.
Key results (159-phrase annotated evaluation set):
- Rule-based classifier: 89.8% accuracy
- Hybrid triage: 100% recall of errors, 62% of predictions auto-approved at 100% accuracy
- Calibration transfer: isotonic regression reduces ECE from 0.109 → 0.003 (−97%) using only synthetic training data
- TAM: grammatical tense/aspect/modality adds +13.9% accuracy (ambiguous label +40%), 21 improved, 0 hurt
- Trajectory: intra-section status tracking adds +40% accuracy on multi-mention conditions; note-level F1 0.694 → 0.757 (+9%)
- Attribution: asserter-identity signals add +4.4% accuracy on 159-phrase set; inline attribution cases 40% → 96% (+56%)
- 332 tests across 9 test files
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
pytest # 332 passed
streamlit run app.py # browser UI
python main.py # CLI evaluation

Optional: SciSpaCy NER (improves entity recall on real notes; requires Python 3.11):
pip install scispacy==0.5.4
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz

Without SciSpaCy, the system uses a vocabulary-based NER covering ~85 common clinical conditions with no setup required.
| Label | Meaning | Examples |
|---|---|---|
| ongoing | Condition is active, persistent, stable, controlled, or currently changing | "Persistent cough for 2 weeks", "Diabetes is stable", "Seizures controlled on medication" |
| resolved | Condition is historical, closed, or no longer active | "History of asthma", "Fever has resolved", "s/p appendectomy" |
| negated | Condition is explicitly denied or absent | "Patient denies chest pain", "No evidence of pneumonia", "Fever -ve" |
| ambiguous | Condition is uncertain, suspected, or unconfirmed | "Possible pneumonia", "Rule out sepsis", "Concern for pulmonary embolism" |
Important nuance — "better" is ongoing, not resolved:
Asthma better today → ongoing
"Better" means the condition has improved but is still present. Words like stable, controlled, improving, and better all signal an active condition being managed — not a resolved one.
flowchart TD
APP["Streamlit UI\napp.py\nPhrase · Note · Evaluate"]
CLI["CLI\nmain.py"]
subgraph PIPELINE["Full Note Pipeline — pipeline.py · process_note()"]
SD["① Section Detection\nsection_detector.py\nPMH · HPI · Assessment …\nstatus prior per section"]
NM["② Text Normalization\nnormalizer.py\nh/o · -ve · s/p …"]
NER["③ NER\nner.py"]
SS["④ Sentence Splitting\nsentence_splitter.py"]
CE["⑤ Sentence Context Extraction\nScope each entity to its own sentence"]
CL["⑥ Classify\nclassifier.py"]
DP["⑦ Dep-Parse Refinement\ndep_parser.py\nnegation scope · list negation"]
SP["⑧ Section Prior Override\nconf < 0.55 → section prior wins"]
CO["⑨ Pronoun Coreference\ncoref.py\n'It resolved.' → prior entity"]
TR["⑩ Trajectory Refinement\ntrajectory.py\ntime-decayed multi-sentence tracking"]
DD["Deduplication\nFirst mention per condition"]
end
subgraph NER_M["NER Methods — ner.py"]
SC["SciSpaCy\nen_ner_bc5cdr_md\nprimary"]
VM["Vocabulary Matcher\n~85 conditions\nzero-dep fallback"]
end
subgraph RULE_CLF["Rule-Based Classifier — classifier.py"]
PN["Pseudo-Negation Masking\n'no longer' · 'not only' …"]
CUE["Weighted Cue Matching\nNegation · Resolved · Ongoing · Ambiguous"]
TE["Temporal Signal\ntemporal.py\n'3 yrs ago' → resolved · 'today' → ongoing"]
CA["Clause-Aware Override\nFinal clause after But / However wins"]
CB["Calibration\ncalibration.py · Platt scaling"]
end
subgraph BAYES_CLF["Bayesian Fusion — bayesian_fusion.py"]
LP["Section-Conditional Log-Priors\nPMH: P(resolved)=0.55 · HPI: P(ongoing)=0.50"]
BF["Bayes-Factor Cue Updates\nLog-likelihood ratios from rules.py"]
TAM["TAM Signal\ntam.py · Tense · Aspect · Modality\nCompositional grammatical evidence"]
AT["Attribution Signal\nattribution.py · Patient · Record · Hedge"]
SM["Softmax → Posterior Distribution\n+ Shannon Entropy (0–2 bits)"]
end
HYB["Hybrid Classifier — hybrid.py\nRule-based status + Bayesian uncertainty\nTriage flag: entropy > 1.2 bits or system disagreement"]
subgraph OUTPUT["Output"]
PRES["PipelineResult\nconditions · sections_found · ner_method · warnings"]
RES["ConditionResult per entity\ncondition · status · confidence · section\ncontext · reason · trajectory"]
TF["Triage Flag\nHigh-uncertainty predictions\nflagged for human review"]
end
APP -->|full note| PIPELINE
APP -->|single phrase| HYB
CLI -->|evaluate dataset| PIPELINE
SD --> NM --> NER --> SS --> CE --> CL --> DP --> SP --> CO --> TR --> DD
NER --> NER_M
CL --> RULE_CLF
HYB --> RULE_CLF
HYB --> BAYES_CLF
HYB --> TF
DD --> PRES --> RES
Motivation. Not every "no" in clinical text is a negation. "No longer has headache" means the condition has resolved; "No improvement noted" means it persists. A naive negation detector mis-classifies both — and these patterns are common enough in clinical documentation that getting them wrong meaningfully degrades accuracy.
The system detects and masks pseudo-negation patterns before cue scoring, so the true clinical meaning is preserved.
| Phrase | Naive | This system | Why |
|---|---|---|---|
| No longer has headache | negated | resolved | "no longer" = condition ended, not denied |
| Not improving on current regimen | negated | ongoing | condition present, not responding |
| No improvement noted | negated | ongoing | condition persists unchanged |
| No change in diabetes status | negated | ongoing | unchanged = still present |
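The masking step can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the pattern table, the `__PSEUDO__` placeholder, the negation cue list, and both function names are assumptions made for the example (the real registry lives in `rules.py`), and the fall-through to `ongoing` stands in for the full cue scorer.

```python
import re

# Hypothetical pseudo-negation registry: pattern -> label it actually signals.
PSEUDO_NEGATIONS = {
    r"\bno longer\b": "resolved",
    r"\bnot improving\b": "ongoing",
    r"\bno improvement\b": "ongoing",
    r"\bno change\b": "ongoing",
}

# Hypothetical, deliberately tiny list of true negation cues.
NEGATION_CUES = re.compile(r"\b(?:no|not|denies|without)\b", re.IGNORECASE)


def mask_pseudo_negations(text):
    """Replace pseudo-negation spans with a placeholder and collect their labels."""
    labels = []
    masked = text
    for pattern, label in PSEUDO_NEGATIONS.items():
        masked, count = re.subn(pattern, "__PSEUDO__", masked, flags=re.IGNORECASE)
        if count:
            labels.append(label)
    return masked, labels


def classify_with_masking(text):
    """Mask first, then look for negation cues only in what remains."""
    masked, labels = mask_pseudo_negations(text)
    if labels:
        return labels[0]
    if NEGATION_CUES.search(masked):
        return "negated"
    return "ongoing"  # simplification: the real system scores all cue categories
```

Because masking happens before cue matching, the "no" inside "no longer" is never visible to the negation matcher, while a genuine "No evidence of pneumonia" still classifies as negated.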
Motivation. Compound sentences contain contradictory signals. An earlier clause may describe a prior state that the final clause overrides — taking the first strong signal produces the wrong result, and taking the strongest signal ignores the temporal ordering that gives the final clause its authority.
When a sentence has clauses separated by an adversative conjunction or period, only the final clause is classified.
"I had severe flu which I think is getting better now.
But after a couple of days, it got completely over."
→ resolved (final clause wins; first clause ignored)
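A rough sketch of the final-clause selection, assuming a small illustrative set of adversative markers ("but", "however") and a naive period split. The regex and the `final_clause` name are hypothetical; abbreviations such as "Dr." would need the protection that the project's sentence splitter provides.

```python
import re

# Assumed clause boundaries: a sentence-ending period, or an adversative conjunction.
CLAUSE_BOUNDARY = re.compile(r"\.\s+|\b(?:but|however)\b[,\s]*", re.IGNORECASE)


def final_clause(text):
    """Return the last non-empty clause; the classifier would then score only this."""
    parts = [part.strip(" ,.;") for part in CLAUSE_BOUNDARY.split(text)]
    parts = [part for part in parts if part]
    return parts[-1] if parts else text.strip()


flu = ("I had severe flu which I think is getting better now. "
       "But after a couple of days, it got completely over.")
print(final_clause(flu))
```

Only the returned clause is passed on for cue scoring, so the "getting better" signal in the first sentence cannot outvote the later "completely over".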
Motivation. Keyword cues alone cannot distinguish "chest pain 3 years ago" (resolved) from "chest pain since this morning" (ongoing). The same condition entity appears in both phrases — only the temporal expression differentiates them, and no keyword fires on the entity itself. Past and present temporal expressions carry independent evidence that must feed the classification separately from keyword matches.
| Phrase | Signal | Effect |
|---|---|---|
| "DM diagnosed 3 years ago" | past | boosts resolved |
| "Chest pain since this morning" | present | boosts ongoing |
| "Was treated for pneumonia last year" | past | boosts resolved |
| "Acute onset shortness of breath" | present | boosts ongoing |
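The idea can be illustrated with two small regexes. The patterns below are assumptions chosen to cover the four rows of the table; the actual patterns in `temporal.py` are not shown in this document and are likely broader.

```python
import re

# Illustrative patterns only; not the project's actual temporal vocabulary.
PAST = re.compile(
    r"\b(?:\d+\s+(?:days?|weeks?|months?|years?|yrs?)\s+ago|last\s+(?:week|month|year))\b",
    re.IGNORECASE,
)
PRESENT = re.compile(
    r"\b(?:since\s+this\s+morning|today|currently|acute\s+onset)\b",
    re.IGNORECASE,
)


def temporal_signal(text):
    """Return 'past', 'present', or None when no (or conflicting) evidence is found."""
    has_past = bool(PAST.search(text))
    has_present = bool(PRESENT.search(text))
    if has_past and not has_present:
        return "past"
    if has_present and not has_past:
        return "present"
    return None
```

The returned signal is kept separate from keyword cues, so a phrase with no status keyword at all ("DM diagnosed 3 years ago") still contributes evidence.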
Motivation. Fixed character windows bleed signals across sentence boundaries. "No fever. Patient has diabetes." — a 50-character window around "diabetes" reaches the "No" from sentence 1, causing a false negation. Clinical notes have high sentence density; adjacent sentences routinely carry opposite status signals.
Entity classification uses only the sentence containing the entity. Each sentence is a self-contained evidence unit.
"No fever. Patient has diabetes."
───────── ─────────────────────
sentence 1 sentence 2 → "No" from sentence 1 never reaches diabetes
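To make the contrast concrete, here is a small sketch comparing a fixed character window with sentence scoping. Both helper names and the naive punctuation-based splitter are assumptions for illustration; the project's `sentence_splitter.py` additionally protects abbreviations.

```python
import re


def window_for_entity(note, entity, width=50):
    """Fixed character window around the entity (the approach being avoided)."""
    idx = note.lower().find(entity.lower())
    if idx < 0:
        return None
    return note[max(0, idx - width): idx + len(entity) + width]


def sentence_for_entity(note, entity):
    """Return only the sentence that contains the entity."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]
    pattern = re.compile(r"\b" + re.escape(entity) + r"\b", re.IGNORECASE)
    for sentence in sentences:
        if pattern.search(sentence):
            return sentence
    return None


note = "No fever. Patient has diabetes."
print(window_for_entity(note, "diabetes"))    # window still contains "No"
print(sentence_for_entity(note, "diabetes"))  # only the second sentence
```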
Motivation. Clinical notes frequently refer back to a just-mentioned condition with "it", "this", or "they". The entity's own sentence may be a bare noun phrase with no classifiable cues — "The patient had a cough." — while the definitive status appears in the next sentence via a pronoun. Ignoring pronoun sentences loses valid evidence that is directly about the entity.
When a pronoun sentence contains a confident status signal and the entity's own sentence was weak (confidence < 0.65), the status is attributed to the most recent entity in the same section. Coref is section-scoped to prevent cross-section contamination.
"The patient had a cough. It resolved."
──────────── → cough: resolved (87% conf)
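The attribution rule can be sketched as below. The input shape (one dict per sentence within a single section), the function name, and the `strong` cutoff for what counts as a "confident" pronoun sentence are all assumptions; only the 0.65 weak-entity threshold comes from the description above.

```python
PRONOUNS = {"it", "this", "they"}


def resolve_pronoun_status(sentences, weak=0.65, strong=0.80):
    """Attribute a confident pronoun-sentence status to the most recent weak entity.

    `sentences` holds the sentences of ONE section in order, each a dict with
    keys: text, entity (str or None), status, confidence.
    """
    results = {}
    last_entity = None
    for s in sentences:
        entity = s.get("entity")
        if entity is not None:
            results[entity] = {"status": s["status"], "confidence": s["confidence"]}
            last_entity = entity
            continue
        words = s["text"].split()
        first = words[0].lower().strip(".,;:") if words else ""
        if (
            first in PRONOUNS
            and last_entity is not None
            and s["confidence"] >= strong          # assumed cutoff for "confident"
            and results[last_entity]["confidence"] < weak
        ):
            results[last_entity] = {"status": s["status"], "confidence": s["confidence"]}
    return results


section = [
    {"text": "The patient had a cough.", "entity": "cough", "status": "ongoing", "confidence": 0.50},
    {"text": "It resolved.", "entity": None, "status": "resolved", "confidence": 0.87},
]
print(resolve_pronoun_status(section))
```

Calling the function once per section gives the section scoping described above: a pronoun in one section can never reach an entity in another.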
Motivation. All keyword-based classifiers, including this system's rule and Bayesian layers, are lexically grounded. They fire on words, not grammatical structure. But grammatical predicate structure carries strong status information that no keyword can capture:
- "Symptoms are worsening": no temporal adverb, but present progressive = ongoing
- "Blood pressure should be monitored": no cue fires, but deontic modal = ongoing obligation
- "The infection had resolved": past perfect = completed before a past reference point → resolved
- "Symptoms might be worsening": strong ongoing cue, but epistemic modal raises uncertainty

Adding "might have been resolving" as a keyword phrase would handle only that exact string. TAM extracts the grammatical signature and handles unseen combinations compositionally.
The TAM signature of the governing predicate is extracted and mapped to independent LLRs feeding the same Bayesian log-score accumulator as keyword cues. Tense, aspect, and modality each contribute separately:
"might have been resolving"
epistemic_weak (modal) → shifts posterior toward ambiguous
+ progressive (aspect) → shifts toward ongoing
= high-entropy posterior → triage flagged
TAM patterns are intentionally conservative — only specific clinical verb constructions, never bare is/has/was — to prevent false positives in negation contexts. "Patient had no fever" fires no TAM signal and stays negated; "The fever had resolved" fires past_perfect and boosts resolved.
Motivation. TAM captures how the predicate is expressed. Temporal signals capture when. Both are silent about who is asserting the status. A patient who thinks they have hypertension, a family member who reports a seizure history, and an EHR fragment that shows diabetes are qualitatively different claims — carrying different degrees of clinical certainty — and none are captured by keyword, temporal, or TAM signals.
Key novel case — record attribution. "Per records, hypertension" and "Records show asthma" contain no resolved keyword, no temporal adverb, and no TAM signal. Without attribution, the classifier has no resolved evidence and defaults to ongoing/ambiguous. The record source contributes LLR +1.0 to resolved — enough to produce the correct classification from a single signal.
Each asserter type maps to an independent LLR vector:
| Source | Clinical meaning | Primary LLR effect |
|---|---|---|
| record | per records/chart, records show/document | +1.0 to resolved |
| patient_hedge | patient thinks/believes/suspects | +0.80 to ambiguous |
| clinician_hedge | we think/believe, appears consistent with | +0.60 to ambiguous |
| family_report | family/wife/caregiver reports/states | +0.40 to ambiguous |
| patient_report | patient reports/states/endorses | +0.30 to ambiguous |
All attribution LLRs are capped at ±1.0. Strong keyword cues (LLR ≈ 6.9 for weight=0.999) always dominate — "Patient denies fever" stays negated even though patient_report fires, because denies (LLR ≈ 3.7) overwhelms the +0.30 attribution signal. "Family history of diabetes" does not trigger family_report because the pattern requires a report verb after the family noun.
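A minimal sketch of source detection and of why keyword cues dominate. The regexes are reconstructed from the example phrasings in the table and are assumptions, as are the names `ATTRIBUTION_RULES`, `detect_attribution`, and `weight_to_llr`; the real patterns in `attribution.py` may be broader.

```python
import math
import re

# (source, assumed pattern, target label, LLR from the table above)
ATTRIBUTION_RULES = [
    ("record", r"\b(?:per (?:records?|chart)|records? (?:show|document)s?)\b", "resolved", 1.00),
    ("patient_hedge", r"\bpatient (?:thinks|believes|suspects)\b", "ambiguous", 0.80),
    ("clinician_hedge", r"\b(?:we (?:think|believe)|appears consistent with)\b", "ambiguous", 0.60),
    ("family_report", r"\b(?:family|wife|caregiver) (?:reports|states)\b", "ambiguous", 0.40),
    ("patient_report", r"\bpatient (?:reports|states|endorses)\b", "ambiguous", 0.30),
]


def detect_attribution(text):
    """Return (source, label, llr) for the first matching rule, else None."""
    for source, pattern, label, llr in ATTRIBUTION_RULES:
        if re.search(pattern, text, re.IGNORECASE):
            return source, label, llr
    return None


def weight_to_llr(w):
    """Convert a cue weight in (0, 1) to a log-likelihood ratio (log-odds)."""
    return math.log(w / (1.0 - w))


print(detect_attribution("Per records, hypertension"))
print(detect_attribution("Family history of diabetes"))
print(round(weight_to_llr(0.999), 1))
```

The log-odds conversion shows the scale difference directly: a weight of 0.999 corresponds to an LLR of about 6.9, far above the ±1.0 cap on attribution signals.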
Motivation. Every prior approach classifies each entity against the single sentence that contains it. When the same condition appears multiple times within a section, earlier sentences are discarded entirely. This throws away directional evidence: a condition mentioned as ongoing in sentence 1 and resolved in sentence 3 is a resolution — a clinically meaningful transition — not an ambiguous signal to be averaged or ignored.
The sequence of status classifications for a condition across all sentences in a section is reconciled using time-decayed log-evidence accumulation. Each transition type earns a bonus LLR encoding the clinical prior that this transition is meaningful:
| Transition | Clinical meaning | Bonus LLR |
|---|---|---|
| ongoing → resolved | resolution: condition cleared | +0.80 |
| resolved → ongoing | relapse: history item now active | +1.00 |
| negated → ongoing | contradiction: prior denial overridden | +0.90 |
| ambiguous → [definite] | clarification: uncertainty resolved | +0.60 |
With n mentions indexed i = 0, …, n−1 in document order, mention i contributes weight 0.7^(n−1−i), so the most recent mention dominates but earlier signals still inform. This improves on "take the last sentence" because early-context confidence bounds the posterior when the final sentence is weak. The transition_type field (relapse, contradiction, multi_transition) feeds directly into the hybrid triage system for clinical review.
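One way the reconciliation could work is sketched below. The decay factor and bonus values come from the text and table above; everything else is an assumption made for illustration: each mention's evidence is taken to be the log-odds of its confidence, and the bonus is applied only to the label of the final transition. The actual accumulation in `trajectory.py` may differ.

```python
import math

DECAY = 0.7
TRANSITION_BONUS = {
    ("ongoing", "resolved"): 0.80,   # resolution
    ("resolved", "ongoing"): 1.00,   # relapse
    ("negated", "ongoing"): 0.90,    # contradiction
}
CLARIFICATION_BONUS = 0.60           # ambiguous -> any definite label


def decay_weights(n):
    """Weight 0.7^(n-1-i) for mention i; the last mention gets weight 1.0."""
    return [DECAY ** (n - 1 - i) for i in range(n)]


def reconcile(points):
    """points: ordered list of (status, confidence) for one condition in one section."""
    n = len(points)
    scores = {}
    for (status, conf), weight in zip(points, decay_weights(n)):
        conf = min(max(conf, 1e-6), 1 - 1e-6)  # keep the log-odds finite
        scores[status] = scores.get(status, 0.0) + weight * math.log(conf / (1 - conf))
    if n >= 2:
        prev, last = points[-2][0], points[-1][0]
        if prev == "ambiguous" and last != "ambiguous":
            scores[last] += CLARIFICATION_BONUS
        else:
            scores[last] += TRANSITION_BONUS.get((prev, last), 0.0)
    return max(scores, key=scores.get)


print(reconcile([("ongoing", 0.9), ("resolved", 0.8)]))
```

In this example the earlier, more confident "ongoing" mention would outweigh the later "resolved" one on decayed evidence alone (about 1.54 vs 1.39); the resolution bonus is what tips the result, which is the role the table assigns to it.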
Complementarity with coreference. Pronoun coreference handles "Patient has cough. It resolved." (pronoun bridge). Trajectory handles "Patient has cough. Cough resolved." (explicit re-mention). Neither replaces the other: both run in the same pipeline pass, coreference first and trajectory refinement immediately after.
Motivation. The rule-based classifier returns a single confidence score and argmax — no distribution, no uncertainty quantification. Without a proper posterior, there is no principled basis for combining multiple evidence sources, detecting conflicting signals, or routing uncertain predictions to human review. Confidence scores based on how strongly rules fire are not the same as calibrated probabilities of being correct.
Bayesian fusion treats cue weights as calibrated likelihood ratios and accumulates all evidence into a posterior distribution. For each label ℓ:
log_score[ℓ] = log P(label=ℓ | section) ← section-conditional prior
+ Σ log(w / (1-w)) ← for each cue targeting ℓ
- Σ log(w' / (1-w')) / 3 ← for each cue targeting ℓ' ≠ ℓ
+ TAM LLR[ℓ] ← grammatical predicate structure
+ attribution LLR[ℓ] ← asserter identity
posterior[ℓ] = softmax(log_score)[ℓ]
All evidence sources — keywords, temporal, TAM, attribution — are orthogonal dimensions summing into the same log-score vector before softmax. Section priors encode clinical domain knowledge: PMH → resolved=0.55; HPI/Assessment → ongoing=0.50.
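The log-score formula above translates fairly directly into code. The sketch below follows it term by term; the function names are hypothetical, and the example prior assumes the non-dominant labels share the remaining mass equally, which the text does not state.

```python
import math

LABELS = ("ongoing", "resolved", "negated", "ambiguous")


def llr(w):
    """Log-likelihood ratio for a cue weight w in (0, 1)."""
    return math.log(w / (1.0 - w))


def fuse(prior, cues, extra_llr=None):
    """prior: {label: P}; cues: [(target_label, weight)]; extra_llr: {label: LLR}.

    extra_llr stands in for the TAM and attribution terms of the formula.
    Returns (posterior, entropy_in_bits).
    """
    extra = extra_llr or {}
    log_score = {}
    for label in LABELS:
        score = math.log(prior[label])
        for target, w in cues:
            # Full support for the targeted label, one-third penalty for the rest.
            score += llr(w) if target == label else -llr(w) / 3
        score += extra.get(label, 0.0)
        log_score[label] = score
    peak = max(log_score.values())                      # stabilise the softmax
    exp = {label: math.exp(s - peak) for label, s in log_score.items()}
    total = sum(exp.values())
    posterior = {label: e / total for label, e in exp.items()}
    entropy = -sum(p * math.log2(p) for p in posterior.values() if p > 0)
    return posterior, entropy


# Assumed PMH prior: resolved=0.55 from the text, remainder split equally.
pmh_prior = {"resolved": 0.55, "ongoing": 0.15, "negated": 0.15, "ambiguous": 0.15}
posterior, entropy = fuse(pmh_prior, cues=[], extra_llr={"resolved": 1.0})
print(max(posterior, key=posterior.get), round(entropy, 2))
```

With four labels the entropy ranges from 0 bits (all mass on one label) to 2 bits (uniform), matching the 0–2 bit range used by the triage threshold.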
Motivation. In clinical settings, the cost of an incorrect classification is high — but routing every prediction for human review defeats the purpose of automation. The goal is to identify which specific predictions are uncertain enough to warrant review, while auto-approving the rest with provably high accuracy. Entropy over the posterior provides exactly this signal: wrong predictions have 2.5× higher entropy than correct ones.
A prediction is flagged when either:
- Bayesian entropy > 1.2 bits (tunable default)
- The rule-based system and Bayesian system predict different labels
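The two flagging conditions can be sketched as a small function. The output keys mirror the result fields shown elsewhere in this document (`entropy`, `agreement`, `triage_flag`, `triage_reason`); the function names, reason strings, and example posteriors are illustrative assumptions.

```python
import math


def entropy_bits(posterior):
    """Shannon entropy of a label distribution, in bits."""
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)


def triage(rule_status, posterior, threshold=1.2):
    """Flag a prediction when entropy is high or the two systems disagree."""
    bayes_status = max(posterior, key=posterior.get)
    entropy = entropy_bits(posterior)
    reasons = []
    if entropy > threshold:
        reasons.append("entropy above threshold")
    if bayes_status != rule_status:
        reasons.append("rule/Bayes disagreement")
    return {
        "entropy": entropy,
        "agreement": bayes_status == rule_status,
        "triage_flag": bool(reasons),
        "triage_reason": "; ".join(reasons),
    }


confident = {"negated": 0.97, "ongoing": 0.01, "resolved": 0.01, "ambiguous": 0.01}
split = {"resolved": 0.48, "ongoing": 0.47, "negated": 0.03, "ambiguous": 0.02}
print(triage("negated", confident)["triage_flag"])
print(triage("resolved", split)["triage_flag"])
```

Either condition alone is enough to flag: a sharply peaked posterior that contradicts the rule-based label is routed to review even though its entropy is low.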
result = classify("History of poorly controlled hypertension.")
# result["entropy"] → 1.21 bits (two signals compete)
# result["triage_flag"] → True (flagged for review)
# result["runner_up"] → ("ongoing", 0.47)

Motivation. Rule-based systems assign confidence scores based on how strongly rules fire, not empirical accuracy, so these scores are systematically miscalibrated. Collecting labelled real clinical text for calibration fitting is expensive and requires de-identification. If calibration can be fitted entirely on synthetic data and transferred to real text, the overhead of calibration fitting drops to near zero.
Miscalibration is driven by rule activation patterns, not surface form variation. Calibration models therefore transfer from synthetic to real text.
Isotonic regression fitted on 2,850 synthetic phrases reduces ECE by 97% on real clinical text, confirming the transfer hypothesis.
Why rule-based rather than ML?
Every prediction can be traced to a specific cue, temporal expression, or section prior. An ML baseline (TF-IDF + logistic regression, trained on the same 2,850 synthetic phrases) shows exactly where domain engineering wins:
| System | Accuracy | Misclassified |
|---|---|---|
| Rule-based (this system) | 97% | 1 / 39 |
| TF-IDF + logistic regression | 90% | 4 / 39 |
| Phrase | Gold | ML predicts | Why ML fails |
|---|---|---|---|
| "Asthma better today" | ongoing | resolved | "better" correlates with resolved in training data |
| "s/p appendectomy" | resolved | ongoing | abbreviation not expanded before ML sees it |
| "Cough, no improvement noted" | ongoing | negated | ML sees "no"; rule system masks pseudo-negation |
Why weighted multi-signal scoring rather than priority order?
The original system used fixed priority: negated → ambiguous → resolved → ongoing. This fails when two signals appear in the same phrase:
"History of asthma, currently worsening"
→ Priority order: resolved (first cue found)
→ Weighted scoring: ongoing (worsening 0.95 + currently 0.90 + bonus > history of 0.95)
Why a two-level pipeline?
The phrase classifier is optimised for short, focused input. The pipeline feeds it exactly what it needs — a single sentence from the appropriate section — without the classifier needing to know how sections, NER, or sentence splitting work.
Evaluated on 159 labelled clinical phrases covering all four status labels and TAM-sensitive, attribution-sensitive, and temporally-marked constructions.
| System | Accuracy | ECE |
|---|---|---|
| Rule-based | 89.8% | 0.109 |
| Bayesian fusion | 87.4% | 0.125 |
Per-label ECE — Bayesian wins on 3 of 4 categories:
| Label | Rule ECE | Bayes ECE |
|---|---|---|
| ongoing | 0.259 | 0.215 |
| resolved | 0.116 | 0.104 |
| negated | 0.142 | 0.131 |
| ambiguous | 0.140 | 0.190 |
Reproduce: python experiments/bayesian_fusion_eval.py
Comparison on 151-phrase set (includes 64 TAM-sensitive phrases), with vs without TAM:
| Metric | Without TAM | With TAM | Δ |
|---|---|---|---|
| Accuracy | 76.2% | 90.1% | +13.9% |
| ECE | 0.065 | 0.157 | +0.091 |
| Mean entropy | 1.012 bits | 0.915 bits | −9.6% |
| Predictions changed | — | 21 improved, 0 hurt | — |
Per-label accuracy on TAM-sensitive constructions (64 phrases):
| Label | Without TAM | With TAM | Δ |
|---|---|---|---|
| ambiguous | 48% | 88% | +40% |
| resolved | 84% | 93% | +9% |
| ongoing | 90% | 90% | 0% |
| negated | 86% | 86% | 0% |
TAM fires on 40/151 (26%) phrases. ECE increases because the system becomes more confident on the new phrases — while those confidences are directionally correct, the isotonic calibrator was fitted on the original 127-phrase distribution and does not re-calibrate for the expanded set.
Reproduce: python experiments/tam_eval.py
Comparison on 159-phrase set (includes 8 record-attribution phrases), with vs without attribution:
| Metric | Without attribution | With attribution | Δ |
|---|---|---|---|
| Accuracy | 86.2% | 90.6% | +4.4% |
| Predictions changed | — | 7 improved, 0 hurt | — |
Attribution fires on only 8/159 (5%) phrases, all from the record source, and converts 7 of them from incorrect (ongoing) to correct (resolved), which accounts for the full +4.4% gain. Zero predictions hurt.
Targeted evaluation on 25 attribution-specific cases (four of the five source types; family_report is not represented):
| Source | n | Without attribution | With attribution | Δ |
|---|---|---|---|---|
| record | 12 | 25% | 100% | +75% |
| patient_hedge | 4 | 50% | 100% | +50% |
| clinician_hedge | 4 | 0% | 75% | +75% |
| patient_report | 5 | 100% | 100% | 0% |
| Overall | 25 | 40% | 96% | +56% |
patient_report baseline is already 100% because the ongoing cue independently scores the phrase correctly — the attribution signal is additive but not load-bearing. clinician_hedge (75%) misses one case where the hedged phrase has a strong keyword cue that dominates the attribution LLR.
Reproduce: python experiments/attribution_eval.py
Evaluated on 40 multi-sentence passages, each with 2 explicit mentions of the same condition:
| Transition type | n | Baseline | Trajectory | Δ |
|---|---|---|---|---|
| resolution | 10 | 10% | 100% | +90% |
| relapse | 10 | 10% | 40% | +30% |
| contradiction | 5 | 0% | 40% | +40% |
| clarification | 5 | 0% | 40% | +40% |
| stable | 10 | 100% | 100% | 0% |
| Overall | 40 | 30% | 70% | +40% |
16 improved, 0 hurt. Trajectory never degrades a correct single-sentence prediction.
Reproduce: python experiments/trajectory_eval.py
Training set: 2,850 synthetic phrases. Test set: 159 real clinical phrases.
| Method | ECE | Brier | ECE vs raw |
|---|---|---|---|
| Uncalibrated | 0.109 | 0.091 | — |
| Platt scaling | 0.065 | 0.071 | −40% |
| Isotonic regression | 0.003 | 0.061 | −97% |
| Temperature scaling | 0.130 | 0.095 | +20% |
Per-category ECE:
| Category | n | Uncalibrated | Platt | Isotonic | Temperature |
|---|---|---|---|---|---|
| ongoing | 42 | 0.259 | 0.093 | 0.153 | 0.236 |
| resolved | 38 | 0.116 | 0.083 | 0.032 | 0.102 |
| negated | 22 | 0.142 | 0.115 | 0.095 | 0.157 |
| ambiguous | 25 | 0.140 | 0.176 | 0.116 | 0.164 |
Temperature scaling fails because miscalibration is non-uniform — a single global scalar cannot correct category-specific over/under-confidence. For ongoing, Platt (ECE 0.093) outperforms isotonic (0.153) due to wider score spread; isotonic dominates for resolved and negated.
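For reference, ECE as reported in these tables can be computed along the following lines. This is a standard equal-width-bin formulation written in plain Python; the bin count of 10 is an assumption, since the binning used by `calibration.py` is not stated here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - mean confidence| over equal-width confidence bins."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


# Ten predictions at 0.9 confidence, only half of them right: heavily overconfident.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(round(overconfident, 2))
```

A calibrator (Platt, isotonic, or temperature) is fitted to map raw scores to new confidences, and the same function is then applied to the mapped values to obtain the calibrated ECE.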
Reproduce: python experiments/calibration_transfer.py
Evaluated on 127-phrase set, default threshold 1.2 bits:
| Metric | Value |
|---|---|
| Phrases flagged | 48 / 127 (38%) |
| Recall of errors | 100% — every wrong prediction flagged |
| Precision | 27% — 1 in 3.7 flagged phrases is an actual error |
| Auto-approved accuracy | 100% — 62% of predictions, no errors |
| Review efficiency | 2.6× fewer phrases read per error |
Wrong predictions have 2.5× higher entropy than correct ones, making entropy a reliable signal for routing uncertain predictions to human review. At best-F1 threshold (1.8 bits): 21% flagged, 77% recall, 37% precision. The threshold is tunable — evaluate_triage(csv_path, thresholds=[...]) sweeps thresholds and returns precision/recall/F1 at each.
Reproduce: python experiments/hybrid_eval.py
Evaluated on 7 annotated clinical notes (4 general + 3 trajectory-specific: resolution, relapse, clarification), with vs without trajectory:
| Metric | Without trajectory | With trajectory | Δ |
|---|---|---|---|
| Precision | 0.625 | 0.684 | +9.4% |
| Recall | 0.781 | 0.848 | +8.6% |
| F1 | 0.694 | 0.757 | +9.1% |
14/14 trajectory-dependent conditions correctly classified across the 3 new notes. The 7 FN errors in both configurations are pre-existing NER misses unrelated to status classification.
Reproduce: python main.py (note-level section)
Raw phrase
│
▼
Abbreviation normalizer h/o → history of, -ve → negative for, s/p → status post
│
▼
Pseudo-negation masking "no longer", "not improving" → masked before cue matching
│
▼
Multi-signal cue matching word-boundary regex, weighted scores per category
│
▼
Temporal signal detection past/present time expressions boost resolved/ongoing
│
▼
Adversative clause check "But…" / "However…" / "." → classify final clause
│
▼
Conflict detection competing signals → reduced confidence
│
▼
Platt calibration raw cue-score → P(correct)
│
▼
{status, confidence, calibrated_confidence, cue, reason, signals}
Raw phrase + section
│
├──► Rule-based classifier ──► status, confidence, calibrated_confidence
│
└──► Bayesian fusion ──────► posterior {label: P}, entropy (bits)
│
▼
Triage decision entropy > 1.2 bits OR systems disagree → triage_flag=True
│
▼
{status, posterior, entropy, runner_up, triage_flag, triage_reason, agreement}
Raw clinical note
│
▼
Section detector PMH → resolved prior | HPI → ongoing prior
│
▼ (per section)
Abbreviation normalizer → Sentence splitter → NER
│
▼
Context extraction sentence containing each entity (not fixed char window)
│
▼
Phrase classifier → Dep parser refinement → Section prior override
│
▼
Pronoun coreference → Trajectory refinement → Deduplication
│
▼
[{condition, status, confidence, section, reason, trajectory?}, ...]
from src.classifier import classify_condition_status
from src.hybrid import classify
# Rule-based only
classify_condition_status("History of asthma, currently worsening")
# → {"status": "ongoing", "confidence": 1.0, "calibrated_confidence": 0.97, ...}
# Hybrid (rule-based + Bayesian uncertainty)
classify("No evidence of pneumonia.")
# → {"status": "negated", "triage_flag": False, "entropy": 0.0, ...}

from src.pipeline import process_note, format_results
note = """
History of Present Illness:
67-year-old female presenting with worsening dyspnea.
She reports fatigue for 3 days. Denies chest pain or fever.
Past Medical History:
Hypertension, type 2 diabetes mellitus (diagnosed 5 years ago),
h/o pneumonia (resolved last year), atrial fibrillation controlled on medication.
Assessment:
Possible heart failure exacerbation. Rule out pulmonary embolism.
"""
print(format_results(process_note(note)))

CONDITION STATUS CONF SECTION
----------------------------------------------------------------
dyspnea ongoing 100% history_of_present_illness
chest pain negated 100% history_of_present_illness
fever negated 100% history_of_present_illness
Hypertension resolved 100% past_medical_history
type 2 diabetes mellitus resolved 100% past_medical_history
pneumonia resolved 100% past_medical_history
atrial fibrillation resolved 100% past_medical_history
heart failure ambiguous 76% assessment
pulmonary embolism ambiguous 95% assessment
classify_condition_status(text):
{
"status": "ongoing",
"confidence": 0.82,
"calibrated_confidence": 0.95,
"cue": "worsening",
"reason": "Ongoing/active cue found: 'worsening' | Temporal hint: present",
"signals": {
"negated": 0.0, "ambiguous": 0.0, "resolved": 0.95, "ongoing": 1.0,
"temporal": "present", "pseudo_negations": [], "clause_used": "full"
}
}

hybrid.classify(text, section):
{
"status": "ongoing",
"confidence": 0.82,
"calibrated_confidence": 0.95,
"posterior": {"ongoing": 0.71, "resolved": 0.18, "negated": 0.07, "ambiguous": 0.04},
"entropy": 0.43,
"runner_up": ("resolved", 0.18),
"agreement": True,
"triage_flag": False,
"triage_reason": "",
"rule_reason": "Ongoing/active cue found: 'worsening'",
"rule_cue": "worsening",
"bayes_status": "ongoing",
"signals": { ... }
}

streamlit run app.py

| Tab | What it does |
|---|---|
| Single Phrase | Classify a phrase; shows triage flag, posterior bar chart, runner-up, confidence, expanded abbreviations |
| Full Clinical Note | Paste a note; runs the full pipeline and returns a colour-coded condition table |
| Evaluate Dataset | Phrase accuracy, reliability diagram + ECE, ML baseline comparison, calibration methods comparison, note-level P/R/F1 |
condition-status-classifier/
│
├── data/
│ ├── clinical_phrases.csv 159-phrase labelled dataset (incl. TAM-sensitive, record attribution)
│ ├── annotated_notes.json 7 annotated clinical notes (4 original + 3 trajectory-specific)
│ ├── calibration.json fitted Platt scaler parameters
│ ├── calibration_phrases.csv 2,850 synthetic phrases for calibration fitting
│ └── generate_calibration_dataset.py
│
├── src/
│ ├── normalizer.py abbreviation expansion
│ ├── rules.py weighted cues (100+ entries) + pseudo-negation patterns
│ ├── temporal.py past/present temporal signal detection
│ ├── classifier.py phrase-level classifier
│ ├── dep_parser.py spaCy dep-tree: negation scope, list negation, temporal scope
│ ├── section_detector.py note section splitter
│ ├── ner.py NER (SciSpaCy primary / vocabulary fallback)
│ ├── sentence_splitter.py clinical sentence boundary detection
│ ├── coref.py pronoun-to-entity coreference within sections
│ ├── trajectory.py intra-section status trajectory (time-decay reconciliation)
│ ├── pipeline.py full note pipeline (incl. trajectory refinement)
│ ├── calibration.py Platt scaler + ECE + calibration transfer helpers
│ ├── tam.py TAM extraction (tense/aspect/modality → LLRs)
│ ├── attribution.py attribution-aware confidence (asserter identity → LLRs)
│ ├── bayesian_fusion.py Bayesian evidence fusion (posterior + entropy + TAM + attribution)
│ ├── hybrid.py hybrid classifier (rule-based MAP + Bayesian triage)
│ ├── baseline.py TF-IDF + logistic regression baseline
│ ├── note_evaluator.py pipeline P/R/F1 on annotated notes
│ └── utils.py phrase-level dataset evaluation
│
├── experiments/
│ ├── calibration_transfer.py
│ ├── bayesian_fusion_eval.py
│ ├── hybrid_eval.py
│ ├── tam_eval.py
│ ├── trajectory_eval.py
│ └── attribution_eval.py
│
├── tests/
│ ├── test_classifier.py 33 tests
│ ├── test_pipeline.py 44 tests
│ ├── test_coref.py 21 tests
│ ├── test_dep_and_calibration.py 38 tests
│ ├── test_bayesian_fusion.py 36 tests
│ ├── test_hybrid.py 31 tests
│ ├── test_tam.py 52 tests
│ ├── test_trajectory.py 29 tests
│ └── test_attribution.py 49 tests
│
├── app.py
├── main.py
├── pytest.ini
└── requirements.txt
| Module | Role |
|---|---|
| src/normalizer.py | Expands clinical abbreviations (h/o → history of, s/p → status post, -ve → negative for) |
| src/rules.py | Weighted cue lists (100+ entries) ordered most-to-least specific + pseudo-negation registry |
| src/temporal.py | Past/present temporal regex patterns; returns signal and confidence |
| src/classifier.py | Orchestrates phrase-level classification; emits calibrated_confidence |
| src/dep_parser.py | spaCy dependency parsing: negation scope, list negation, temporal modifier scope |
| src/section_detector.py | Splits clinical notes into labeled sections; drives section-conditional priors |
| src/ner.py | SciSpaCy primary NER + vocabulary fallback (~85 conditions) |
| src/sentence_splitter.py | Clinical sentence boundary detection with abbreviation protection |
| src/coref.py | Pronoun coreference within sections; fires only when entity confidence < 0.65 |
| src/trajectory.py | Intra-section status trajectory: time-decayed log-evidence accumulation + transition-type LLR bonuses |
| src/pipeline.py | Orchestrates the full note-level pipeline; wires dep parser, coref, and trajectory refinement |
| src/calibration.py | Platt scaler (calibrate()), reliability diagram, ECE, Brier, isotonic/temperature methods |
| src/tam.py | TAM extraction: tense/aspect/modality → LLRs; compositionality over novel predicate constructions |
| src/attribution.py | Attribution-aware confidence: extracts asserter identity (record/patient/family/clinician) and maps each source to independent LLRs |
| src/bayesian_fusion.py | Bayesian evidence fusion: posterior distribution, Shannon entropy, section priors, TAM + attribution integration |
| src/hybrid.py | Hybrid classifier: rule-based MAP label + Bayesian posterior + entropy triage flag |
| src/baseline.py | TF-IDF + logistic regression baseline |
| src/note_evaluator.py | Precision/recall/F1 evaluation on annotated clinical notes |
| src/utils.py | Phrase-level dataset evaluation helper |
| Limitation | Potential improvement |
|---|---|
| NER misses rare conditions, specialist terminology, and misspellings | Fine-tuned clinical NER or BERT-based token classifier trained on de-identified notes |
| Rule-based classification plateaus on novel phrasing and conditions outside the cue vocabulary | Fine-tuned BERT / clinical LLM trained on labelled clinical notes |
| Platt scaler fitted on 2,850 synthetic phrases — less reliable at the tails | Collect ≥500 real labelled phrases and refit |
| Dep-tree heuristics can mis-scope modifiers in complex multi-clause sentences | Dedicated relation extraction model linking negation/temporality to entity spans |
| Pronoun coreference fails when multiple entities are plausible antecedents | Neural coreference resolution (e.g. SpanBERT-based) |
| Section prior threshold (0.55) is not empirically calibrated | Tune on a held-out annotated note set |
| Dep parser silently degrades to regex-only if en_core_web_sm is not installed | Surface a clear warning in the UI and CLI |