Clinical Condition Status Classifier

A clinical NLP system that classifies whether a medical condition in clinical text is ongoing, resolved, negated, or ambiguous — at both the phrase level and the full note level.

Key results (annotated evaluation set of up to 159 phrases; see Results for the exact set behind each figure):

Rule-based classifier: 89.8% accuracy
Hybrid triage (127-phrase set): 100% recall of errors; 62% of predictions auto-approved at 100% accuracy
Calibration transfer: isotonic regression reduces ECE from 0.109 → 0.003 (−97%) using only synthetic training data
TAM (151-phrase set): grammatical tense/aspect/modality adds +13.9 percentage points of accuracy (ambiguous label +40 points); 21 predictions improved, 0 hurt
Trajectory: intra-section status tracking adds +40 percentage points of accuracy on multi-mention conditions; note-level F1 0.694 → 0.757 (+9% relative)
Attribution: asserter-identity signals add +4.4 percentage points of accuracy on the 159-phrase set; inline attribution cases 40% → 96% (+56 points)
332 tests across 9 test files

Quick Start

python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
pytest                      # 332 passed
streamlit run app.py        # browser UI
python main.py              # CLI evaluation

Optional — SciSpaCy NER (improves entity recall on real notes; requires Python 3.11):

pip install scispacy==0.5.4
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz

Without SciSpaCy, the system uses a vocabulary-based NER covering ~85 common clinical conditions with no setup required.

Status Labels

Label	Meaning	Examples
`ongoing`	Condition is active, persistent, stable, controlled, or currently changing	`"Persistent cough for 2 weeks"`, `"Diabetes is stable"`, `"Seizures controlled on medication"`
`resolved`	Condition is historical, closed, or no longer active	`"History of asthma"`, `"Fever has resolved"`, `"s/p appendectomy"`
`negated`	Condition is explicitly denied or absent	`"Patient denies chest pain"`, `"No evidence of pneumonia"`, `"Fever -ve"`
`ambiguous`	Condition is uncertain, suspected, or unconfirmed	`"Possible pneumonia"`, `"Rule out sepsis"`, `"Concern for pulmonary embolism"`

Important nuance — "better" is ongoing, not resolved:

Asthma better today  →  ongoing

"Better" means the condition has improved but is still present. Words like stable, controlled, improving, and better all signal an active condition being managed — not a resolved one.

Architecture

flowchart TD
    APP["Streamlit UI\napp.py\nPhrase · Note · Evaluate"]
    CLI["CLI\nmain.py"]

    subgraph PIPELINE["Full Note Pipeline — pipeline.py · process_note()"]
        SD["① Section Detection\nsection_detector.py\nPMH · HPI · Assessment …\nstatus prior per section"]
        NM["② Text Normalization\nnormalizer.py\nh/o · -ve · s/p …"]
        NER["③ NER\nner.py"]
        SS["④ Sentence Splitting\nsentence_splitter.py"]
        CE["⑤ Sentence Context Extraction\nScope each entity to its own sentence"]
        CL["⑥ Classify\nclassifier.py"]
        DP["⑦ Dep-Parse Refinement\ndep_parser.py\nnegation scope · list negation"]
        SP["⑧ Section Prior Override\nconf < 0.55 → section prior wins"]
        CO["⑨ Pronoun Coreference\ncoref.py\n'It resolved.' → prior entity"]
        TR["⑩ Trajectory Refinement\ntrajectory.py\ntime-decayed multi-sentence tracking"]
        DD["Deduplication\nFirst mention per condition"]
    end

    subgraph NER_M["NER Methods — ner.py"]
        SC["SciSpaCy\nen_ner_bc5cdr_md\nprimary"]
        VM["Vocabulary Matcher\n~85 conditions\nzero-dep fallback"]
    end

    subgraph RULE_CLF["Rule-Based Classifier — classifier.py"]
        PN["Pseudo-Negation Masking\n'no longer' · 'not only' …"]
        CUE["Weighted Cue Matching\nNegation · Resolved · Ongoing · Ambiguous"]
        TE["Temporal Signal\ntemporal.py\n'3 yrs ago' → resolved · 'today' → ongoing"]
        CA["Clause-Aware Override\nFinal clause after But / However wins"]
        CB["Calibration\ncalibration.py · Platt scaling"]
    end

    subgraph BAYES_CLF["Bayesian Fusion — bayesian_fusion.py"]
        LP["Section-Conditional Log-Priors\nPMH: P(resolved)=0.55 · HPI: P(ongoing)=0.50"]
        BF["Bayes-Factor Cue Updates\nLog-likelihood ratios from rules.py"]
        TAM["TAM Signal\ntam.py · Tense · Aspect · Modality\nCompositional grammatical evidence"]
        AT["Attribution Signal\nattribution.py · Patient · Record · Hedge"]
        SM["Softmax → Posterior Distribution\n+ Shannon Entropy (0–2 bits)"]
    end

    HYB["Hybrid Classifier — hybrid.py\nRule-based status + Bayesian uncertainty\nTriage flag: entropy > 1.2 bits or system disagreement"]

    subgraph OUTPUT["Output"]
        PRES["PipelineResult\nconditions · sections_found · ner_method · warnings"]
        RES["ConditionResult per entity\ncondition · status · confidence · section\ncontext · reason · trajectory"]
        TF["Triage Flag\nHigh-uncertainty predictions\nflagged for human review"]
    end

    APP -->|full note| PIPELINE
    APP -->|single phrase| HYB
    CLI -->|evaluate dataset| PIPELINE

    SD --> NM --> NER --> SS --> CE --> CL --> DP --> SP --> CO --> TR --> DD

    NER --> NER_M

    CL --> RULE_CLF
    HYB --> RULE_CLF
    HYB --> BAYES_CLF
    HYB --> TF

    DD --> PRES --> RES

How It Works

1. Pseudo-negation filtering

Motivation. Not every "no" in clinical text is a negation. "No longer has headache" means the condition has resolved; "No improvement noted" means it persists. A naive negation detector mis-classifies both — and these patterns are common enough in clinical documentation that getting them wrong meaningfully degrades accuracy.

The system detects and masks pseudo-negation patterns before cue scoring, so the true clinical meaning is preserved.

Phrase	Naive	This system	Why
`No longer has headache`	`negated`	`resolved`	"no longer" = condition ended, not denied
`Not improving on current regimen`	`negated`	`ongoing`	condition present, not responding
`No improvement noted`	`negated`	`ongoing`	condition persists unchanged
`No change in diabetes status`	`negated`	`ongoing`	unchanged = still present
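The masking step can be sketched as follows. This is a minimal illustration, not the code in `src/rules.py`: the pattern list, the `mask_pseudo_negations` name, and the toy `naive_is_negated` detector are assumptions built from the four rows above.

```python
import re

# Hypothetical pattern list taken from the table rows; each pattern carries
# the status it hints at once it is recognised as a pseudo-negation.
PSEUDO_NEGATIONS = [
    (r"\bno longer\b", "resolved"),
    (r"\bnot improving\b", "ongoing"),
    (r"\bno improvement\b", "ongoing"),
    (r"\bno change\b", "ongoing"),
]

# Toy stand-in for a naive negation detector.
NEGATION_CUE = re.compile(r"\b(?:no|not|denies|without)\b", re.IGNORECASE)


def mask_pseudo_negations(text):
    """Blank out pseudo-negation spans and return (masked_text, hints)."""
    hints = []
    masked = text
    for pattern, hint in PSEUDO_NEGATIONS:
        def _blank(match, hint=hint):
            hints.append(hint)
            # Keep the length unchanged so character offsets stay valid.
            return "_" * len(match.group(0))
        masked = re.sub(pattern, _blank, masked, flags=re.IGNORECASE)
    return masked, hints


def naive_is_negated(text):
    return NEGATION_CUE.search(text) is not None
```

After masking, the negation matcher no longer sees "No longer" or "No improvement", while a true negation such as "No evidence of pneumonia" passes through unchanged.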

2. Adversative clause detection

Motivation. Compound sentences contain contradictory signals. An earlier clause may describe a prior state that the final clause overrides — taking the first strong signal produces the wrong result, and taking the strongest signal ignores the temporal ordering that gives the final clause its authority.

When a sentence has clauses separated by an adversative conjunction or period, only the final clause is classified.

"I had severe flu which I think is getting better now.
 But after a couple of days, it got completely over."
→ resolved  (final clause wins; first clause ignored)
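A rough sketch of final-clause selection, assuming clause boundaries are periods plus the two conjunctions named in this README ("but", "however"). The `final_clause` helper and its regex are illustrative and are not taken from `src/classifier.py`.

```python
import re

# Boundaries: a sentence-ending period, or an adversative conjunction
# (optionally wrapped in commas). The conjunction list is an assumption.
_CLAUSE_BOUNDARY = re.compile(
    r"\.\s+|,?\s*\b(?:but|however)\b,?\s*", re.IGNORECASE
)


def final_clause(text):
    """Return the last non-empty clause; only this clause would be classified."""
    clauses = [part.strip(" .") for part in _CLAUSE_BOUNDARY.split(text)]
    clauses = [part for part in clauses if part]
    return clauses[-1] if clauses else text.strip()
```

For the example above, only "after a couple of days, it got completely over" reaches the cue matcher, so the earlier "getting better" signal cannot compete.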

3. Temporal signal detection

Motivation. Keyword cues alone cannot distinguish "chest pain 3 years ago" (resolved) from "chest pain since this morning" (ongoing). The same condition entity appears in both phrases — only the temporal expression differentiates them, and no keyword fires on the entity itself. Past and present temporal expressions carry independent evidence that must feed the classification separately from keyword matches.

Phrase	Signal	Effect
`"DM diagnosed 3 years ago"`	past	boosts resolved
`"Chest pain since this morning"`	present	boosts ongoing
`"Was treated for pneumonia last year"`	past	boosts resolved
`"Acute onset shortness of breath"`	present	boosts ongoing
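As a hedged sketch of how such a detector might look: the two regexes below cover only the expressions in the table and are assumptions, not the patterns in `src/temporal.py`. The `temporal_signal` name and the rule that a past match takes precedence are likewise illustrative.

```python
import re

# Illustrative patterns covering only the four table rows.
PAST = re.compile(
    r"\b(?:\d+\s+(?:days?|weeks?|months?|years?|yrs?)\s+ago"
    r"|last\s+(?:week|month|year))\b",
    re.IGNORECASE,
)
PRESENT = re.compile(
    r"\b(?:since\s+this\s+morning|today|currently|acute\s+onset)\b",
    re.IGNORECASE,
)

# Which status each signal boosts, as described in the table.
BOOSTS = {"past": "resolved", "present": "ongoing"}


def temporal_signal(text):
    """Return 'past', 'present', or None. Past wins if both match (assumption)."""
    if PAST.search(text):
        return "past"
    if PRESENT.search(text):
        return "present"
    return None
```

The signal is returned separately from any keyword cue so it can be added to the score as independent evidence.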

4. Sentence-boundary-aware context windows

Motivation. Fixed character windows bleed signals across sentence boundaries. "No fever. Patient has diabetes." — a 50-character window around "diabetes" reaches the "No" from sentence 1, causing a false negation. Clinical notes have high sentence density; adjacent sentences routinely carry opposite status signals.

Entity classification uses only the sentence containing the entity. Each sentence is a self-contained evidence unit.

"No fever.  Patient has diabetes."
 ─────────  ─────────────────────
 sentence 1   sentence 2 → "No" from sentence 1 never reaches diabetes
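The difference between the two context strategies can be shown with a small sketch. Both helpers are hypothetical, and the sentence splitter here is a naive regex without the abbreviation protection that `src/sentence_splitter.py` is described as having.

```python
import re


def char_window(note, entity, width=50):
    """Fixed character window around the entity (the approach being avoided)."""
    start = note.lower().index(entity.lower())
    return note[max(0, start - width): start + len(entity) + width]


def sentence_context(note, entity):
    """Return only the sentence that contains the entity, or None."""
    for sentence in re.split(r"(?<=[.!?])\s+", note.strip()):
        if entity.lower() in sentence.lower():
            return sentence
    return None
```

With the note above, the character window around "diabetes" still contains "No fever", whereas the sentence-scoped context is just "Patient has diabetes."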

5. Pronoun coreference within sections

Motivation. Clinical notes frequently refer back to a just-mentioned condition with "it", "this", or "they". The entity's own sentence may be a bare noun phrase with no classifiable cues — "The patient had a cough." — while the definitive status appears in the next sentence via a pronoun. Ignoring pronoun sentences loses valid evidence that is directly about the entity.

When a pronoun sentence contains a confident status signal and the entity's own sentence was weak (confidence < 0.65), the status is attributed to the most recent entity in the same section. Coref is section-scoped to prevent cross-section contamination.

"The patient had a cough. It resolved."
                          ──────────── → cough: resolved (87% conf)
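A minimal sketch of the attribution rule, assuming each sentence in a section has already been classified. The dictionary layout, the `resolve_pronouns` name, the pronoun list, and the 0.70 "confident" cut-off are assumptions; only the 0.65 weak-entity threshold comes from the text above.

```python
import re

PRONOUN_START = re.compile(r"^\s*(?:it|this|they)\b", re.IGNORECASE)
WEAK_ENTITY_CONF = 0.65      # from the README
CONFIDENT_PRONOUN = 0.70     # hypothetical cut-off for "confident"


def resolve_pronouns(section_sentences):
    """Attribute confident pronoun-sentence statuses to the latest entity.

    Each item is a dict with keys: text, entity (str or None), status,
    confidence. The list must come from a single section, which keeps the
    resolution section-scoped. Entity dicts are updated in place.
    """
    last_entity = None
    for sent in section_sentences:
        if sent["entity"] is not None:
            last_entity = sent
        elif (
            last_entity is not None
            and PRONOUN_START.match(sent["text"])
            and sent["confidence"] >= CONFIDENT_PRONOUN
            and last_entity["confidence"] < WEAK_ENTITY_CONF
        ):
            last_entity["status"] = sent["status"]
            last_entity["confidence"] = sent["confidence"]
    return [s for s in section_sentences if s["entity"] is not None]
```

A strongly classified entity (confidence at or above 0.65) is left untouched, so a later pronoun sentence cannot overwrite it.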

6. Tense-Aspect-Modality (TAM)

Motivation. All keyword-based classifiers — including this system's rule and Bayesian layers — are lexically grounded. They fire on words, not grammatical structure. But grammatical predicate structure carries strong status information that no keyword can capture:

"Symptoms are worsening" — no temporal adverb, but present progressive = ongoing
"Blood pressure should be monitored" — no cue fires, but deontic modal = ongoing obligation
"The infection had resolved" — past perfect = completed before a past reference point → resolved
"Symptoms might be worsening" — strong ongoing cue, but epistemic modal raises uncertainty

Adding "might have been resolving" as a keyword phrase would handle only that exact string. TAM extracts the grammatical signature and handles unseen combinations compositionally.

The TAM signature of the governing predicate is extracted and mapped to independent LLRs feeding the same Bayesian log-score accumulator as keyword cues. Tense, aspect, and modality each contribute separately:

"might have been resolving"
  epistemic_weak (modal)  → shifts posterior toward ambiguous
  + progressive (aspect)  → shifts toward ongoing
  = high-entropy posterior → triage flagged

TAM patterns are intentionally conservative — only specific clinical verb constructions, never bare is/has/was — to prevent false positives in negation contexts. "Patient had no fever" fires no TAM signal and stays negated; "The fever had resolved" fires past_perfect and boosts resolved.
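To illustrate the conservative matching, here is a sketch that extracts a TAM signature as a set of feature names. The verb lists, feature names other than `epistemic_weak` and `past_perfect`, and the `tam_signature` helper are assumptions; the actual patterns and LLR values in `src/tam.py` are not shown in this README, so no LLRs are assigned here.

```python
import re

# Hypothetical clinical verb lists. Bare "is/has/was" never match on their
# own: every pattern needs a specific verb form after the auxiliary.
_VERB_ED = r"(?:resolved|improved|cleared|subsided)"
_VERB_ING = r"(?:worsening|resolving|improving|progressing)"

TAM_PATTERNS = {
    "epistemic_weak": re.compile(r"\b(?:might|may|could)\b", re.IGNORECASE),
    "deontic": re.compile(r"\bshould\s+be\s+\w+ed\b", re.IGNORECASE),
    "past_perfect": re.compile(r"\bhad\s+" + _VERB_ED + r"\b", re.IGNORECASE),
    "progressive": re.compile(
        r"\b(?:is|are|was|were|been|be)\s+" + _VERB_ING + r"\b", re.IGNORECASE
    ),
}


def tam_signature(text):
    """Return the set of TAM features whose pattern fires on the text."""
    return {name for name, pattern in TAM_PATTERNS.items() if pattern.search(text)}
```

Because each feature is detected independently, "might have been resolving" yields both a modal and a progressive feature without needing a dedicated keyword entry, while "Patient had no fever" yields nothing.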

7. Attribution-aware confidence

Motivation. TAM captures how the predicate is expressed. Temporal signals capture when. Both are silent about who is asserting the status. A patient who thinks they have hypertension, a family member who reports a seizure history, and an EHR fragment that shows diabetes are qualitatively different claims — carrying different degrees of clinical certainty — and none are captured by keyword, temporal, or TAM signals.

Key novel case — record attribution. "Per records, hypertension" and "Records show asthma" contain no resolved keyword, no temporal adverb, and no TAM signal. Without attribution, the classifier has no resolved evidence and defaults to ongoing/ambiguous. The record source contributes LLR +1.0 to resolved — enough to produce the correct classification from a single signal.

Each asserter type maps to an independent LLR vector:

Source	Clinical meaning	Primary LLR effect
`record`	per records/chart, records show/document	+1.0 to resolved
`patient_hedge`	patient thinks/believes/suspects	+0.80 to ambiguous
`clinician_hedge`	we think/believe, appears consistent with	+0.60 to ambiguous
`family_report`	family/wife/caregiver reports/states	+0.40 to ambiguous
`patient_report`	patient reports/states/endorses	+0.30 to ambiguous

All attribution LLRs are capped at ±1.0. Strong keyword cues (LLR ≈ 6.9 for weight=0.999) always dominate — "Patient denies fever" stays negated even though patient_report fires, because denies (LLR ≈ 3.7) overwhelms the +0.30 attribution signal. "Family history of diabetes" does not trigger family_report because the pattern requires a report verb after the family noun.
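The numbers in this paragraph can be checked with a short sketch. The weight-to-LLR conversion is the log-odds form used in the fusion formula; the `FAMILY_REPORT` regex, the helper names, and the dictionary layout are assumptions rather than the contents of `src/attribution.py`.

```python
import math
import re

ATTRIBUTION_CAP = 1.0

# (target label, LLR) per source, copied from the table above.
ATTRIBUTION_LLR = {
    "record": ("resolved", 1.0),
    "patient_hedge": ("ambiguous", 0.80),
    "clinician_hedge": ("ambiguous", 0.60),
    "family_report": ("ambiguous", 0.40),
    "patient_report": ("ambiguous", 0.30),
}

# family_report requires a report verb directly after the family noun.
FAMILY_REPORT = re.compile(
    r"\b(?:family|wife|caregiver)\s+(?:reports|states)\b", re.IGNORECASE
)


def weight_to_llr(weight):
    """Convert a cue weight in (0, 1) to a log-likelihood ratio."""
    return math.log(weight / (1.0 - weight))


def capped(llr):
    """Clamp an attribution LLR to the +/-1.0 cap."""
    return max(-ATTRIBUTION_CAP, min(ATTRIBUTION_CAP, llr))
```

`weight_to_llr(0.999)` is about 6.9, which is why a strong keyword cue outweighs any capped attribution signal.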

8. Status trajectory

Motivation. Every prior approach classifies each entity against the single sentence that contains it. When the same condition appears multiple times within a section, earlier sentences are discarded entirely. This throws away directional evidence: a condition mentioned as ongoing in sentence 1 and resolved in sentence 3 is a resolution — a clinically meaningful transition — not an ambiguous signal to be averaged or ignored.

The sequence of status classifications for a condition across all sentences in a section is reconciled using time-decayed log-evidence accumulation. Each transition type earns a bonus LLR encoding the clinical prior that this transition is meaningful:

Transition	Clinical meaning	Bonus LLR
`ongoing → resolved`	resolution — condition cleared	+0.80
`resolved → ongoing`	relapse — history item now active	+1.00
`negated → ongoing`	contradiction — prior denial overridden	+0.90
`ambiguous → [definite]`	clarification — uncertainty resolved	+0.60

Mention i of n contributes with weight 0.7^(n−1−i), so the most recent mention dominates but earlier signals still inform. This improves on "take the last sentence" when the final sentence is weak, because confident earlier context still constrains the posterior. The transition_type field (relapse, contradiction, multi_transition) feeds directly into the hybrid triage system for clinical review.
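The reconciliation can be sketched as below. The decay factor and bonus values come from this section; how per-mention confidence is turned into log-evidence (here, log-odds of the confidence) and the `reconcile` interface are assumptions, not the implementation in `src/trajectory.py`.

```python
import math

DECAY = 0.7
TRANSITION_BONUS = {
    ("ongoing", "resolved"): 0.80,   # resolution
    ("resolved", "ongoing"): 1.00,   # relapse
    ("negated", "ongoing"): 0.90,    # contradiction
}
CLARIFICATION_BONUS = 0.60           # ambiguous -> any definite status


def decay_weights(n):
    """Weight 0.7^(n-1-i) for mention i of n; the last mention gets 1.0."""
    return [DECAY ** (n - 1 - i) for i in range(n)]


def reconcile(mentions):
    """mentions: list of (status, confidence) in sentence order.

    Returns (winning_status, scores). Log-odds of confidence is an assumed
    stand-in for per-mention log-evidence.
    """
    scores = {}
    for (status, conf), weight in zip(mentions, decay_weights(len(mentions))):
        conf = min(max(conf, 1e-6), 1 - 1e-6)
        scores[status] = scores.get(status, 0.0) + weight * math.log(conf / (1 - conf))
    # Add a bonus to the status reached by each recognised transition.
    for (prev, _), (curr, _) in zip(mentions, mentions[1:]):
        if prev == "ambiguous" and curr != "ambiguous":
            bonus = CLARIFICATION_BONUS
        else:
            bonus = TRANSITION_BONUS.get((prev, curr), 0.0)
        scores[curr] = scores.get(curr, 0.0) + bonus
    return max(scores, key=scores.get), scores
```

For three mentions the weights are 0.49, 0.7, and 1.0, so an early confident mention still contributes roughly half the evidence of the latest one.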

Complementarity with coreference. Pronoun coreference handles "Patient has cough. It resolved." (pronoun bridge). Trajectory handles "Patient has cough. Cough resolved." (explicit re-mention). Neither replaces the other: coreference (step ⑨) runs immediately before trajectory refinement (step ⑩) in the same pipeline pass.

9. Bayesian evidence fusion

Motivation. The rule-based classifier returns a single confidence score and argmax — no distribution, no uncertainty quantification. Without a proper posterior, there is no principled basis for combining multiple evidence sources, detecting conflicting signals, or routing uncertain predictions to human review. Confidence scores based on how strongly rules fire are not the same as calibrated probabilities of being correct.

Bayesian fusion treats cue weights as calibrated likelihood ratios and accumulates all evidence into a posterior distribution. For each label ℓ:

log_score[ℓ] = log P(label=ℓ | section)       ← section-conditional prior
             + Σ log(w / (1-w))                ← for each cue targeting ℓ
             - Σ log(w' / (1-w')) / 3          ← for each cue targeting ℓ' ≠ ℓ
             + TAM LLR[ℓ]                      ← grammatical predicate structure
             + attribution LLR[ℓ]              ← asserter identity

posterior[ℓ] = softmax(log_score)[ℓ]

All evidence sources — keywords, temporal, TAM, attribution — are orthogonal dimensions summing into the same log-score vector before softmax. Section priors encode clinical domain knowledge: PMH → resolved=0.55; HPI/Assessment → ongoing=0.50.
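The formula above translates fairly directly into code. The sketch below follows it term by term; the function names are illustrative, and the example prior spreads the remaining 0.45 of the PMH prior evenly across the other three labels, which is an assumption since only P(resolved)=0.55 is given.

```python
import math

LABELS = ("ongoing", "resolved", "negated", "ambiguous")


def llr(weight):
    """log(w / (1 - w)) for a cue weight in (0, 1)."""
    return math.log(weight / (1.0 - weight))


def posterior(prior, cues, extra_llr=None):
    """prior: {label: P}; cues: [(target_label, weight)];
    extra_llr: {label: summed TAM + attribution LLR} (optional)."""
    extra_llr = extra_llr or {}
    scores = {}
    for label in LABELS:
        score = math.log(prior[label])
        for target, weight in cues:
            # Supporting cues add their LLR; competing cues subtract a third.
            score += llr(weight) if target == label else -llr(weight) / 3
        scores[label] = score + extra_llr.get(label, 0.0)
    # Numerically stable softmax.
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def entropy_bits(dist):
    """Shannon entropy in bits; ranges from 0 to 2 for four labels."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)
```

With a uniform prior and no evidence the entropy is the maximum 2 bits; a single strong cue in a matching section collapses it toward 0.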

10. Hybrid triage

Motivation. In clinical settings, the cost of an incorrect classification is high, but routing every prediction for human review defeats the purpose of automation. The goal is to identify which specific predictions are uncertain enough to warrant review, while auto-approving the rest with empirically high accuracy. Entropy over the posterior provides this signal: on the evaluation set, wrong predictions have 2.5× higher entropy than correct ones.

A prediction is flagged when either:

Bayesian entropy > 1.2 bits (tunable default)
The rule-based system and Bayesian system predict different labels

result = classify("History of poorly controlled hypertension.")
# result["entropy"]      → 1.21 bits  (two signals compete)
# result["triage_flag"]  → True       (flagged for review)
# result["runner_up"]    → ("ongoing", 0.47)
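The two flagging conditions can be written as a small decision function. This is a sketch, not `src/hybrid.py`: the `triage` signature and the format of `triage_reason` are assumptions, and the posteriors used in the usage note are made-up illustrative values.

```python
import math

ENTROPY_THRESHOLD_BITS = 1.2   # tunable default from the README


def entropy_bits(posterior):
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)


def triage(rule_status, posterior, threshold=ENTROPY_THRESHOLD_BITS):
    """Flag a prediction if entropy is high or the two systems disagree."""
    bayes_status = max(posterior, key=posterior.get)
    entropy = entropy_bits(posterior)
    reasons = []
    if entropy > threshold:
        reasons.append("high_entropy")
    if bayes_status != rule_status:
        reasons.append("disagreement")
    return {
        "triage_flag": bool(reasons),
        "triage_reason": ", ".join(reasons),
        "entropy": entropy,
        "agreement": bayes_status == rule_status,
    }
```

A near-tie such as `{"resolved": 0.48, "ongoing": 0.47, "negated": 0.03, "ambiguous": 0.02}` exceeds 1.2 bits and is flagged even when both systems agree, while a confident posterior is flagged only if the rule-based label differs.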

11. Calibration transfer

Motivation. Rule-based systems assign confidence scores based on how strongly rules fire, not empirical accuracy — these scores are systematically miscalibrated. Collecting labelled real clinical text for calibration fitting is expensive and requires de-identification. If calibration can be fitted entirely on synthetic data and transferred to real text, the overhead of calibration fitting drops to near zero.

Hypothesis: miscalibration is driven by rule activation patterns, not surface form variation, so a calibration model fitted on synthetic text should transfer to real text.

Isotonic regression fitted on 2,850 synthetic phrases reduces ECE by 97% on real clinical text, which supports the transfer hypothesis.
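For readers unfamiliar with the metric, expected calibration error (ECE) is commonly computed by binning predictions by confidence and averaging the gap between confidence and accuracy per bin. The sketch below uses ten equal-width bins, which is an assumption; the binning used in `src/calibration.py` is not stated here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins.

    confidences: predicted confidence per example, in [0, 1].
    correct:     1 if the prediction was right, else 0.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece
```

A system that reports 100% confidence but is right half the time scores an ECE of 0.5; one that reports 90% and is right nine times in ten scores close to 0.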

Design decisions

Why rule-based rather than ML?

Every prediction can be traced to a specific cue, temporal expression, or section prior. An ML baseline (TF-IDF + logistic regression, trained on the same 2,850 synthetic phrases) shows exactly where domain engineering wins:

System	Accuracy	Misclassified
Rule-based (this system)	97%	1 / 39
TF-IDF + logistic regression	90%	4 / 39

Phrase	Gold	ML predicts	Why ML fails
`"Asthma better today"`	`ongoing`	`resolved`	"better" correlates with resolved in training data
`"s/p appendectomy"`	`resolved`	`ongoing`	abbreviation not expanded before ML sees it
`"Cough, no improvement noted"`	`ongoing`	`negated`	ML sees "no"; rule system masks pseudo-negation

Why weighted multi-signal scoring rather than priority order?

The original system used fixed priority: negated → ambiguous → resolved → ongoing. This fails when two signals appear in the same phrase:

"History of asthma, currently worsening"
 → Priority order: resolved (first cue found)
 → Weighted scoring: ongoing (worsening 0.95 + currently 0.90 + bonus > history of 0.95)
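The contrast can be reproduced with a toy comparison. The cue weights are the ones quoted above; the size of the multi-cue bonus (0.10) and both function names are hypothetical, since the README only says that a bonus exists.

```python
PRIORITY = ("negated", "ambiguous", "resolved", "ongoing")
MULTI_CUE_BONUS = 0.10   # hypothetical value; the README does not give it


def priority_label(cues):
    """Original behaviour: the first category in priority order wins."""
    present = {label for label, _, _ in cues}
    for label in PRIORITY:
        if label in present:
            return label
    return None


def weighted_label(cues):
    """Sum cue weights per category, with a bonus when several cues agree."""
    totals, counts = {}, {}
    for label, _, weight in cues:
        totals[label] = totals.get(label, 0.0) + weight
        counts[label] = counts.get(label, 0) + 1
    for label, count in counts.items():
        if count > 1:
            totals[label] += MULTI_CUE_BONUS
    return max(totals, key=totals.get) if totals else None


# Cues found in "History of asthma, currently worsening".
CUES = [
    ("resolved", "history of", 0.95),
    ("ongoing", "currently", 0.90),
    ("ongoing", "worsening", 0.95),
]
```

On `CUES`, `priority_label` returns `resolved` because that category ranks above `ongoing`, while `weighted_label` returns `ongoing` because two agreeing cues outweigh one.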

Why a two-level pipeline?

The phrase classifier is optimised for short, focused input. The pipeline feeds it exactly what it needs — a single sentence from the appropriate section — without the classifier needing to know how sections, NER, or sentence splitting work.

Results

Phrase-level accuracy

Evaluated on 159 labelled clinical phrases covering all four status labels and TAM-sensitive, attribution-sensitive, and temporally-marked constructions.

System	Accuracy	ECE
Rule-based	89.8%	0.109
Bayesian fusion	87.4%	0.125

Per-label ECE — Bayesian wins on 3 of 4 categories:

Label	Rule ECE	Bayes ECE
ongoing	0.259	0.215
resolved	0.116	0.104
negated	0.142	0.131
ambiguous	0.140	0.190

Reproduce: python experiments/bayesian_fusion_eval.py

TAM contribution

Comparison on 151-phrase set (includes 64 TAM-sensitive phrases), with vs without TAM:

Metric	Without TAM	With TAM	Δ
Accuracy	76.2%	90.1%	+13.9%
ECE	0.065	0.157	+0.091
Mean entropy	1.012 bits	0.915 bits	−9.6%
Predictions changed	—	21 improved, 0 hurt	—

Per-label accuracy on TAM-sensitive constructions (64 phrases):

Label	Without TAM	With TAM	Δ
ambiguous	48%	88%	+40%
resolved	84%	93%	+9%
ongoing	90%	90%	0%
negated	86%	86%	0%

TAM fires on 40/151 (26%) phrases. ECE increases because the system becomes more confident on the new phrases — while those confidences are directionally correct, the isotonic calibrator was fitted on the original 127-phrase distribution and does not re-calibrate for the expanded set.

Reproduce: python experiments/tam_eval.py

Attribution contribution

Comparison on 159-phrase set (includes 8 record-attribution phrases), with vs without attribution:

Metric	Without attribution	With attribution	Δ
Accuracy	86.2%	90.6%	+4.4%
Predictions changed	—	7 improved, 0 hurt	—

Attribution fires on only 8/159 (5%) phrases, all from the record source. Seven predictions change from incorrect (ongoing) to correct (resolved), and none are hurt.

Targeted evaluation on 25 attribution-specific cases (four of the five source types; family_report is not represented):

Source	n	Without attribution	With attribution	Δ
record	12	25%	100%	+75%
patient_hedge	4	50%	100%	+50%
clinician_hedge	4	0%	75%	+75%
patient_report	5	100%	100%	0%
Overall	25	40%	96%	+56%

patient_report baseline is already 100% because the ongoing cue independently scores the phrase correctly — the attribution signal is additive but not load-bearing. clinician_hedge (75%) misses one case where the hedged phrase has a strong keyword cue that dominates the attribution LLR.

Reproduce: python experiments/attribution_eval.py

Trajectory contribution

Evaluated on 40 multi-sentence passages, each with 2 explicit mentions of the same condition:

Transition type	n	Baseline	Trajectory	Δ
resolution	10	10%	100%	+90%
relapse	10	10%	40%	+30%
contradiction	5	0%	40%	+40%
clarification	5	0%	40%	+40%
stable	10	100%	100%	0%
Overall	40	30%	70%	+40%

16 improved, 0 hurt. On this set, trajectory never degrades a correct single-sentence prediction.

Reproduce: python experiments/trajectory_eval.py

Calibration transfer

Training set: 2,850 synthetic phrases. Test set: 159 real clinical phrases.

Method	ECE	Brier	ECE vs raw
Uncalibrated	0.109	0.091	—
Platt scaling	0.065	0.071	−40%
Isotonic regression	0.003	0.061	−97%
Temperature scaling	0.130	0.095	+20%

Per-category ECE:

Category	n	Uncalibrated	Platt	Isotonic	Temperature
ongoing	42	0.259	0.093	0.153	0.236
resolved	38	0.116	0.083	0.032	0.102
negated	22	0.142	0.115	0.095	0.157
ambiguous	25	0.140	0.176	0.116	0.164

Temperature scaling fails because miscalibration is non-uniform — a single global scalar cannot correct category-specific over/under-confidence. For ongoing, Platt (ECE 0.093) outperforms isotonic (0.153) due to wider score spread; isotonic dominates for resolved and negated.

Reproduce: python experiments/calibration_transfer.py

Hybrid triage

Evaluated on 127-phrase set, default threshold 1.2 bits:

Metric	Value
Phrases flagged	48 / 127 (38%)
Recall of errors	100% — every wrong prediction flagged
Precision	27% — 1 in 3.7 flagged phrases is an actual error
Auto-approved accuracy	100% — 62% of predictions, no errors
Review efficiency	2.6× fewer phrases read per error

Wrong predictions have 2.5× higher entropy than correct ones, making entropy a reliable signal for routing uncertain predictions to human review. At best-F1 threshold (1.8 bits): 21% flagged, 77% recall, 37% precision. The threshold is tunable — evaluate_triage(csv_path, thresholds=[...]) sweeps thresholds and returns precision/recall/F1 at each.

Reproduce: python experiments/hybrid_eval.py

Note-level evaluation

Evaluated on 7 annotated clinical notes (4 general + 3 trajectory-specific: resolution, relapse, clarification), with vs without trajectory:

Metric	Without trajectory	With trajectory	Δ
Precision	0.625	0.684	+9.4%
Recall	0.781	0.848	+8.6%
F1	0.694	0.757	+9.1%

14/14 trajectory-dependent conditions correctly classified across the 3 new notes. The 7 FN errors in both configurations are pre-existing NER misses unrelated to status classification.

Reproduce: python main.py (note-level section)

Architecture

Level 1 — Phrase Classifier

Raw phrase
    │
    ▼
Abbreviation normalizer     h/o → history of, -ve → negative for, s/p → status post
    │
    ▼
Pseudo-negation masking     "no longer", "not improving" → masked before cue matching
    │
    ▼
Multi-signal cue matching   word-boundary regex, weighted scores per category
    │
    ▼
Temporal signal detection   past/present time expressions boost resolved/ongoing
    │
    ▼
Adversative clause check    "But…" / "However…" / "." → classify final clause
    │
    ▼
Conflict detection          competing signals → reduced confidence
    │
    ▼
Platt calibration           raw cue-score → P(correct)
    │
    ▼
{status, confidence, calibrated_confidence, cue, reason, signals}

Level 1b — Hybrid Classifier

Raw phrase + section
    │
    ├──► Rule-based classifier ──► status, confidence, calibrated_confidence
    │
    └──► Bayesian fusion ──────► posterior {label: P}, entropy (bits)
                │
                ▼
         Triage decision    entropy > 1.2 bits OR systems disagree → triage_flag=True
                │
                ▼
{status, posterior, entropy, runner_up, triage_flag, triage_reason, agreement}

Level 2 — Full Note Pipeline

Raw clinical note
    │
    ▼
Section detector            PMH → resolved prior | HPI → ongoing prior
    │
    ▼  (per section)
Abbreviation normalizer  →  Sentence splitter  →  NER
    │
    ▼
Context extraction          sentence containing each entity (not fixed char window)
    │
    ▼
Phrase classifier  →  Dep parser refinement  →  Section prior override
    │
    ▼
Pronoun coreference  →  Trajectory refinement  →  Deduplication
    │
    ▼
[{condition, status, confidence, section, reason, trajectory?}, ...]

Usage

Single phrase

from src.classifier import classify_condition_status
from src.hybrid import classify

# Rule-based only
classify_condition_status("History of asthma, currently worsening")
# → {"status": "ongoing", "confidence": 1.0, "calibrated_confidence": 0.97, ...}

# Hybrid (rule-based + Bayesian uncertainty)
classify("No evidence of pneumonia.")
# → {"status": "negated", "triage_flag": False, "entropy": 0.0, ...}

Full clinical note

from src.pipeline import process_note, format_results

note = """
History of Present Illness:
67-year-old female presenting with worsening dyspnea.
She reports fatigue for 3 days. Denies chest pain or fever.

Past Medical History:
Hypertension, type 2 diabetes mellitus (diagnosed 5 years ago),
h/o pneumonia (resolved last year), atrial fibrillation controlled on medication.

Assessment:
Possible heart failure exacerbation. Rule out pulmonary embolism.
"""
print(format_results(process_note(note)))

CONDITION                           STATUS         CONF  SECTION
----------------------------------------------------------------
dyspnea                             ongoing        100%  history_of_present_illness
chest pain                          negated        100%  history_of_present_illness
fever                               negated        100%  history_of_present_illness
Hypertension                        resolved       100%  past_medical_history
type 2 diabetes mellitus            resolved       100%  past_medical_history
pneumonia                           resolved       100%  past_medical_history
atrial fibrillation                 resolved       100%  past_medical_history
heart failure                       ambiguous       76%  assessment
pulmonary embolism                  ambiguous       95%  assessment

Output schema

classify_condition_status(text):

{
    "status":                "ongoing",
    "confidence":            0.82,
    "calibrated_confidence": 0.95,
    "cue":                   "worsening",
    "reason":                "Ongoing/active cue found: 'worsening' | Temporal hint: present",
    "signals": {
        "negated": 0.0, "ambiguous": 0.0, "resolved": 0.95, "ongoing": 1.0,
        "temporal": "present", "pseudo_negations": [], "clause_used": "full"
    }
}

hybrid.classify(text, section):

{
    "status":                "ongoing",
    "confidence":            0.82,
    "calibrated_confidence": 0.95,
    "posterior":             {"ongoing": 0.71, "resolved": 0.18, "negated": 0.07, "ambiguous": 0.04},
    "entropy":               0.43,
    "runner_up":             ("resolved", 0.18),
    "agreement":             True,
    "triage_flag":           False,
    "triage_reason":         "",
    "rule_reason":           "Ongoing/active cue found: 'worsening'",
    "rule_cue":              "worsening",
    "bayes_status":          "ongoing",
    "signals":               { ... }
}

Streamlit app

streamlit run app.py

Tab	What it does
Single Phrase	Classify a phrase; shows triage flag, posterior bar chart, runner-up, confidence, expanded abbreviations
Full Clinical Note	Paste a note; runs the full pipeline and returns a colour-coded condition table
Evaluate Dataset	Phrase accuracy, reliability diagram + ECE, ML baseline comparison, calibration methods comparison, note-level P/R/F1

Project Structure

condition-status-classifier/
│
├── data/
│   ├── clinical_phrases.csv          159-phrase labelled dataset (incl. TAM-sensitive, record attribution)
│   ├── annotated_notes.json          7 annotated clinical notes (4 original + 3 trajectory-specific)
│   ├── calibration.json              fitted Platt scaler parameters
│   ├── calibration_phrases.csv       2,850 synthetic phrases for calibration fitting
│   └── generate_calibration_dataset.py
│
├── src/
│   ├── normalizer.py                 abbreviation expansion
│   ├── rules.py                      weighted cues (100+ entries) + pseudo-negation patterns
│   ├── temporal.py                   past/present temporal signal detection
│   ├── classifier.py                 phrase-level classifier
│   ├── dep_parser.py                 spaCy dep-tree: negation scope, list negation, temporal scope
│   ├── section_detector.py           note section splitter
│   ├── ner.py                        NER (SciSpaCy primary / vocabulary fallback)
│   ├── sentence_splitter.py          clinical sentence boundary detection
│   ├── coref.py                      pronoun-to-entity coreference within sections
│   ├── trajectory.py                 intra-section status trajectory (time-decay reconciliation)
│   ├── pipeline.py                   full note pipeline (incl. trajectory refinement)
│   ├── calibration.py                Platt scaler + ECE + calibration transfer helpers
│   ├── tam.py                        TAM extraction (tense/aspect/modality → LLRs)
│   ├── attribution.py                attribution-aware confidence (asserter identity → LLRs)
│   ├── bayesian_fusion.py            Bayesian evidence fusion (posterior + entropy + TAM + attribution)
│   ├── hybrid.py                     hybrid classifier (rule-based MAP + Bayesian triage)
│   ├── baseline.py                   TF-IDF + logistic regression baseline
│   ├── note_evaluator.py             pipeline P/R/F1 on annotated notes
│   └── utils.py                      phrase-level dataset evaluation
│
├── experiments/
│   ├── calibration_transfer.py
│   ├── bayesian_fusion_eval.py
│   ├── hybrid_eval.py
│   ├── tam_eval.py
│   ├── trajectory_eval.py
│   └── attribution_eval.py
│
├── tests/
│   ├── test_classifier.py            33 tests
│   ├── test_pipeline.py              44 tests
│   ├── test_coref.py                 21 tests
│   ├── test_dep_and_calibration.py   38 tests
│   ├── test_bayesian_fusion.py       36 tests
│   ├── test_hybrid.py                31 tests
│   ├── test_tam.py                   52 tests
│   ├── test_trajectory.py            29 tests
│   └── test_attribution.py           49 tests
│
├── app.py
├── main.py
├── pytest.ini
└── requirements.txt

Module reference

Module	Role
`src/normalizer.py`	Expands clinical abbreviations (`h/o → history of`, `s/p → status post`, `-ve → negative for`)
`src/rules.py`	Weighted cue lists (100+ entries) ordered most-to-least specific + pseudo-negation registry
`src/temporal.py`	Past/present temporal regex patterns; returns signal and confidence
`src/classifier.py`	Orchestrates phrase-level classification; emits `calibrated_confidence`
`src/dep_parser.py`	spaCy dependency parsing: negation scope, list negation, temporal modifier scope
`src/section_detector.py`	Splits clinical notes into labeled sections; drives section-conditional priors
`src/ner.py`	SciSpaCy primary NER + vocabulary fallback (~85 conditions)
`src/sentence_splitter.py`	Clinical sentence boundary detection with abbreviation protection
`src/coref.py`	Pronoun coreference within sections; fires only when entity confidence < 0.65
`src/trajectory.py`	Intra-section status trajectory: time-decayed log-evidence accumulation + transition-type LLR bonuses
`src/pipeline.py`	Orchestrates the full note-level pipeline; wires dep parser, coref, and trajectory refinement
`src/calibration.py`	Platt scaler (`calibrate()`), reliability diagram, ECE, Brier, isotonic/temperature methods
`src/tam.py`	TAM extraction: tense/aspect/modality → LLRs; compositionality over novel predicate constructions
`src/attribution.py`	Attribution-aware confidence: extracts asserter identity (record/patient/family/clinician) and maps each source to independent LLRs
`src/bayesian_fusion.py`	Bayesian evidence fusion: posterior distribution, Shannon entropy, section priors, TAM + attribution integration
`src/hybrid.py`	Hybrid classifier: rule-based MAP label + Bayesian posterior + entropy triage flag
`src/baseline.py`	TF-IDF + logistic regression baseline
`src/note_evaluator.py`	Precision/recall/F1 evaluation on annotated clinical notes
`src/utils.py`	Phrase-level dataset evaluation helper

Limitations

Limitation	Potential improvement
NER misses rare conditions, specialist terminology, and misspellings	Fine-tuned clinical NER or BERT-based token classifier trained on de-identified notes
Rule-based classification plateaus on novel phrasing and conditions outside the cue vocabulary	Fine-tuned BERT / clinical LLM trained on labelled clinical notes
Platt scaler fitted on 2,850 synthetic phrases — less reliable at the tails	Collect ≥500 real labelled phrases and refit
Dep-tree heuristics can mis-scope modifiers in complex multi-clause sentences	Dedicated relation extraction model linking negation/temporality to entity spans
Pronoun coreference fails when multiple entities are plausible antecedents	Neural coreference resolution (e.g. SpanBERT-based)
Section prior threshold (0.55) is not empirically calibrated	Tune on a held-out annotated note set
Dep parser silently degrades to regex-only if `en_core_web_sm` is not installed	Surface a clear warning in the UI and CLI
