Draft: Add NICE-based AMFV MVP baseline with Dense+BM25 retrieval and KISS eval #3

AdepojuJeremy wants to merge 8 commits into MedARC-AI:main from AdepojuJeremy:jeremiah/nice-ingestion

@AdepojuJeremy

feat(baseline): AMFV MVP pipeline with NICE KISS eval

First end-to-end AMFV MVP baseline, built on the NICE subset of epfl-llm/guidelines.


Summary

This PR implements the full AMFV baseline pipeline from ingestion through evaluation:

| Component | Implementation |
| --- | --- |
| Ingestion | NICE ingestion, preprocessing, JSONL export, document chunking |
| Retrieval | BM25 + Dense (Qwen3 embeddings via Sentence Transformers) with reciprocal-rank fusion |
| Cache | SQLite verified-claim cache |
| Pipeline | Cache-first baseline pipeline |
| Decomposer | LiteLLM-based FActScore-style claim decomposer |
| Verifier | LiteLLM-based Med-V1-style verifier |
| Eval | KISS eval generation from NICE chunks + full eval runner with summary metrics |
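
As a rough illustration of what "cache-first" means for the Cache and Pipeline rows, here is a minimal sketch using the standard-library sqlite3 module. The table name, column names, key normalisation, and the verify callback are all hypothetical and are not taken from this PR; the actual schema may differ.

```python
import sqlite3
from typing import Callable, Tuple

Verdict = Tuple[str, float]  # (verdict label, verifier score)


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open the cache and create the (hypothetical) table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS verified_claims ("
        "claim TEXT PRIMARY KEY, verdict TEXT NOT NULL, score REAL NOT NULL)"
    )
    return conn


def verify_cached(
    conn: sqlite3.Connection, claim: str, verify: Callable[[str], Verdict]
) -> Tuple[Verdict, bool]:
    """Return ((verdict, score), cache_hit); call `verify` only on a miss."""
    # Assumed normalisation: lower-case and collapse whitespace.
    key = " ".join(claim.lower().split())
    row = conn.execute(
        "SELECT verdict, score FROM verified_claims WHERE claim = ?", (key,)
    ).fetchone()
    if row is not None:
        return (row[0], row[1]), True
    # Cache miss: run the (expensive) verifier and store its answer.
    verdict, score = verify(claim)
    conn.execute(
        "INSERT INTO verified_claims (claim, verdict, score) VALUES (?, ?, ?)",
        (key, verdict, score),
    )
    conn.commit()
    return (verdict, score), False
```

With this shape, a repeated claim skips retrieval and the LLM verifier entirely, which is the property a cache-first pipeline relies on.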

Dense+BM25 was chosen as the Windows-friendly retrieval path. A separate ColBERT/NextPlaid path is being explored in parallel. This baseline provides a direct comparison point.
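
For readers unfamiliar with reciprocal-rank fusion, the sketch below shows the general technique for merging a BM25 ranking with a dense ranking. It is not the code from this PR: the function name is hypothetical, and the constant k=60 is the value commonly used in the literature, assumed here rather than confirmed for this baseline.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60, top_k: int = 3
) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by chunk ID for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_k]]


bm25_ranking = ["a", "b", "c"]   # illustrative chunk IDs only
dense_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because the fusion uses ranks only, BM25 scores and embedding similarities never need to be put on a common scale.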


Eval setup

The KISS eval set is source-grounded. Each case is generated from a NICE source sentence and stores:

  • input_text: the passage to verify
  • expected_claim: the atomic claim to check
  • expected_chunk_id: the relevant source chunk
  • expected_verdict: strongly_supported
  • source_metadata: guideline title, section, NICE ID
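
The fields listed above could be modelled roughly as follows. This is a sketch under assumptions: the class name, the helper, the keys inside source_metadata, and every value in the example record are placeholders invented for illustration, not real NICE content or the PR's actual types.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class KissEvalCase:
    input_text: str                  # the passage to verify
    expected_claim: str              # the atomic claim to check
    expected_chunk_id: str           # the relevant source chunk
    expected_verdict: str            # e.g. "strongly_supported"
    source_metadata: Dict[str, Any]  # guideline title, section, NICE ID


def parse_case(line: str) -> KissEvalCase:
    """Parse one JSONL line; raises KeyError if a required field is missing."""
    record = json.loads(line)
    return KissEvalCase(
        input_text=record["input_text"],
        expected_claim=record["expected_claim"],
        expected_chunk_id=record["expected_chunk_id"],
        expected_verdict=record["expected_verdict"],
        source_metadata=record.get("source_metadata", {}),
    )


# Placeholder record (not real guideline text); metadata keys are assumed.
example_line = json.dumps({
    "input_text": "placeholder passage",
    "expected_claim": "placeholder claim",
    "expected_chunk_id": "chunk-0001",
    "expected_verdict": "strongly_supported",
    "source_metadata": {"title": "placeholder", "section": "1.1", "nice_id": "placeholder"},
})
case = parse_case(example_line)
```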

The eval runner measures:

| Metric | Description |
| --- | --- |
| retrieval_hit@k | Exact chunk retrieved in top-k results |
| score_pass_rate | Verifier confidence score above threshold |
| verdict_match_rate | Predicted verdict matches expected verdict |
| case_pass_rate | All of the above pass for a single case |
| mean_best_score | Average top verifier score across cases |
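
One way to read the metric descriptions above is the following aggregation sketch. The CaseResult fields, the function name, and the 0.5 score threshold are assumptions made for illustration; the runner in this PR may define its per-case records and threshold differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class CaseResult:
    expected_chunk_id: str
    retrieved_chunk_ids: List[str]  # ranked, best first
    best_score: float               # top verifier score for the case
    expected_verdict: str
    predicted_verdict: str


def summarise(
    results: Sequence[CaseResult], k: int = 3, threshold: float = 0.5
) -> Dict[str, float]:
    """Aggregate per-case checks into the summary metrics in the table."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarise")
    hit = [r.expected_chunk_id in r.retrieved_chunk_ids[:k] for r in results]
    score_ok = [r.best_score > threshold for r in results]
    verdict_ok = [r.predicted_verdict == r.expected_verdict for r in results]
    # A case passes only when all three checks pass.
    case_ok = [h and s and v for h, s, v in zip(hit, score_ok, verdict_ok)]
    return {
        f"retrieval_hit@{k}": sum(hit) / n,
        "score_pass_rate": sum(score_ok) / n,
        "verdict_match_rate": sum(verdict_ok) / n,
        "case_pass_rate": sum(case_ok) / n,
        "mean_best_score": sum(r.best_score for r in results) / n,
    }
```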

Current results

25-case KISS eval run using Cohere command-a-03-2025 for both decomposition and verification:

| Metric | Value |
| --- | --- |
| Cases run | 25 / 25 |
| Errors | 0 |
| Retrieval hit@3 | 0.76 |
| Score pass rate | 0.96 |
| Verdict match rate | 0.72 |
| Case pass rate | 0.76 |
| Mean best score | 0.90 |
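
To make the rates easier to reason about, they can be converted back to case counts. The snippet assumes each rate was computed over all 25 cases; the dictionary keys are just labels for the rows above.

```python
CASES_RUN = 25

# Rates copied from the results table above.
reported_rates = {
    "retrieval_hit@3": 0.76,
    "score_pass_rate": 0.96,
    "verdict_match_rate": 0.72,
    "case_pass_rate": 0.76,
}

# With 25 cases every rate is a multiple of 0.04, so rounding is exact.
implied_counts = {
    name: round(rate * CASES_RUN) for name, rate in reported_rates.items()
}

for name, count in implied_counts.items():
    print(f"{name}: {count} / {CASES_RUN}")
```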

Notes / limitations

  • Main bottleneck is retrieval. Some failures are exact-chunk misses where the system retrieves a semantically similar NICE chunk or same-topic guidance, but not the expected source chunk.
  • The KISS eval set is intentionally simple, and some cases are generic or noisy. The generator needs refinement and further experimentation.
  • Recommended next eval improvement: add same_doc_hit as a secondary metric alongside strict exact-chunk hit.
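
The proposed same_doc_hit metric could look something like the sketch below, shown next to the strict exact-chunk check for comparison. It assumes a chunk-to-document lookup is available; the mapping, function names, and IDs are hypothetical.

```python
from typing import Mapping, Sequence


def exact_chunk_hit(
    expected_chunk_id: str, retrieved_chunk_ids: Sequence[str], k: int = 3
) -> bool:
    """Strict metric: the expected chunk itself is in the top-k."""
    return expected_chunk_id in retrieved_chunk_ids[:k]


def same_doc_hit(
    expected_chunk_id: str,
    retrieved_chunk_ids: Sequence[str],
    chunk_to_doc: Mapping[str, str],
    k: int = 3,
) -> bool:
    """Relaxed metric: any top-k chunk comes from the expected chunk's document."""
    expected_doc = chunk_to_doc.get(expected_chunk_id)
    if expected_doc is None:
        return False  # unknown chunk: cannot credit a same-document hit
    return any(
        chunk_to_doc.get(chunk_id) == expected_doc
        for chunk_id in retrieved_chunk_ids[:k]
    )


# Illustrative IDs only: two chunks from doc-A, one from doc-B.
chunk_to_doc = {"a-1": "doc-A", "a-2": "doc-A", "b-1": "doc-B"}
retrieved = ["a-2", "b-1"]
```

Reporting both numbers side by side would separate near misses within the right guideline from retrievals that land in a different guideline altogether.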

How to run

1. Set the baseline package path

$env:PYTHONPATH = "baseline"

2. Run the test suite

Remove-Item Env:AMFV_RUN_LLM_TESTS -ErrorAction SilentlyContinue
Remove-Item Env:AMFV_LLM_TEST_PROVIDERS -ErrorAction SilentlyContinue
uv run pytest
uv run ruff check .

Expected:

  • 42+ tests passed
  • Integration tests skipped unless explicitly enabled
  • ruff passed

3. Generate the KISS eval set

uv run python -m amfv_baseline.kiss_eval create `
  --chunks data/index/nice_chunks.jsonl `
  --output data/eval/kiss_nice_eval.jsonl `
  --max-cases 25 `
  --max-per-title 2 `
  --max-per-claim-type 8

4. Inspect the eval set

uv run python -m amfv_baseline.kiss_eval inspect `
  --eval data/eval/kiss_nice_eval.jsonl `
  --limit 5

5. Run the full KISS eval

$env:PYTHONPATH = "baseline"
$env:COHERE_API_KEY = "<key>"

# Use a custom variable name: $args is a PowerShell automatic variable
# and should not be reassigned.
$evalArgs = @(
  "-m", "amfv_baseline.run_kiss_eval",
  "--eval", "data/eval/kiss_nice_eval.jsonl",
  "--output-jsonl", "data/eval/runs/kiss_nice_eval_results.jsonl",
  "--summary-json", "data/eval/runs/kiss_nice_eval_summary.json",
  "--decomposer-model", "command-a-03-2025",
  "--verifier-model", "command-a-03-2025",
  "--top-k", "3",
  "--cache", "data/cache/kiss_eval_cohere.sqlite",
  "--reset-cache",
  "--sleep-seconds", "1"
)

# Splat the array so each element is passed as a separate argument.
uv run python @evalArgs

Next steps

  • Compare Dense+BM25 against the ColBERT/NextPlaid path on the same KISS eval set
  • Add same_doc_hit as a secondary retrieval metric
  • Improve KISS eval filtering to remove generic/noisy cases
  • Improve retrieval setups, calibration and confidence estimation
