Draft: Add NICE-based AMFV MVP baseline with Dense+BM25 retrieval and KISS eval by AdepojuJeremy · Pull Request #3 · MedARC-AI/amfv

AdepojuJeremy · 2026-06-07T23:50:47Z

feat(baseline): AMFV MVP pipeline with NICE KISS eval

First end-to-end AMFV MVP baseline, built on the NICE subset of epfl-llm/guidelines.

Summary

This PR implements the full AMFV baseline pipeline from ingestion through evaluation:

Component	Implementation
Ingestion	NICE ingestion, preprocessing, JSONL export, document chunking
Retrieval	BM25 + Dense (Qwen3 embeddings via Sentence Transformers) with reciprocal-rank fusion
Cache	SQLite verified-claim cache
Pipeline	Cache-first baseline pipeline
Decomposer	LiteLLM-based FActScore-style claim decomposer
Verifier	LiteLLM-based Med-V1-style verifier
Eval	KISS eval generation from NICE chunks + full eval runner with summary metrics
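The cache-first flow listed in the table could look roughly like the sketch below. This is an illustration only: the table name, columns, and the `verify` callable are assumptions, since the PR description does not show the real schema or pipeline interface.

```python
from __future__ import annotations

import sqlite3
from typing import Callable, Optional

# Hypothetical schema: the real cache layout is not shown in this PR description.
SCHEMA = """
CREATE TABLE IF NOT EXISTS verified_claims (
    claim   TEXT PRIMARY KEY,
    verdict TEXT NOT NULL,
    score   REAL NOT NULL
)
"""


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the claim cache and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def lookup(conn: sqlite3.Connection, claim: str) -> Optional[tuple]:
    """Return (verdict, score) for a cached claim, or None on a miss."""
    return conn.execute(
        "SELECT verdict, score FROM verified_claims WHERE claim = ?", (claim,)
    ).fetchone()


def verify_cache_first(
    conn: sqlite3.Connection,
    claim: str,
    verify: Callable[[str], tuple],
) -> tuple:
    """Check the cache first; call the (expensive) verifier only on a miss."""
    cached = lookup(conn, claim)
    if cached is not None:
        return cached
    verdict, score = verify(claim)
    conn.execute(
        "INSERT OR REPLACE INTO verified_claims (claim, verdict, score) VALUES (?, ?, ?)",
        (claim, verdict, score),
    )
    conn.commit()
    return (verdict, score)
```

Passing the verifier in as a callable keeps the sketch independent of any LLM provider and makes the cache path easy to test with a stub.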

Dense+BM25 was chosen as the Windows-friendly retrieval path. A separate ColBERT/NextPlaid path is being explored in parallel, and this baseline gives it a direct comparison point.
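For readers unfamiliar with reciprocal-rank fusion, here is a minimal sketch of how two ranked lists of chunk IDs can be merged. The function name and the constant `k=60` (a commonly used default for RRF) are assumptions; the PR's actual fusion code and parameters may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    Ties are broken by chunk ID so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused if top_n is None else fused[:top_n]


# Illustrative IDs only: one list from BM25, one from the dense retriever.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
top3 = reciprocal_rank_fusion([bm25_ranking, dense_ranking], top_n=3)
```

Because RRF uses only ranks, BM25 scores and embedding similarities never need to be normalized onto a common scale.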

Eval setup

The KISS eval set is source-grounded. Each case is generated from a NICE source sentence and stores:

input_text: the passage to verify
expected_claim: the atomic claim to check
expected_chunk_id: the source chunk that should be retrieved
expected_verdict: the expected label (strongly_supported)
source_metadata: guideline title, section, NICE ID

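As a rough illustration of the record shape described above, one eval case could be modelled and serialized to a JSONL line as follows. The field names come from the list; the class name, helper functions, and the keys inside `source_metadata` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class KissEvalCase:
    input_text: str          # passage to verify
    expected_claim: str      # atomic claim to check
    expected_chunk_id: str   # source chunk that should be retrieved
    expected_verdict: str = "strongly_supported"
    # Keys such as "title", "section", "nice_id" are assumed, not confirmed.
    source_metadata: dict = field(default_factory=dict)


def to_jsonl_line(case: KissEvalCase) -> str:
    """Serialize one case as a single JSON line."""
    return json.dumps(asdict(case), ensure_ascii=False)


def from_jsonl_line(line: str) -> KissEvalCase:
    """Parse one JSON line back into a case."""
    return KissEvalCase(**json.loads(line))
```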
The eval runner measures:

Metric	Description
`retrieval_hit@k`	Exact chunk retrieved in top-k results
`score_pass_rate`	Verifier confidence score above threshold
`verdict_match_rate`	Predicted verdict matches expected verdict
`case_pass_rate`	All of the above pass for a single case
`mean_best_score`	Average top verifier score across cases
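The metrics in the table can be computed from per-case results along these lines. This is a sketch under stated assumptions: `CaseResult`, its fields, and the `score_threshold` value are hypothetical, and the pass rule follows the strict "all of the above" reading. Under that reading `case_pass_rate` can never exceed the lowest of the three component rates, so the runner's actual rule, which is not shown in this description, may be looser.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CaseResult:
    expected_chunk_id: str
    retrieved_chunk_ids: list[str]   # ranked, best first
    expected_verdict: str
    predicted_verdict: str
    best_score: float                # top verifier score for the case


def summarize(results: list[CaseResult], k: int = 3, score_threshold: float = 0.5) -> dict:
    """Aggregate per-case outcomes into the summary metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    hit = [r.expected_chunk_id in r.retrieved_chunk_ids[:k] for r in results]
    score_ok = [r.best_score > score_threshold for r in results]
    verdict_ok = [r.predicted_verdict == r.expected_verdict for r in results]
    # Strict reading: a case passes only if all three checks pass.
    case_ok = [h and s and v for h, s, v in zip(hit, score_ok, verdict_ok)]
    return {
        "cases": n,
        f"retrieval_hit@{k}": sum(hit) / n,
        "score_pass_rate": sum(score_ok) / n,
        "verdict_match_rate": sum(verdict_ok) / n,
        "case_pass_rate": sum(case_ok) / n,
        "mean_best_score": sum(r.best_score for r in results) / n,
    }
```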

Current results

25-case KISS eval run using Cohere command-a-03-2025 for both decomposition and verification:

Metric	Value
Cases run	25 / 25
Errors	0
Retrieval hit@3	0.76
Score pass rate	0.96
Verdict match rate	0.72
Case pass rate	0.76
Mean best score	0.90

Notes / limitations

The main bottleneck is retrieval. Some failures are exact-chunk misses: the system retrieves a semantically similar NICE chunk or same-topic guidance, but not the expected source chunk.
The KISS eval set is intentionally simple, and some cases are generic or noisy. The generator needs refinement and further experimentation.
Recommended next eval improvement: add same_doc_hit as a secondary metric alongside strict exact-chunk hit.
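A possible shape for the proposed same_doc_hit metric, next to the strict exact-chunk check, is sketched below. The `chunk_to_doc` mapping from chunk ID to guideline ID is an assumption; how chunks actually reference their parent document is not shown in this PR description.

```python
def exact_chunk_hit(expected_chunk_id, retrieved_chunk_ids, k=3):
    """Strict metric: the expected chunk itself appears in the top-k."""
    return expected_chunk_id in retrieved_chunk_ids[:k]


def same_doc_hit(expected_chunk_id, retrieved_chunk_ids, chunk_to_doc, k=3):
    """Lenient metric: any top-k chunk comes from the expected chunk's document.

    chunk_to_doc is a hypothetical dict mapping chunk ID -> document ID.
    """
    expected_doc = chunk_to_doc.get(expected_chunk_id)
    if expected_doc is None:
        return False
    return any(chunk_to_doc.get(cid) == expected_doc for cid in retrieved_chunk_ids[:k])
```

Reporting both side by side would separate retrieval of the wrong passage within the right guideline from retrieval of the wrong guideline altogether.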

How to run

1. Set the baseline package path

$env:PYTHONPATH = "baseline"

2. Run the test suite

Remove-Item Env:AMFV_RUN_LLM_TESTS -ErrorAction SilentlyContinue
Remove-Item Env:AMFV_LLM_TEST_PROVIDERS -ErrorAction SilentlyContinue
uv run pytest
uv run ruff check .

Expected:

42+ tests passed
Integration tests skipped unless explicitly enabled
ruff passed

3. Generate the KISS eval set

uv run python -m amfv_baseline.kiss_eval create `
  --chunks data/index/nice_chunks.jsonl `
  --output data/eval/kiss_nice_eval.jsonl `
  --max-cases 25 `
  --max-per-title 2 `
  --max-per-claim-type 8
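One plausible reading of the three caps in the command above is a greedy selection that skips candidates once their title or claim type is full. The sketch below is an assumption about that behaviour, not the generator's actual code; the `title` and `claim_type` keys are hypothetical.

```python
from collections import Counter


def select_cases(candidates, max_cases=25, max_per_title=2, max_per_claim_type=8):
    """Greedily pick candidates in order while respecting all three caps."""
    selected = []
    per_title = Counter()
    per_type = Counter()
    for cand in candidates:
        if len(selected) >= max_cases:
            break
        title, claim_type = cand["title"], cand["claim_type"]
        if per_title[title] >= max_per_title:
            continue  # this guideline already has enough cases
        if per_type[claim_type] >= max_per_claim_type:
            continue  # this claim type already has enough cases
        selected.append(cand)
        per_title[title] += 1
        per_type[claim_type] += 1
    return selected
```

If the caps are hard limits as assumed here, a full 25-case set with these values needs at least 13 distinct guideline titles and at least 4 distinct claim types.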

4. Inspect the eval set

uv run python -m amfv_baseline.kiss_eval inspect `
  --eval data/eval/kiss_nice_eval.jsonl `
  --limit 5

5. Run the full KISS eval

$env:PYTHONPATH = "baseline"
$env:COHERE_API_KEY = "<key>"
 
# Named $evalArgs because $args is a PowerShell automatic variable.
$evalArgs = @(
  "-m", "amfv_baseline.run_kiss_eval",
  "--eval", "data/eval/kiss_nice_eval.jsonl",
  "--output-jsonl", "data/eval/runs/kiss_nice_eval_results.jsonl",
  "--summary-json", "data/eval/runs/kiss_nice_eval_summary.json",
  "--decomposer-model", "command-a-03-2025",
  "--verifier-model", "command-a-03-2025",
  "--top-k", "3",
  "--cache", "data/cache/kiss_eval_cohere.sqlite",
  "--reset-cache",
  "--sleep-seconds", "1"
)

uv run python @evalArgs

Next steps

Compare Dense+BM25 against the ColBERT/NextPlaid path on the same KISS eval set
Add same_doc_hit as a secondary retrieval metric
Improve KISS eval filtering to remove generic/noisy cases
Improve the retrieval setup, score calibration, and confidence estimation

CLAassistant · 2026-06-07T23:50:56Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

AdepojuJeremy added 8 commits June 8, 2026 00:11

chore(repo): ignore generated local artifacts

98f9c57

feat(datasets): add NICE ingestion from EPFL guidelines

9bfd476

feat(search): add BM25 and dense hybrid retrieval

596c3b1

feat(baseline): add cache-first AMFV pipeline

13da23c

feat(decomposer): add LiteLLM FActScore decomposer

14e5515

feat(verifier): add LiteLLM Med-V1 verifier

8913ffb

feat(eval): add NICE KISS eval runner

e4a6d98

chore(deps): update workspace dependencies and test config

2db8af4

AdepojuJeremy wants to merge 8 commits into MedARC-AI:main from AdepojuJeremy:jeremiah/nice-ingestion