decomposer: baseline wrappers (FActScore, MedScore, VeriScore) + evaluation by aymaneo · Pull Request #2 · MedARC-AI/amfv

aymaneo · 2026-06-07T20:34:28Z

Summary

First code contribution to the decomposer/ component: three baseline decomposer wrappers with a unified interface, an evaluation script, and results on AskDocsAI (300 records).

What's in this PR

- `amfv_decomposer/base.py`: `BaseDecomposer` ABC, `split_sentences` (spaCy), `sliding_window`, `parse_claims`
- `amfv_decomposer/vllm_client.py`: generic vLLM client supporting chat models (Qwen3-8B) and completion models; one cached instance per model
- `amfv_decomposer/hf_client.py`: HuggingFace client for PEFT/LoRA models (transformers + bitsandbytes), used by VeriScore-original
- `amfv_decomposer/baselines/factscore.py`: FActScore prompt, sentence-by-sentence, no context
- `amfv_decomposer/baselines/medscore.py`: MedScore prompt, full-response context per sentence
- `amfv_decomposer/baselines/veriscore.py`: VeriScore sliding-window prompt (3 before, 1 after), Qwen3-8B backbone
- `amfv_decomposer/baselines/veriscore_original.py`: VeriScore with original fine-tuned Mistral-7B (SYX/mistral_based_claim_extractor) via PEFT
- `evaluate.py`: CLI that runs any subset of decomposers and outputs a comparison table
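The sliding-window context used by the VeriScore baseline (3 sentences before, 1 after) can be sketched roughly as follows. This is a minimal illustration, not the code in `base.py`: the name `sliding_window` comes from the PR, but the signature, the `before`/`after` parameter names, and the return type are assumptions.

```python
from typing import List, Tuple


def sliding_window(
    sentences: List[str], before: int = 3, after: int = 1
) -> List[Tuple[str, str]]:
    """Pair each sentence with its surrounding context window.

    Hypothetical signature: returns (target_sentence, context) tuples, where
    context joins up to `before` preceding sentences, the target itself, and
    up to `after` following sentences.
    """
    windows = []
    for i, target in enumerate(sentences):
        start = max(0, i - before)                # clamp at the start of the response
        end = min(len(sentences), i + after + 1)  # clamp at the end of the response
        context = " ".join(sentences[start:end])
        windows.append((target, context))
    return windows


# Toy input standing in for the output of a sentence splitter.
sentences = ["A.", "B.", "C.", "D.", "E.", "F."]
windows = sliding_window(sentences)
```

Windows near the start or end of a response are simply shorter, so every sentence still gets exactly one prompt.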

Updated Results on AskDocsAI (300 records)

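| Method | Model | Claims/Response | Claims/Sentence | 0-claim rate | Paper Claims/Response | Paper Claims/Sentence | Paper 0-claim rate |
|---|---|---|---|---|---|---|---|
| FActScore | Qwen3-8B | 40.69 | 5.99 | 0.0% | 28.60 | 4.24 | 0% |
| MedScore | Qwen3-8B | 14.04 | 2.07 | 0.0% | 13.62 | 2.02 | 0% |
| VeriScore-prompt | Qwen3-8B | 6.94 | 1.02 | 0.3% | - | - | - |
| VeriScore-original | Mistral-7B-ft | 3.89 | 0.57 | 15.7% | 3.87 | 0.57 | 14.67% |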

Paper models: FActScore used InstructGPT (text-davinci-003); MedScore used GPT-4o-mini; VeriScore used a fine-tuned Mistral-7B-Instruct-v0.2 (SYX/mistral_based_claim_extractor). All prompted decomposers here use Qwen3-8B for a fair prompt comparison on the same backbone. VeriScore-original uses the original fine-tuned Mistral for exact replication.

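MedScore and VeriScore-original closely replicate the paper numbers (14.04 vs. 13.62 and 3.89 vs. 3.87 claims per response), which supports the correctness of the implementations. FActScore produces more claims with Qwen3-8B (40.69 vs. 28.60 per response), likely because Qwen3-8B is more verbose than InstructGPT on medical text.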
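For reference, the three reported metrics could be computed along these lines. This is a sketch under stated assumptions, not the logic in `evaluate.py`: the input layout and the `summarize` helper are hypothetical, and the 0-claim rate is assumed to be the share of responses with no extracted claims (the 0.3% figure on 300 records is consistent with a single empty response).

```python
from typing import Dict, List


def summarize(claims_per_sentence: List[List[List[str]]]) -> Dict[str, float]:
    """Aggregate decomposer output into the three reported metrics.

    Assumed input layout: one entry per response, each holding one list of
    claims per sentence.
    """
    n_responses = len(claims_per_sentence)
    n_sentences = sum(len(resp) for resp in claims_per_sentence)
    claims_by_response = [sum(len(s) for s in resp) for resp in claims_per_sentence]
    total_claims = sum(claims_by_response)
    return {
        "claims_per_response": total_claims / n_responses,
        "claims_per_sentence": total_claims / n_sentences,
        # Assumed definition: share of responses with no claims at all.
        "zero_claim_rate": sum(c == 0 for c in claims_by_response) / n_responses,
    }


# Toy data: two responses of two sentences each; the second yields no claims.
toy = [
    [["claim a", "claim b"], ["claim c"]],
    [[], []],
]
stats = summarize(toy)
```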

CLAassistant · 2026-06-07T20:34:34Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ aymaneo
❌ Aymane Ouraq

Aymane Ouraq seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

- hf_client: lazy imports with `Any` typing and `type: ignore` for the optional peft dependency
- pyproject.toml: add transformers to the [hf] optional extras
- base.py: combine two re.sub calls into one regex pass in parse_claims

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
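The lazy-import approach for the optional peft dependency might look like the sketch below. It illustrates the general pattern only: `load_optional` and `LazyHFClient` are hypothetical names, and the install hint assumes the `[hf]` extras mentioned above.

```python
import importlib
from typing import Any


def load_optional(module_name: str, extra: str = "hf") -> Any:
    """Import an optional dependency on first use, with a helpful error."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{module_name!r} is required for this client; "
            f"install the optional extras, e.g. `pip install '.[{extra}]'`"
        ) from exc


class LazyHFClient:
    """Hypothetical client: peft is imported only when a model is loaded."""

    def __init__(self, adapter_id: str) -> None:
        self.adapter_id = adapter_id
        self._peft: Any = None  # resolved lazily, typed as Any

    def _ensure_backend(self) -> Any:
        if self._peft is None:
            self._peft = load_optional("peft")
        return self._peft


# Constructing the client does not import peft, so importing this module
# works even when the optional extras are not installed.
client = LazyHFClient("SYX/mistral_based_claim_extractor")
```

With this layout, decomposers that only need the vLLM client never touch peft, and the `ModuleNotFoundError` surfaces only when the HuggingFace path is actually used.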

- VLLM_MODEL env var selects the served model (default: Qwen/Qwen3-8B)
- VLLM_TP defaults to $SLURM_GPUS_PER_TASK, so TP always matches the SLURM allocation automatically; no manual flag needed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
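The environment-variable defaults described above can be expressed as a small resolver. This is a Python sketch for illustration; the PR most likely implements it in the SLURM shell script. `resolve_vllm_settings` is a hypothetical helper, the final fallback to 1 is an assumption, and `SLURM_GPUS_PER_TASK` is assumed to hold a plain integer.

```python
import os
from typing import Mapping


def resolve_vllm_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve the served model and tensor-parallel size from the environment."""
    model = env.get("VLLM_MODEL", "Qwen/Qwen3-8B")
    # Precedence: explicit VLLM_TP, then the SLURM allocation, then 1 (assumed).
    tp = env.get("VLLM_TP") or env.get("SLURM_GPUS_PER_TASK") or "1"
    return {"model": model, "tensor_parallel_size": int(tp)}


# Simulated job environment with a 4-GPU allocation and no overrides.
settings = resolve_vllm_settings({"SLURM_GPUS_PER_TASK": "4"})

# An explicit VLLM_TP takes precedence over the allocation.
override = resolve_vllm_settings(
    {"VLLM_MODEL": "openai/gpt-oss-120b", "VLLM_TP": "2", "SLURM_GPUS_PER_TASK": "4"}
)
```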

aymaneo · 2026-06-09T17:10:30Z

Model evaluation results - AskDocs (300 records)

New runs on two larger models (Qwen/Qwen3.6-35B-A3B and openai/gpt-oss-120b), using the same evaluation pipeline and dataset as the initial Qwen3-8B run. The Paper column reports the numbers obtained with each method's original model (FActScore: InstructGPT, MedScore: GPT-4o-mini, VeriScore: fine-tuned Mistral-7B).

Claims/Response

| Method | Paper | Qwen3-8B | Qwen3.6-35B-A3B | gpt-oss-120b |
|---|---|---|---|---|
| FActScore | 28.60 | 40.69 | 23.54 | 44.64 |
| MedScore | 13.62 | 14.04 | 10.45 | 12.38 |
| VeriScore | 3.87 | 6.94 | 3.15 | 5.30 |

Claims/Sentence

| Method | Paper | Qwen3-8B | Qwen3.6-35B-A3B | gpt-oss-120b |
|---|---|---|---|---|
| FActScore | 4.24 | 5.99 | 3.47 | 6.57 |
| MedScore | 2.02 | 2.07 | 1.54 | 1.82 |
| VeriScore | 0.57 | 1.02 | 0.46 | 0.78 |

0-claim rate

| Method | Paper | Qwen3-8B | Qwen3.6-35B-A3B | gpt-oss-120b |
|---|---|---|---|---|
| FActScore | 0.0% | 0.0% | 0.0% | 0.0% |
| MedScore | 0.0% | 0.0% | 0.0% | 0.0% |
| VeriScore | 14.67% | 0.3% | 28.3% | 7.3% |
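To compare backbones against the paper at a glance, the relative deviation in Claims/Response can be derived from the numbers already reported. The snippet below is only a convenience sketch; `relative_diff` is a hypothetical helper and the values are copied from the Claims/Response table.

```python
from typing import Dict

# Claims/Response values copied from the Claims/Response table in this comment.
PAPER = {"FActScore": 28.60, "MedScore": 13.62, "VeriScore": 3.87}
RUNS: Dict[str, Dict[str, float]] = {
    "Qwen3-8B": {"FActScore": 40.69, "MedScore": 14.04, "VeriScore": 6.94},
    "Qwen3.6-35B-A3B": {"FActScore": 23.54, "MedScore": 10.45, "VeriScore": 3.15},
    "gpt-oss-120b": {"FActScore": 44.64, "MedScore": 12.38, "VeriScore": 5.30},
}


def relative_diff(measured: float, reference: float) -> float:
    """Signed deviation of a measured value from the paper's value."""
    return (measured - reference) / reference


deviation = {
    model: {method: relative_diff(value, PAPER[method]) for method, value in row.items()}
    for model, row in RUNS.items()
}

for model, row in deviation.items():
    print(model, {method: f"{d:+.1%}" for method, d in row.items()})
```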

aymaneo added 17 commits June 7, 2026 15:24

Add decomposer base utilities and vLLM client

9673ccb

Add decomposer baseline implementations

fdbeccc

Add decomposer evaluation CLI

2ab0b6c

Update decomposer package config and setup docs

b823486

Add SLURM script for decomposer evaluation

3119cd6

add AskDocsAI evaluation datasets

41fa3f2

Update decomposer SLURM eval script

ed72135

exclude n-8

20299bd

lower python requirement to 3.12 for vllm compat

bd2a0ba

switch to singularity vllm server + openai client

4123dfc

pin vllm 0.17.x, remove singularity, direct vLLM

4ef6fb8

add baseline eval results on AskDocsAI (Qwen3-8B, 300 records)

0d40348

add veriscore_original (Mistral ft), make vllm_client generic

e2057ca

add hf_client for VeriScore-original (peft + bitsandbytes)

4201103

fix hf_client: load mistral base explicitly, apply LoRA adapter separately

7d489c3

fix pad_token for Mistral tokenizer batching

c32e0d2

add VeriScore-original results

cbf96c6

aymaneo force-pushed the decomposer-setup branch from b0daf5e to cbf96c6 on June 7, 2026 20:38

aymaneo added 11 commits June 8, 2026 20:11

use pyxis vllm container instead of venv

fd47e10

fix slurm log paths, annotation key, hf_client prompt stripping

5aa70ed

add en_core_web_sm as pyproject dependency

bf10334

set NVIDIA_VISIBLE_DEVICES from CUDA_VISIBLE_DEVICES for Pyxis GPU access

4499f2e

debug: print CUDA_VISIBLE_DEVICES, remove CUDA_DEVICE_ORDER

d80ad3c

remove manual NVIDIA_VISIBLE_DEVICES

8cb8366

minimal slurm config

dfa6ff0

restore --ntasks=1 required by cluster for --gpus-per-task

34c53c7

use /data/hf_cache shared cache

f3e2ead

mount /data/hf_cache into container

56ef84f

fork multiprocessing before vLLM import to fix Pyxis GPU subprocess

7b09d46

aymaneo and others added 15 commits June 9, 2026 13:44

switch to vllm server mode to fix Pyxis CUDA subprocess failure

9bc564e

concurrent requests, cleaned code and added optional hf deps

463b731

lazy imports in hf_client to fix peft ModuleNotFoundError

4980672

Add question context support to decomposers and eval CLI

0aa7603

Add vLLM retries and document Mistral base model caveat

026ee55

Fix SLURM eval scripts

8ae3699

increase context window and max_tokens

755bfce

version results by model and dataset

c216842

fix VLLM_TP (not propagated by Pyxis)

543c5f2

fix

5636426

increase vLLM health-check timeout

4305ddf

cap vLLM concurrent requests to 32 to avoid v1 race condition KeyError

c2b0efe

add evaluation results: Qwen3-8B, Qwen3.6-35B-A3B, gpt-oss-120b on AskDocs

f18f733

aymaneo added 2 commits June 9, 2026 20:25

Add versioned eval outputs, thinking mode, and tests

b471397

Require --model in run_eval.sh and forward eval args

1980a70

warner-benjamin marked this pull request as draft June 9, 2026 18:41

aymaneo and others added 7 commits June 9, 2026 20:47

Update decomposer README and fix run_eval.sh data path handling

6c5165b

fix think

ba402b3

Merge branch 'MedARC-AI:main' into decomposer-setup

1ba7573

Refactor decomposers to a shared pipeline

033df98

Improve VeriScore baseline and claim deduplication

ac3ea59

align MedScore and VeriScore baselines with reference paper prompts

a5177b7

Replace results with latest runs

8595c85

aymaneo wants to merge 52 commits into MedARC-AI:main from aymaneo:decomposer-setup


2 participants
