
decomposer: baseline wrappers (FActScore, MedScore, VeriScore) + evaluation #2

Draft
aymaneo wants to merge 52 commits into MedARC-AI:main from aymaneo:decomposer-setup

aymaneo commented Jun 7, 2026

Summary

First code contribution to the decomposer/ component: three baseline decomposer wrappers with a unified interface, an evaluation script, and results on AskDocsAI (300 records).

What's in this PR

  • amfv_decomposer/base.py - BaseDecomposer ABC, split_sentences (spaCy), sliding_window, parse_claims
  • amfv_decomposer/vllm_client.py - generic vLLM client supporting chat models (Qwen3-8B) and completion models; one cached instance per model
  • amfv_decomposer/hf_client.py - HuggingFace client for PEFT/LoRA models (transformers + bitsandbytes), used by VeriScore-original
  • amfv_decomposer/baselines/factscore.py - FActScore prompt, sentence-by-sentence, no context
  • amfv_decomposer/baselines/medscore.py - MedScore prompt, full-response context per sentence
  • amfv_decomposer/baselines/veriscore.py - VeriScore sliding-window prompt (3 before, 1 after), Qwen3-8B backbone
  • amfv_decomposer/baselines/veriscore_original.py - VeriScore with original fine-tuned Mistral-7B (SYX/mistral_based_claim_extractor) via PEFT
  • evaluate.py - CLI that runs any subset of decomposers and outputs a comparison table
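To make the "unified interface" and the VeriScore context window (3 sentences before, 1 after) concrete, here is a minimal sketch. It is not the code in this PR: the method names `decompose_sentence` and `decompose`, the signature of `sliding_window`, and the toy `EchoWindowDecomposer` are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


def sliding_window(sentences: list[str], index: int, before: int = 3, after: int = 1) -> str:
    """Join the target sentence with up to `before` preceding and `after` following sentences.

    Assumed signature; the window is clipped at the start and end of the response.
    """
    start = max(0, index - before)
    end = min(len(sentences), index + after + 1)
    return " ".join(sentences[start:end])


class BaseDecomposer(ABC):
    """Hypothetical unified interface: a list of sentences in, a flat list of claims out."""

    @abstractmethod
    def decompose_sentence(self, sentences: list[str], index: int) -> list[str]:
        """Return the claims extracted for sentences[index]."""

    def decompose(self, sentences: list[str]) -> list[str]:
        # Visit every sentence and concatenate the per-sentence claims.
        claims: list[str] = []
        for i in range(len(sentences)):
            claims.extend(self.decompose_sentence(sentences, i))
        return claims


class EchoWindowDecomposer(BaseDecomposer):
    """Toy stand-in for an LLM-backed decomposer: returns the context window as one 'claim'."""

    def decompose_sentence(self, sentences: list[str], index: int) -> list[str]:
        return [sliding_window(sentences, index)]
```

A real subclass would build a prompt from the window and call a model client in `decompose_sentence`; the base class then gives every baseline the same `decompose` entry point for the evaluation script.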

Updated Results on AskDocsAI (300 records)

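| Method | Model | Claims/Response | Claims/Sentence | 0-claim rate | Paper Claims/Response | Paper Claims/Sentence | Paper 0-claim rate |
|---|---|---|---|---|---|---|---|
| FActScore | Qwen3-8B | 40.69 | 5.99 | 0.0% | 28.60 | 4.24 | 0% |
| MedScore | Qwen3-8B | 14.04 | 2.07 | 0.0% | 13.62 | 2.02 | 0% |
| VeriScore-prompt | Qwen3-8B | 6.94 | 1.02 | 0.3% | - | - | - |
| VeriScore-original | Mistral-7B-ft | 3.89 | 0.57 | 15.7% | 3.87 | 0.57 | 14.67% |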

Paper models: FActScore used InstructGPT (text-davinci-003); MedScore used GPT-4o-mini; VeriScore used a fine-tuned Mistral-7B-Instruct-v0.2 (SYX/mistral_based_claim_extractor). All prompted decomposers here use Qwen3-8B for a fair prompt comparison on the same backbone. VeriScore-original uses the original fine-tuned Mistral for exact replication.

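MedScore and VeriScore-original closely replicate the paper numbers (14.04 vs. 13.62 and 3.89 vs. 3.87 claims/response), which suggests the implementations are correct. FActScore produces more claims with Qwen3-8B than in the paper (40.69 vs. 28.60 claims/response), likely because Qwen3-8B is more verbose than InstructGPT on medical text.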
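For readers checking the table, the three metrics can be computed roughly as sketched below. The exact definitions used by evaluate.py are not shown in this thread, so the ones here are assumptions: claims/sentence is total claims over total sentences, and the 0-claim rate is the fraction of responses that yield no claims at all. The function name `summarize` and the nested-list input layout are hypothetical.

```python
def summarize(records: list[list[list[str]]]) -> dict[str, float]:
    """Compute corpus-level decomposition statistics.

    records[r][s] is the list of claims extracted from sentence s of response r.
    Assumed definitions (not confirmed by the PR):
      - claims_per_response: total claims / number of responses
      - claims_per_sentence: total claims / total number of sentences
      - zero_claim_rate:     fraction of responses with no claims
    """
    n_responses = len(records)
    n_sentences = sum(len(response) for response in records)
    claims_per_response = [sum(len(claims) for claims in response) for response in records]
    total_claims = sum(claims_per_response)
    return {
        "claims_per_response": total_claims / n_responses,
        "claims_per_sentence": total_claims / n_sentences,
        "zero_claim_rate": sum(1 for n in claims_per_response if n == 0) / n_responses,
    }
```

Under these definitions, a 0.3% 0-claim rate on 300 records corresponds to a single response with no claims.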


CLAassistant commented Jun 7, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ aymaneo
❌ Aymane Ouraq


Aymane Ouraq seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

aymaneo force-pushed the decomposer-setup branch from b0daf5e to cbf96c6 on June 7, 2026 20:38
aymaneo and others added 15 commits June 9, 2026 13:44
  hf_client to fix peft ModuleNotFoundError
- hf_client: lazy imports with Any typing and type: ignore for optional peft dep
- pyproject.toml: add transformers to [hf] optional extras
- base.py: combine two re.sub calls into one regex pass in parse_claims

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
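
The lazy-import fix described in this commit can be sketched as follows. This is an illustration of the general pattern only, not the code in hf_client: the helper name `lazy_import` and the wording of the error message are assumptions.

```python
import importlib
from typing import Any


def lazy_import(module_name: str, extra: str) -> Any:
    """Import an optional dependency only when it is actually needed.

    Returns the module typed as Any so static checkers do not require the
    optional package to be installed. `extra` names the optional extras
    group to mention in the error (hypothetical helper, for illustration).
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"'{module_name}' is required for this client; "
            f"install the optional '{extra}' extra to use it."
        ) from exc
```

Calling `lazy_import("peft", "hf")` inside the client's constructor, rather than importing at module top level, keeps the rest of the package importable when peft is not installed.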
- VLLM_MODEL env var selects the served model (default: Qwen/Qwen3-8B)
- VLLM_TP defaults to $SLURM_GPUS_PER_TASK so TP always matches the
  SLURM allocation automatically — no manual flag needed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
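
The environment-variable defaults in this commit amount to the following resolution order, shown here as a Python sketch. The launch script itself is not visible in this thread, so the function name `resolve_vllm_settings` is hypothetical, and the final fallback to a tensor-parallel size of 1 when neither variable is set is an assumption.

```python
import os
from typing import Mapping


def resolve_vllm_settings(env: Mapping[str, str] = os.environ) -> tuple[str, int]:
    """Resolve the served model and tensor-parallel size from the environment.

    VLLM_MODEL defaults to Qwen/Qwen3-8B. VLLM_TP defaults to
    SLURM_GPUS_PER_TASK; falling back to 1 when both are unset is an
    assumption of this sketch, not something the commit states.
    """
    model = env.get("VLLM_MODEL", "Qwen/Qwen3-8B")
    tensor_parallel = int(env.get("VLLM_TP") or env.get("SLURM_GPUS_PER_TASK") or 1)
    return model, tensor_parallel
```

An explicit VLLM_TP always wins, so a run can still override the SLURM allocation when needed.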

aymaneo commented Jun 9, 2026

Model evaluation results - AskDocsAI (300 records)

New runs on two larger models (Qwen/Qwen3.6-35B-A3B and openai/gpt-oss-120b) using the same evaluation pipeline and dataset as the initial Qwen3-8B run. The Paper column uses each method's original model (FActScore: InstructGPT, MedScore: GPT-4o-mini, VeriScore: fine-tuned Mistral-7B).

Claims/Response

| Method | Paper | Qwen3-8B | Qwen3.6-35B-A3B | gpt-oss-120b |
|---|---|---|---|---|
| FActScore | 28.60 | 40.69 | 23.54 | 44.64 |
| MedScore | 13.62 | 14.04 | 10.45 | 12.38 |
| VeriScore | 3.87 | 6.94 | 3.15 | 5.30 |

Claims/Sentence

| Method | Paper | Qwen3-8B | Qwen3.6-35B-A3B | gpt-oss-120b |
|---|---|---|---|---|
| FActScore | 4.24 | 5.99 | 3.47 | 6.57 |
| MedScore | 2.02 | 2.07 | 1.54 | 1.82 |
| VeriScore | 0.57 | 1.02 | 0.46 | 0.78 |

0-claim rate

| Method | Paper | Qwen3-8B | Qwen3.6-35B-A3B | gpt-oss-120b |
|---|---|---|---|---|
| FActScore | 0.0% | 0.0% | 0.0% | 0.0% |
| MedScore | 0.0% | 0.0% | 0.0% | 0.0% |
| VeriScore | 14.67% | 0.3% | 28.3% | 7.3% |

warner-benjamin marked this pull request as draft on June 9, 2026 18:41