decomposer: baseline wrappers (FActScore, MedScore, VeriScore) + evaluation #2
Draft
aymaneo wants to merge 52 commits into
to 3.12 for vllm compat
bitsandbytes)
apply LoRA adapter separately
dependency
NVIDIA_VISIBLE_DEVICES
hf_client to fix peft ModuleNotFoundError
- hf_client: lazy imports with Any typing and type: ignore for optional peft dep
- pyproject.toml: add transformers to [hf] optional extras
- base.py: combine two re.sub calls into one regex pass in parse_claims

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
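The lazy-import pattern named in the first bullet can be sketched as follows. This is a minimal illustration, not the code in hf_client: the function name `load_peft_model` and the error message are hypothetical, and only `PeftModel.from_pretrained` is a real peft call.

```python
from typing import Any


def load_peft_model(base_model: Any, adapter_id: str) -> Any:
    """Attach a LoRA adapter, importing peft only when it is actually needed.

    Hypothetical helper: the name and signature are assumptions made for
    illustration, not the hf_client API from this PR.
    """
    try:
        # Deferred import: modules that never call this function can be
        # imported on machines where the optional dependency is missing.
        from peft import PeftModel  # type: ignore
    except ImportError as exc:
        raise RuntimeError(
            "peft is required for LoRA models; install the optional extras"
        ) from exc
    return PeftModel.from_pretrained(base_model, adapter_id)
```

Because the import sits inside the function, a missing peft only fails at the point where a PEFT model is requested, rather than when the package is first imported.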
- VLLM_MODEL env var selects the served model (default: Qwen/Qwen3-8B)
- VLLM_TP defaults to $SLURM_GPUS_PER_TASK, so TP always matches the SLURM allocation automatically and no manual flag is needed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
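The defaulting rules in this commit message can be sketched in Python. This is an illustration only: the launch script itself is not shown here, `resolve_vllm_settings` is a hypothetical name, the fallback to 1 when neither variable is set is an assumption, and SLURM_GPUS_PER_TASK is assumed to hold a plain integer.

```python
import os
from typing import Mapping, Tuple

DEFAULT_MODEL = "Qwen/Qwen3-8B"  # default stated in the commit message


def resolve_vllm_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Return (model, tensor_parallel_size) from environment variables.

    Hypothetical helper. Precedence: VLLM_TP if set, otherwise
    SLURM_GPUS_PER_TASK, otherwise 1 (the last fallback is an assumption).
    """
    model = env.get("VLLM_MODEL") or DEFAULT_MODEL
    tp_raw = env.get("VLLM_TP") or env.get("SLURM_GPUS_PER_TASK") or "1"
    return model, int(tp_raw)
```

For example, with only SLURM_GPUS_PER_TASK=4 set, the sketch returns the default model and a tensor-parallel size of 4.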
Model evaluation results - AskDocs (300 records)
New runs on two larger models (
[Table columns: Claims/Response, Claims/Sentence, 0-claim rate]
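One way the three reported columns could be computed is sketched below. The definitions are assumptions, since the PR does not spell them out: Claims/Response is taken as the mean number of claims per response, Claims/Sentence as the mean per sentence, and 0-claim rate as the fraction of responses that yield no claims at all. The function name `summarize_claims` and the input layout are hypothetical.

```python
from typing import Dict, List

# One response = list of sentences; one sentence = list of extracted claims.
Response = List[List[str]]


def summarize_claims(records: List[Response]) -> Dict[str, float]:
    """Aggregate claim counts over a set of decomposed responses.

    Assumed definitions (not confirmed by the PR): see the lead-in above.
    """
    n_responses = len(records)
    n_sentences = sum(len(response) for response in records)
    per_response = [sum(len(claims) for claims in response) for response in records]
    total_claims = sum(per_response)
    zero_responses = sum(1 for count in per_response if count == 0)
    return {
        "claims_per_response": total_claims / n_responses if n_responses else 0.0,
        "claims_per_sentence": total_claims / n_sentences if n_sentences else 0.0,
        "zero_claim_rate": zero_responses / n_responses if n_responses else 0.0,
    }
```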
Summary
First code contribution to the decomposer/ component - three baseline decomposer wrappers with a unified interface, an evaluation script, and results on AskDocsAI (300 records).

What's in this PR
- amfv_decomposer/base.py - BaseDecomposer ABC, split_sentences (spaCy), sliding_window, parse_claims
- amfv_decomposer/vllm_client.py - generic vLLM client supporting chat models (Qwen3-8B) and completion models; one cached instance per model
- amfv_decomposer/hf_client.py - HuggingFace client for PEFT/LoRA models (transformers + bitsandbytes), used by VeriScore-original
- amfv_decomposer/baselines/factscore.py - FActScore prompt, sentence-by-sentence, no context
- amfv_decomposer/baselines/medscore.py - MedScore prompt, full-response context per sentence
- amfv_decomposer/baselines/veriscore.py - VeriScore sliding-window prompt (3 before, 1 after), Qwen3-8B backbone
- amfv_decomposer/baselines/veriscore_original.py - VeriScore with original fine-tuned Mistral-7B (SYX/mistral_based_claim_extractor) via PEFT
- evaluate.py - CLI that runs any subset of decomposers and outputs a comparison table

Updated Results on AskDocsAI (300 records)
Paper models: FActScore used InstructGPT (text-davinci-003); MedScore used GPT-4o-mini; VeriScore used a fine-tuned Mistral-7B-Instruct-v0.2 (SYX/mistral_based_claim_extractor). All prompted decomposers here use Qwen3-8B for a fair prompt comparison on the same backbone. VeriScore-original uses the original fine-tuned Mistral for exact replication.

MedScore and VeriScore-original replicate paper numbers closely, confirming implementation correctness. FActScore is higher with Qwen3-8B (more verbose than InstructGPT on medical text).
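For reference, the sliding-window context used by the VeriScore wrapper (3 sentences before, 1 after) can be sketched as below. This is a minimal illustration under assumptions: `context_window` is a hypothetical name and signature, not the sliding_window helper in amfv_decomposer/base.py, and clipping at the response boundaries is assumed.

```python
from typing import List, Tuple


def context_window(
    sentences: List[str], index: int, before: int = 3, after: int = 1
) -> Tuple[List[str], str, List[str]]:
    """Return (preceding context, target sentence, following context).

    Hypothetical sketch: the window is clipped at the start and end of the
    response, so early and late sentences simply get less context.
    """
    start = max(0, index - before)
    end = min(len(sentences), index + after + 1)
    return sentences[start:index], sentences[index], sentences[index + 1:end]


if __name__ == "__main__":
    sents = ["s0", "s1", "s2", "s3", "s4", "s5"]
    for i in range(len(sents)):
        print(context_window(sents, i))
```

By contrast, the FActScore wrapper as described passes each sentence with no context, and the MedScore wrapper passes the full response alongside each sentence.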