AI Summary
This work addresses the challenge of hallucinations in large language model-generated content in the biomedical domain by introducing Med-V1, a lightweight language model family with only 3 billion parameters. By constructing high-quality synthetic data and unifying five biomedical verification tasks into an evidence attribution format, Med-V1 outperforms its base models by 27.0%–71.3% across multiple benchmarks, approaching the performance of GPT-5. This study is the first demonstration that a small-scale model can achieve both high-fidelity interpretability and scalable deployment for biomedical claim verification. It also pioneers two novel applications: quantifying hallucination rates under varying citation instructions and automatically detecting high-risk evidence misuse in clinical practice guidelines.
Abstract
Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To perform biomedical evidence attribution efficiently, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5 while also producing high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the citation-format instruction strongly affects citation validity and hallucination rates, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to detect at scale. Overall, Med-V1 provides an efficient, accurate, and lightweight alternative to frontier LLMs for practical, real-world applications in biomedical evidence attribution and verification. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.