🤖 AI Summary
Medical large language models (LLMs) frequently produce factual hallucinations, yet existing evaluation benchmarks lack fine-grained, automated assessment of factual accuracy. To address this, we propose FActBench, the first fine-grained, automated factuality benchmark tailored to medical text generation, covering four medical generation tasks and six mainstream LLMs. Methodologically, FActBench combines chain-of-thought prompting and natural language inference (NLI) through a unanimous voting mechanism, validated against expert physician annotations. Experiments show strong agreement between FActBench's automated scores and expert judgments (Spearman's ρ > 0.92), significantly outperforming baseline approaches, and the framework remains robust and consistent across diverse models and tasks. FActBench thus provides a reproducible, scalable, and standardized framework for assessing factual fidelity and mitigating hallucinations in medical LLMs.
📝 Abstract
Large Language Models (LLMs) tend to struggle in specialized domains. While all aspects of evaluation are important, factuality is the most critical; likewise, reliable fact-checking tools and data sources are essential for hallucination mitigation. We address these issues by providing FActBench, a comprehensive fact-checking benchmark covering four generation tasks and six state-of-the-art LLMs in the medical domain. We use two state-of-the-art fact-checking techniques: Chain-of-Thought (CoT) prompting and Natural Language Inference (NLI). Our experiments show that fact-checking scores obtained through Unanimous Voting of both techniques correlate best with domain-expert evaluation.
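The Unanimous Voting scheme mentioned above can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual implementation: the verdict labels, helper names, and the conservative tie-breaking rule (disagreement counts as not factual) are all assumptions for the sake of the example.

```python
# Illustrative sketch of Unanimous Voting over two fact-checking signals
# (CoT prompting and NLI). A claim counts as factual only when BOTH
# techniques independently label it as factual. Labels and function
# names are hypothetical, not from the FActBench codebase.

def unanimous_vote(cot_verdict: str, nli_verdict: str) -> str:
    """Combine two per-claim verdicts; agreement is required."""
    if cot_verdict == nli_verdict:
        return cot_verdict           # both techniques agree
    return "not_factual"             # disagreement -> conservative label

def fact_score(claims):
    """Fraction of claims unanimously judged factual."""
    votes = [unanimous_vote(cot, nli) for cot, nli in claims]
    return sum(v == "factual" for v in votes) / len(votes)

# Toy example: four claims with (CoT, NLI) verdict pairs.
claims = [
    ("factual", "factual"),
    ("factual", "not_factual"),      # disagreement -> not factual
    ("not_factual", "not_factual"),
    ("factual", "factual"),
]
print(fact_score(claims))  # 0.5
```

The design choice here is deliberately strict: requiring unanimity trades recall for precision, which matches the abstract's finding that the unanimous scores align best with domain-expert judgment.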