🤖 AI Summary
Medical large language models (LLMs) frequently produce factual hallucinations, yet existing evaluation benchmarks lack fine-grained, automated assessment of factual accuracy. To address this, we propose FActBench, the first fine-grained, automated factuality benchmark tailored to medical text generation, covering four medical generation tasks and six mainstream LLMs. Methodologically, FActBench combines chain-of-thought prompting and natural language inference (NLI) through a unanimous voting mechanism, validated against expert physician annotations. Experiments show strong agreement between FActBench's automated scores and expert judgments (Spearman's ρ > 0.92), significantly outperforming baseline approaches, and the framework remains robust and consistent across diverse models and tasks. FActBench thus provides a reproducible, scalable, and standardized framework for assessing factual fidelity and mitigating hallucinations in medical LLMs.
📝 Abstract
Large Language Models (LLMs) tend to struggle in specialized domains. While all aspects of evaluation are important, factuality is the most critical; likewise, reliable fact-checking tools and data sources are essential for hallucination mitigation. We address these issues by providing FActBench, a comprehensive fact-checking benchmark covering four generation tasks and six state-of-the-art LLMs in the medical domain. We use two state-of-the-art fact-checking techniques: Chain-of-Thought (CoT) prompting and Natural Language Inference (NLI). Our experiments show that fact-checking scores obtained through Unanimous Voting of both techniques correlate best with domain-expert evaluation.
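The Unanimous Voting scheme mentioned above can be sketched as follows. This is a minimal illustration of the idea, not the paper's actual implementation: the verdict labels, helper names, and the conservative tie-breaking rule (disagreement counts as not factual) are all assumptions for the sake of the example.

```python
# Illustrative sketch of Unanimous Voting over two fact-checking signals
# (CoT prompting and NLI). A claim counts as factual only when BOTH
# techniques independently label it as factual. Labels and function
# names are hypothetical, not from the FActBench codebase.

def unanimous_vote(cot_verdict: str, nli_verdict: str) -> str:
    """Combine two per-claim verdicts; agreement is required."""
    if cot_verdict == nli_verdict:
        return cot_verdict           # both techniques agree
    return "not_factual"             # disagreement -> conservative label

def fact_score(claims):
    """Fraction of claims unanimously judged factual."""
    votes = [unanimous_vote(cot, nli) for cot, nli in claims]
    return sum(v == "factual" for v in votes) / len(votes)

# Toy example: four claims with (CoT, NLI) verdict pairs.
claims = [
    ("factual", "factual"),
    ("factual", "not_factual"),      # disagreement -> not factual
    ("not_factual", "not_factual"),
    ("factual", "factual"),
]
print(fact_score(claims))  # 0.5
```

The design choice here is deliberately strict: requiring unanimity trades recall for precision, which matches the abstract's finding that the unanimous scores align best with domain-expert judgment.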