MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models

📅 2024-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical large language models (MLLMs) frequently generate clinically harmful hallucinations, necessitating high-fidelity evaluation tools. To address this, we introduce MedHallBench—the first comprehensive benchmark specifically designed for hallucination assessment in MLLMs—integrating expert-validated clinical cases and authoritative medical databases to enable multidimensional, clinically interpretable hallucination detection. We propose Automatic Caption Hallucination Measurement in Medical Imaging (ACHMI), a novel metric that quantifies hallucination severity and clinical impact more precisely than conventional metrics. Additionally, we design a medical-domain-adapted RLHF training pipeline and establish a dual-track verification framework combining automated annotation with clinical expert validation. Baseline evaluations across mainstream LLMs demonstrate that ACHMI significantly improves assessment sensitivity and clinical credibility. MedHallBench thus provides a critical evaluation infrastructure for safe, reliable medical AI systems.

📝 Abstract
Medical Large Language Models (MLLMs) have demonstrated potential in healthcare applications, yet their propensity for hallucinations -- generating medically implausible or inaccurate information -- presents substantial risks to patient care. This paper introduces MedHallBench, a comprehensive benchmark framework for evaluating and mitigating hallucinations in MLLMs. Our methodology integrates expert-validated medical case scenarios with established medical databases to create a robust evaluation dataset. The framework employs a sophisticated measurement system that combines automated ACHMI (Automatic Caption Hallucination Measurement in Medical Imaging) scoring with rigorous clinical expert evaluations and utilizes reinforcement learning methods to achieve automatic annotation. Through an optimized reinforcement learning from human feedback (RLHF) training pipeline specifically designed for medical applications, MedHallBench enables thorough evaluation of MLLMs across diverse clinical contexts while maintaining stringent accuracy standards. We conducted comparative experiments involving various models, utilizing the benchmark to establish a baseline for widely adopted large language models (LLMs). Our findings indicate that ACHMI provides a more nuanced understanding of the effects of hallucinations compared to traditional metrics, thereby highlighting its advantages in hallucination assessment. This research establishes a foundational framework for enhancing MLLMs' reliability in healthcare settings and presents actionable strategies for addressing the critical challenge of AI hallucinations in medical applications.
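The abstract describes a dual-track measurement system that combines automated ACHMI scoring with clinical expert evaluations. As a rough illustration only (the paper's actual ACHMI formula is not given here, and the names, weights, and blending scheme below are assumptions), a per-claim automated severity score could be merged with binary expert confirmations like this:

```python
# Hypothetical sketch of a dual-track hallucination check: an automated
# severity score for each generated claim is blended with binary clinical
# expert validation. This is NOT the paper's ACHMI definition; the weighting
# scheme is an assumption for illustration only.

from dataclasses import dataclass

@dataclass
class ClaimAssessment:
    auto_severity: float   # automated hallucination severity in [0, 1]
    expert_flagged: bool   # did a clinical expert confirm the hallucination?

def combined_hallucination_index(assessments, expert_weight=0.7):
    """Weighted blend of mean automated severity and expert confirmation rate."""
    if not assessments:
        return 0.0
    auto = sum(a.auto_severity for a in assessments) / len(assessments)
    expert = sum(a.expert_flagged for a in assessments) / len(assessments)
    return expert_weight * expert + (1 - expert_weight) * auto

claims = [
    ClaimAssessment(0.9, True),   # severe, expert-confirmed hallucination
    ClaimAssessment(0.2, False),  # low-severity, not confirmed by experts
]
print(round(combined_hallucination_index(claims), 3))  # prints 0.515
```

Weighting expert judgments above automated scores reflects the abstract's emphasis on maintaining stringent accuracy standards through clinical validation; the 0.7 weight is purely illustrative.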
Problem

Research questions and friction points this paper is trying to address.

Medical Language Models
Error Detection
Patient Safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

MedHallBench
Automated Annotation
Medical Language Models