MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs

📅 2025-12-23

🤖 AI Summary

Current evaluations of medical large language models (LLMs) are fragmented: they either test biomedical knowledge recall in isolation or assess patient-level reasoning without verifying its correctness, so they cannot establish clinical reliability. To address this, the authors introduce MediEval, a patient-level benchmark that links real-world electronic health records (EHRs) from MIMIC-IV to a unified knowledge base built from UMLS and other biomedical vocabularies. A four-quadrant evaluation framework jointly considers knowledge grounding and contextual consistency, and is used to identify and quantify critical safety failure modes, including hallucinated support and truth inversion. The authors also present Counterfactual Risk-Aware Fine-tuning (CoRFu), a method based on Direct Preference Optimization (DPO) with an asymmetric penalty that targets unsafe confusions. In the reported experiments, CoRFu improves macro-F1 by 16.4 points over the base model and eliminates truth inversion errors, outperforming leading open-source and proprietary medical LLMs in accuracy and safety.

📝 Abstract

Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.
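As a rough illustration of the evaluation idea, the sketch below assigns a statement to one of four quadrants from two boolean judgments (knowledge grounding and contextual consistency) and computes macro-F1 over predicted labels. The quadrant names, the function names, and the choice of two boolean axes are assumptions made for illustration; the abstract does not specify how MediEval labels its quadrants or computes its scores.

```python
from typing import Sequence


def quadrant(knowledge_grounded: bool, context_consistent: bool) -> str:
    """Map two boolean judgments to a quadrant label.

    The label strings are hypothetical; they are not taken from the paper.
    """
    if knowledge_grounded and context_consistent:
        return "grounded_consistent"
    if knowledge_grounded:
        return "grounded_inconsistent"
    if context_consistent:
        return "ungrounded_consistent"
    return "ungrounded_inconsistent"


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0 by convention.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because macro-F1 averages per-class scores without weighting by class frequency, a model that performs poorly on a rare quadrant is penalized as heavily as one that fails on a common quadrant.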

Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' reliability in patient-contextual medical reasoning

Identifies critical failure modes in medical LLMs, such as hallucinated support and truth inversion

Proposes a fine-tuning method (CoRFu) to improve safety and accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark links EHRs to unified knowledge base

Framework evaluates knowledge grounding and contextual consistency

DPO-based fine-tuning method with asymmetric penalty
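To make the last point concrete, here is a minimal sketch of a per-pair DPO loss with an extra weight for high-risk preference pairs. The standard DPO term is the negative log-sigmoid of the scaled difference between policy and reference log-probability ratios. The `risk_weight` multiplier is an assumption used here to illustrate one possible asymmetric penalty; the abstract does not give CoRFu's actual formulation, and the default `beta` is likewise illustrative.

```python
import math


def asymmetric_dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
    risk_weight: float = 1.0,
) -> float:
    """Per-pair DPO loss scaled by a risk weight.

    risk_weight > 1 is a hypothetical way to penalize pairs whose rejected
    answer is an unsafe confusion more heavily than ordinary pairs.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form is numerically stable.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return risk_weight * loss
```

With `risk_weight=1.0` this reduces to the ordinary DPO loss for a single pair. Raising the weight only for pairs that encode a dangerous error makes the objective asymmetric: the model pays more for preferring an unsafe answer than for an equally sized mistake on a benign pair.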
