Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical multimodal large language models (MLLMs) hold significant promise for clinical decision support, yet a critical gap remains: the absence of diagnostically accurate, clinically aligned reward models and evaluation benchmarks. To address this, we introduce Med-RewardBench—the first dedicated multimodal reward modeling benchmark for healthcare—spanning 13 organ systems and 8 clinical specialties, with 1,026 rigorously curated cases annotated by domain experts across three stages and evaluated along six clinically grounded dimensions. Our work establishes the first fine-grained, clinical-domain-specific reward modeling benchmark, enabling systematic assessment of MLLM alignment with expert judgment. Comprehensive evaluation of 32 state-of-the-art MLLMs reveals pervasive misalignment with clinical reasoning. Leveraging our expert-annotated data, we further fine-tune multimodal reward models, achieving substantial improvements in clinical alignment. This benchmark and methodology advance rigorous, clinically meaningful evaluation and optimization of medical MLLMs.

📝 Abstract
Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking medical reward models and judges for MLLMs
Addressing lack of clinical evaluation dimensions in existing benchmarks
Ensuring diagnostic accuracy and clinical relevance in medical responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Med-RewardBench benchmark for medical reward models
Multimodal dataset spanning 13 organ systems
Baseline models with fine-tuning performance improvements
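The paper evaluates reward models and judges by how well their scores align with expert judgment. A common metric for this kind of benchmark is pairwise preference accuracy: the fraction of expert-labeled (preferred, rejected) response pairs the model ranks correctly. The sketch below is illustrative only; Med-RewardBench's exact scoring protocol and its six evaluation dimensions may differ, and all names here (`pairwise_accuracy`, `resp_a`, etc.) are hypothetical.

```python
def pairwise_accuracy(reward_scores, expert_preferences):
    """Fraction of expert preference pairs the reward model ranks correctly.

    reward_scores: dict mapping response id -> model-assigned scalar score.
    expert_preferences: list of (chosen_id, rejected_id) pairs, where domain
    experts judged `chosen_id` to be the better clinical response.
    """
    correct = sum(
        reward_scores[chosen] > reward_scores[rejected]
        for chosen, rejected in expert_preferences
    )
    return correct / len(expert_preferences)


# Toy example: three candidate responses scored by a reward model,
# checked against three expert-annotated preference pairs.
scores = {"resp_a": 0.9, "resp_b": 0.4, "resp_c": 0.7}
prefs = [("resp_a", "resp_b"), ("resp_c", "resp_b"), ("resp_a", "resp_c")]
print(pairwise_accuracy(scores, prefs))  # -> 1.0 (all pairs ranked correctly)
```

A fine-grained benchmark like this one would typically report such agreement separately per clinical dimension (e.g. diagnostic accuracy, clinical relevance) rather than as a single aggregate number.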
Meidan Ding
Shenzhen University
computer vision, medical image analysis
Jipeng Zhang
Hong Kong University of Science and Technology
natural language processing, question answering
Wenxuan Wang
Renmin University of China
Cheng-Yi Li
National Yang Ming Chiao Tung University
Wei-Chieh Fang
Taipei Veterans General Hospital
Hsin-Yu Wu
National Yang Ming Chiao Tung University
Haiqin Zhong
School of Biomedical Engineering, Shenzhen University
Wenting Chen
City University of Hong Kong
Linlin Shen
Shenzhen University
Deep Learning, Computer Vision, Facial Analysis/Recognition, Medical Image Analysis