MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

📅 2025-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical AI benchmarks inadequately assess expert-level clinical reasoning. Method: The paper introduces MedXpertQA, a high-difficulty, specialty-comprehensive medical evaluation benchmark covering 17 specialties and 11 body systems with 4,460 questions, split into a text-only subset (Text) and a multimodal subset (MM) that pairs diverse images with rich clinical information such as patient records and examination results. Validity and reliability are ensured through rigorous filtering and augmentation to raise difficulty beyond benchmarks like MedQA, data synthesis to mitigate leakage risk, and multiple rounds of expert review; a reasoning-oriented subset additionally supports the assessment of o1-like models. Results: Evaluation of 16 leading models reveals critical gaps in expert-level medical reasoning, multimodal clinical integration, and long-chain clinical decision-making, positioning MedXpertQA as a diagnostic benchmark for assessing and advancing medical foundation models.

📝 Abstract
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.
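
As a rough illustration of how a multiple-choice benchmark like MedXpertQA is typically consumed, below is a minimal Python evaluation sketch for the text subset. The dataset ID, config and split names, field names (question, options, label), and the query_model stub are all assumptions for illustration, not the paper's released interface; check the official release for the real schema.

```python
# Minimal sketch of scoring a model on a MedXpertQA-style multiple-choice
# benchmark. The dataset ID, config/split names, and field names below are
# assumptions for illustration, not the paper's published interface.
from datasets import load_dataset


def query_model(prompt: str) -> str:
    """Placeholder for an actual model call (API or local inference).

    Expected to return a single option letter such as "A".
    """
    raise NotImplementedError


def format_prompt(item: dict) -> str:
    # Assumed schema: `question` is a string, `options` maps letters to texts.
    options = "\n".join(
        f"{letter}. {text}" for letter, text in sorted(item["options"].items())
    )
    return (
        f"{item['question']}\n\n{options}\n\n"
        "Answer with the letter of the single best option."
    )


def evaluate(split) -> float:
    correct = 0
    for item in split:
        # Keep only the first character of the reply as the predicted letter.
        prediction = query_model(format_prompt(item)).strip().upper()[:1]
        correct += prediction == item["label"]  # assumed gold-answer field
    return correct / len(split)


if __name__ == "__main__":
    # Hypothetical dataset ID and config; the Text subset covers text-only items.
    text_split = load_dataset("TsinghuaC3I/MedXpertQA", "Text", split="test")
    print(f"Text-subset accuracy: {evaluate(text_split):.3f}")
```

Evaluating the MM subset would follow the same loop but pass the associated images to a multimodal model alongside the formatted prompt.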
Problem

Research questions and pain points this paper aims to address.

Medical Knowledge Assessment
Complex Image-Text Analysis
Decision-making Skills
Innovation

Methods, ideas, or system contributions that make the work stand out.

Medical Decision-Making
Multimodal Question Answering
Synthetic Data
👥 Authors

Yuxin Zuo
Tsinghua University, Beijing, China

Shang Qu
Tsinghua University
Interests: AI4Bio

Yifei Li
Tsinghua University, Beijing, China

Zhangren Chen
Tsinghua University, Beijing, China

Xuekai Zhu
Shanghai Jiao Tong University
Interests: Synthetic Data, Reasoning, Language Model

Ermo Hua
Tsinghua University
Interests: Physics-driven Foundation Model

Kaiyan Zhang
Tsinghua University
Interests: Foundation Model, Collective Intelligence, Scientific Intelligence

Ning Ding
Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China

Bowen Zhou
Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China