🤖 AI Summary
Existing medical AI benchmarks inadequately assess expert-level clinical reasoning. Method: We introduce MedXpertQA, a highly challenging, specialty-comprehensive medical evaluation benchmark covering 17 specialties and 11 body systems with 4,460 questions, divided into a text-only subset and a multimodal subset that integrates diverse images and rich clinical information such as patient records and examination results. The benchmark draws on authentic medical licensing and specialty board exam questions, adds a reasoning-oriented subset for evaluating o1-like models, and safeguards validity and reliability through multiple rounds of expert review, leakage-mitigating data synthesis, and rigorous filtering and augmentation to raise difficulty. Results: Systematic evaluation of 16 state-of-the-art models reveals persistent bottlenecks in expert-level medical knowledge, advanced reasoning, and multimodal clinical integration, establishing MedXpertQA as a benchmark for assessing and advancing medical foundation models.
📝 Abstract
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It comprises two subsets: Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks whose simple QA pairs are generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.
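For readers who want to try the benchmark, the sketch below shows one way to score a model on the text subset using the Hugging Face `datasets` library. The dataset ID (`TsinghuaC3I/MedXpertQA`), config name (`Text`), split, and field names (`question`, `options`, `label`) are assumptions based on common Hub conventions rather than details from the abstract; consult the released dataset card for the actual schema.

```python
# Minimal evaluation sketch for the MedXpertQA Text subset.
# NOTE: the dataset ID, config name, split, and field names below are
# assumptions for illustration; verify them against the dataset card.
from datasets import load_dataset


def evaluate(predict_fn, split="test"):
    """Compute accuracy of predict_fn, which maps (question, options)
    to a predicted option letter such as "A"."""
    ds = load_dataset("TsinghuaC3I/MedXpertQA", "Text", split=split)  # assumed ID/config
    correct = 0
    for ex in ds:
        pred = predict_fn(ex["question"], ex["options"])  # assumed field names
        correct += int(pred == ex["label"])  # assumed label field
    return correct / len(ds)


if __name__ == "__main__":
    # Trivial baseline that always answers "A"; replace the lambda
    # with a call into the model you want to benchmark.
    print(f"Accuracy: {evaluate(lambda q, opts: 'A'):.3f}")
```

The MM subset would additionally require passing each question's associated images to a multimodal model alongside the clinical text.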