🤖 AI Summary
Existing medical AI benchmarks inadequately assess expert-level clinical reasoning. Method: We introduce MedXpertQA, a highly challenging, specialty-comprehensive medical evaluation benchmark covering 17 specialties and 11 body systems with 4,460 questions, divided into a text-only subset and a multimodal subset that integrates diverse images and rich clinical information such as patient records and examination results. The benchmark draws on authentic medical licensing and specialty board exam questions, adds a reasoning-oriented subset for evaluating o1-like models, and safeguards validity and reliability through multiple rounds of expert review, leakage-mitigating data synthesis, and rigorous filtering and augmentation to raise difficulty. Results: Systematic evaluation of 16 state-of-the-art models reveals persistent bottlenecks in expert-level medical knowledge, advanced reasoning, and multimodal clinical integration, establishing MedXpertQA as a benchmark for assessing and advancing medical foundation models.
📝 Abstract
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It comprises two subsets: Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks whose simple QA pairs are generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.
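For readers who want to try the benchmark, the sketch below shows one way to score a model on the text subset using the Hugging Face `datasets` library. The dataset ID (`TsinghuaC3I/MedXpertQA`), config name (`Text`), split, and field names (`question`, `options`, `label`) are assumptions based on common Hub conventions rather than details from the abstract; consult the released dataset card for the actual schema.

```python
# Minimal evaluation sketch for the MedXpertQA Text subset.
# NOTE: the dataset ID, config name, split, and field names below are
# assumptions for illustration; verify them against the dataset card.
from datasets import load_dataset


def evaluate(predict_fn, split="test"):
    """Compute accuracy of predict_fn, which maps (question, options)
    to a predicted option letter such as "A"."""
    ds = load_dataset("TsinghuaC3I/MedXpertQA", "Text", split=split)  # assumed ID/config
    correct = 0
    for ex in ds:
        pred = predict_fn(ex["question"], ex["options"])  # assumed field names
        correct += int(pred == ex["label"])  # assumed label field
    return correct / len(ds)


if __name__ == "__main__":
    # Trivial baseline that always answers "A"; replace the lambda
    # with a call into the model you want to benchmark.
    print(f"Accuracy: {evaluate(lambda q, opts: 'A'):.3f}")
```

The MM subset would additionally require passing each question's associated images to a multimodal model alongside the clinical text.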