🤖 AI Summary
Evaluating the zero-shot multimodal reasoning capabilities of large language models—particularly for clinical decision support integrating textual (e.g., patient narratives, structured electronic health records) and visual (e.g., medical imaging) data—remains a critical yet underexplored challenge.
Method: We propose a unified evaluation framework incorporating zero-shot chain-of-thought reasoning, cross-modal contextual alignment, and end-to-end clinical decision chain modeling. We benchmark GPT-5 on MedQA, VQA-RAD, and MedXpertQA MM, a rigorous, expert-validated multimodal medical QA benchmark.
Contribution/Results: GPT-5 achieves state-of-the-art performance across all benchmarks, attaining absolute gains of +29.62% in reasoning and +36.18% in understanding over GPT-4o on MedXpertQA MM and, for the first time, surpassing pre-licensed human expert performance in zero-shot multimodal medical question answering. These results point toward trustworthy, interpretable, and clinically grounded multimodal decision support systems.
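At its core, the zero-shot evaluation protocol described above amounts to prompting each model with a chain-of-thought instruction (no exemplars) and scoring the extracted final answer against the gold label. A minimal sketch of such a harness, with the model call stubbed out and all names hypothetical:

```python
import re

def build_prompt(question, options):
    """Zero-shot CoT prompt: no few-shot exemplars, just a reasoning cue."""
    opts = "\n".join(f"{k}. {v}" for k, v in sorted(options.items()))
    return (f"{question}\n{opts}\n"
            "Let's think step by step, then answer with 'Answer: <letter>'.")

def extract_answer(response):
    """Pull the final option letter from the model's free-text reasoning."""
    matches = re.findall(r"Answer:\s*([A-E])", response)
    return matches[-1] if matches else None

def accuracy(items, query_model):
    """Fraction of multiple-choice items where the extracted answer is correct."""
    correct = 0
    for item in items:
        reply = query_model(build_prompt(item["question"], item["options"]))
        correct += (extract_answer(reply) == item["answer"])
    return correct / len(items)

# Stub standing in for an actual GPT-5 API call (hypothetical):
def fake_model(prompt):
    return "Scurvy results from impaired collagen synthesis. Answer: B"

items = [{"question": "Which vitamin deficiency causes scurvy?",
          "options": {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D"},
          "answer": "B"}]
print(accuracy(items, fake_model))  # → 1.0
```

In the actual study, `query_model` would wrap a call to each evaluated model under identical prompts, which is what makes the comparison across GPT-5 variants and GPT-4o a unified protocol.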
📝 Abstract
Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5's ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.