Benchmarking GPT-5 for Zero-Shot Multimodal Medical Reasoning in Radiology and Radiation Oncology

📅 2025-08-15

🤖 AI Summary
This study evaluates the zero-shot clinical reasoning of large multimodal models in radiology and radiation oncology, high-stakes domains where safe decision-making depends on integrating medical images, textual reports, and quantitative data. Method: GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) are compared against GPT-4o on three complementary tasks under zero-shot conditions: radiology visual question answering (VQA-RAD), cross-modal grounding on a semantically annotated, multilingual VQA dataset (SLAKE), and a curated set of 150 Medical Physics Board Examination-style multiple-choice questions. Contribution/Results: GPT-5 achieved the highest accuracy on all datasets, with gains over GPT-4o of up to +20.00% (on chest-mediastinal questions), and reached 90.7% accuracy (136/150) on the medical physics questions, above the estimated human passing threshold; GPT-4o scored 78.0% on the same questions. The results point to GPT-5's potential to augment expert workflows in medical imaging and therapeutic physics.

📝 Abstract
Radiology, radiation oncology, and medical physics require decision-making that integrates medical images, textual reports, and quantitative data under high-stakes conditions. With the introduction of GPT-5, it is critical to assess whether recent advances in large multimodal models translate into measurable gains in these safety-critical domains. We present a targeted zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks: (1) VQA-RAD, a benchmark for visual question answering in radiology; (2) SLAKE, a semantically annotated, multilingual VQA dataset testing cross-modal grounding; and (3) a curated Medical Physics Board Examination-style dataset of 150 multiple-choice questions spanning treatment planning, dosimetry, imaging, and quality assurance. Across all datasets, GPT-5 achieved the highest accuracy, with substantial gains over GPT-4o: up to +20.00% in challenging anatomical regions such as the chest-mediastinal region, +13.60% in lung-focused questions, and +11.44% in brain-tissue interpretation. On the board-style physics questions, GPT-5 attained 90.7% accuracy (136/150), exceeding the estimated human passing threshold, while GPT-4o trailed at 78.0%. These results demonstrate that GPT-5 delivers consistent and often pronounced performance improvements over GPT-4o in both image-grounded reasoning and domain-specific numerical problem-solving, highlighting its potential to augment expert workflows in medical imaging and therapeutic physics.
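The board-style accuracy figures in the abstract can be checked with a few lines of arithmetic. The sketch below uses the stated 136/150 for GPT-5. The GPT-4o count of 117 is an assumption: the abstract reports only 78.0%, and 117 is the count that percentage implies for 150 questions.

```python
# Counts from the abstract: GPT-5 answered 136 of 150 questions correctly.
TOTAL_QUESTIONS = 150
GPT5_CORRECT = 136

# Assumption: the abstract gives GPT-4o only as 78.0%; 117 is the count
# that percentage implies for 150 questions, not a figure from the source.
GPT4O_CORRECT = round(0.780 * TOTAL_QUESTIONS)


def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage."""
    return 100.0 * correct / total


gpt5_acc = accuracy_pct(GPT5_CORRECT, TOTAL_QUESTIONS)
gpt4o_acc = accuracy_pct(GPT4O_CORRECT, TOTAL_QUESTIONS)
gap_points = gpt5_acc - gpt4o_acc

print(f"GPT-5:  {gpt5_acc:.1f}%")
print(f"GPT-4o: {gpt4o_acc:.1f}%")
print(f"Gap:    {gap_points:.1f} percentage points")
```

Under that assumption, the gap on the physics questions is about 12.7 percentage points, which is separate from the per-region gains reported for the VQA tasks.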
Problem

Research questions and friction points this paper is trying to address.

Evaluating GPT-5's zero-shot multimodal reasoning in medical domains
Assessing performance gains over GPT-4o in radiology and radiation oncology
Testing model accuracy on medical imaging and physics board questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot evaluation of GPT-5 variants
Multimodal medical reasoning benchmark testing
Performance comparison against GPT-4o
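
As a rough illustration of what zero-shot multiple-choice scoring involves, the sketch below formats each question without any worked examples, extracts an option letter from the model's reply, and computes accuracy. It is not the paper's evaluation code: the function names, the prompt wording, the naive letter-extraction rule, the `stub_model` callable, and the placeholder items are all hypothetical.

```python
import re
from typing import Callable, Optional, Sequence

LETTERS = "ABCDEFGH"


def build_zero_shot_prompt(question: str, options: Sequence[str]) -> str:
    """Format one multiple-choice item with no worked examples (zero-shot)."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter in the reply, if any.

    Naive rule for illustration only: a reply that starts with the
    article "A" would be misread as choice A.
    """
    match = re.search(r"\b([A-H])\b", response.upper())
    return match.group(1) if match else None


def score(items: Sequence[dict], ask_model: Callable[[str], str]) -> float:
    """Fraction of items where the extracted letter matches the answer key."""
    correct = 0
    for item in items:
        prompt = build_zero_shot_prompt(item["question"], item["options"])
        if extract_choice(ask_model(prompt)) == item["answer"]:
            correct += 1
    return correct / len(items)


# Placeholder items, not taken from any dataset in the paper.
toy_items = [
    {"question": "Placeholder question 1", "options": ["w", "x", "y", "z"], "answer": "B"},
    {"question": "Placeholder question 2", "options": ["w", "x", "y", "z"], "answer": "D"},
]


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always answers B."""
    return "The answer is B."


toy_accuracy = score(toy_items, stub_model)
print(f"Toy accuracy: {toy_accuracy:.2f}")
```

Replacing `stub_model` with a function that sends the prompt (and, for VQA tasks, the image) to an actual model would be the main change needed. How the paper prompted the models and parsed their answers is not described in this summary.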