The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal medical large language models (e.g., GPT-5) achieve high scores on mainstream benchmarks despite exhibiting severe fragility: they rely on spurious input shortcuts (e.g., textual cues that let them infer image content), produce answers that are highly sensitive to irrelevant prompt perturbations, and generate plausible yet hallucinated reasoning. These flaws stem from benchmark designs that over-reward test-taking heuristics and conflate distinct capability dimensions. Method: We propose the first systematic stress-testing framework aligned with clinical practice, incorporating three critical challenges (zero-image input, minimal prompt perturbations, and factual reasoning evaluation), augmented by clinician-in-the-loop scoring of robustness and reasoning fidelity. Contribution/Results: Applied to six models across six benchmarks, our framework reveals pervasive brittleness underlying high benchmark scores. It establishes a clinically grounded evaluation paradigm, demonstrating that benchmark performance does not equate to clinical readiness and underscoring the urgent need for more rigorous, domain-informed assessment standards.

📝 Abstract
Large frontier models like GPT-5 now achieve top scores on medical benchmarks. But our stress tests tell a different story. Leading systems often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning. These aren't glitches; they expose how today's benchmarks reward test-taking tricks over medical understanding. We evaluate six flagship models across six widely used benchmarks and find that high leaderboard scores hide brittleness and shortcut learning. Through clinician-guided rubric evaluation, we show that benchmarks vary widely in what they truly measure yet are treated interchangeably, masking failure modes. We caution that medical benchmark scores do not directly reflect real-world readiness. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold systems accountable for robustness, sound reasoning, and alignment with real medical demands.
Problem

Research questions and friction points this paper is trying to address.

Stress testing reveals model brittleness that medical AI benchmarks miss
Benchmark scores mask shortcut learning and flawed reasoning
Leaderboard performance does not reflect real-world clinical readiness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stress testing models by removing key inputs like images
Evaluating brittleness through trivial prompt changes
Using clinician-guided rubric evaluation for real assessment
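The stress tests above can be sketched in code. The following is an illustrative example, not the authors' actual harness: it assumes a simple dict-based format for a multimodal QA item and builds the zero-image and minimal-perturbation variants the paper describes. All field names and the sample item are hypothetical.

```python
# Illustrative sketch (hypothetical item schema): building stress-test
# variants of a multimodal medical QA item.
import random

def make_stress_variants(item, seed=0):
    """Return perturbed copies of a benchmark item for stress testing."""
    rng = random.Random(seed)
    variants = {}

    # 1) Zero-image input: drop the image. A robust model should abstain or
    #    degrade, not keep answering from textual cues alone.
    no_image = dict(item)
    no_image["image"] = None
    variants["zero_image"] = no_image

    # 2) Minimal prompt perturbation: shuffle the answer options, which
    #    should not change which answer the model selects.
    shuffled = dict(item)
    opts = list(item["options"])
    rng.shuffle(opts)
    shuffled["options"] = opts
    variants["shuffled_options"] = shuffled

    # 3) Irrelevant context: prepend a clinically irrelevant sentence that
    #    a robust model should ignore.
    padded = dict(item)
    padded["question"] = ("Note: the hospital cafeteria is closed today. "
                          + item["question"])
    variants["irrelevant_context"] = padded
    return variants

item = {
    "image": "cxr_0001.png",
    "question": "Which finding is most consistent with this chest X-ray?",
    "options": ["Pneumothorax", "Cardiomegaly", "Pleural effusion", "Normal"],
    "answer": "Pleural effusion",
}
variants = make_stress_variants(item)
```

A flagship model whose accuracy barely drops on the `zero_image` variant is plausibly exploiting textual shortcuts rather than reading the image, which is exactly the failure mode the paper reports.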
👥 Authors
- Yu Gu: Microsoft Research, Health & Life Sciences
- Jingjing Fu: Microsoft; image/video processing
- Xiaodong Liu: Microsoft Research, Health & Life Sciences
- Jeya Maria Jose Valanarasu: Microsoft Research, Health & Life Sciences
- Noel Codella: Principal Researcher, Microsoft; artificial intelligence, machine learning, computer vision
- Reuben Tan: Microsoft Research, Health & Life Sciences
- Qianchu Liu: Microsoft Research; natural language processing
- Ying Jin: Microsoft Research, Health & Life Sciences
- Sheng Zhang: Microsoft Research, Health & Life Sciences
- Jinyu Wang: Microsoft Research, Health & Life Sciences
- Rui Wang: Microsoft Research, Health & Life Sciences
- Lei Song: Microsoft Research, Health & Life Sciences
- Guanghui Qin: Microsoft; machine learning, healthcare
- Naoto Usuyama: Principal Researcher, Microsoft Research; artificial intelligence, precision medicine, computer vision, natural language processing
- Cliff Wong: Microsoft Research, Health & Life Sciences
- Cheng Hao: Hebei University of Technology; few-shot learning, class-incremental learning
- Hohin Lee: Microsoft Research, Health & Life Sciences
- Praneeth Sanapathi: Microsoft Research, Health & Life Sciences
- Sarah Hilado: Microsoft Research, Health & Life Sciences
- Bian Jiang: Microsoft Research, Health & Life Sciences
- Javier Alvarez-Valle: Microsoft Research, Health & Life Sciences
- Mu Wei: Microsoft Research, Health & Life Sciences
- Jianfeng Gao: Microsoft Research, Health & Life Sciences
- Eric Horvitz: Microsoft; machine intelligence, decision theory, decisions under uncertainty, information retrieval, bounded
- Matt Lungren: Microsoft Research, Health & Life Sciences