🤖 AI Summary
Current multimodal large language models applied to medicine (e.g., GPT-5) achieve high scores on mainstream medical benchmarks despite severe fragility: they rely on spurious input shortcuts (e.g., textual cues that let them infer image content), produce answers that are highly sensitive to clinically irrelevant prompt perturbations, and generate plausible yet hallucinated reasoning. These flaws stem from benchmark designs that over-reward test-taking heuristics and conflate distinct capability dimensions. Method: We propose the first systematic stress-testing framework aligned with clinical practice, built around three critical challenges: zero-image input, minimal prompt perturbations, and factual-reasoning evaluation, augmented by clinician-in-the-loop scoring of robustness and reasoning fidelity. Contribution/Results: Applied to six models across six benchmarks, the framework reveals pervasive brittleness underlying high benchmark scores. It establishes a clinically grounded evaluation paradigm, demonstrating that benchmark performance does not equate to clinical readiness and underscoring the urgent need for more rigorous, domain-informed assessment standards.
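To make the first two stress tests concrete, here is a minimal Python sketch of a zero-image ablation and a trivial prompt perturbation. The `VQACase` fields, the `model_fn` interface, and the choice of perturbation (reversing the option order) are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of two stress tests: zero-image input and a trivial
# prompt perturbation. The data layout and model interface are assumed.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class VQACase:
    question: str
    choices: list[str]
    image: Optional[bytes]  # raw image payload; None simulates zero-image input
    answer: str             # gold answer as option text, e.g. "pneumothorax"

# model_fn(question, choices, image) -> the chosen option text
ModelFn = Callable[[str, list[str], Optional[bytes]], str]

def zero_image_accuracy(model_fn: ModelFn, cases: list[VQACase]) -> float:
    """Accuracy with every image withheld. Scores well above chance suggest
    the model exploits textual shortcuts rather than reading the image."""
    hits = sum(model_fn(c.question, c.choices, None) == c.answer for c in cases)
    return hits / len(cases)

def perturbation_flip_rate(model_fn: ModelFn, cases: list[VQACase]) -> float:
    """Fraction of predictions that change under a semantically irrelevant
    prompt change (here: presenting the answer options in reverse order)."""
    flips = 0
    for c in cases:
        base = model_fn(c.question, c.choices, c.image)
        perturbed = model_fn(c.question, list(reversed(c.choices)), c.image)
        flips += base != perturbed
    return flips / len(cases)
```

Under this framing, a model that genuinely uses the image should score near chance on `zero_image_accuracy` and keep the flip rate low; the summary above reports that leading models violate both expectations.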
📝 Abstract
Large frontier models like GPT-5 now achieve top scores on medical benchmarks. But our stress tests tell a different story. Leading systems often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning. These aren't glitches; they expose how today's benchmarks reward test-taking tricks over medical understanding. We evaluate six flagship models across six widely used benchmarks and find that high leaderboard scores hide brittleness and shortcut learning. Through clinician-guided rubric evaluation, we show that benchmarks vary widely in what they truly measure yet are treated interchangeably, masking failure modes. We caution that medical benchmark scores do not directly reflect real-world readiness. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins, holding systems accountable for robustness, sound reasoning, and alignment with real medical demands.
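As a rough illustration of the clinician-in-the-loop rubric evaluation along the two axes named above (robustness and reasoning fidelity), here is a minimal aggregation sketch. The 0-5 scale, the field names, and the simple per-model mean are assumptions for illustration, not the study's scoring protocol.

```python
# Minimal sketch: aggregate clinician rubric ratings into per-model means
# along two assumed axes, robustness and reasoning fidelity (0-5 scale).
from collections import defaultdict
from statistics import mean

AXES = ("robustness", "reasoning_fidelity")

def aggregate_rubric(ratings: list[dict]) -> dict[str, dict[str, float]]:
    """ratings: one dict per (clinician, model, case) with a 'model' key
    and a 0-5 score for each rubric axis. Returns mean score per model."""
    by_model = defaultdict(lambda: {axis: [] for axis in AXES})
    for r in ratings:
        for axis in AXES:
            by_model[r["model"]][axis].append(r[axis])
    return {m: {axis: mean(scores) for axis, scores in per_axis.items()}
            for m, per_axis in by_model.items()}

# Example: hypothetical ratings from two clinicians on two models.
ratings = [
    {"model": "model_a", "robustness": 4, "reasoning_fidelity": 2},
    {"model": "model_a", "robustness": 3, "reasoning_fidelity": 3},
    {"model": "model_b", "robustness": 1, "reasoning_fidelity": 4},
]
print(aggregate_rubric(ratings))
```

Reporting the axes separately, rather than as one leaderboard number, is what lets this kind of evaluation surface models that score well overall while failing on robustness or reasoning fidelity in isolation.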