🤖 AI Summary
Current multimodal large language models applied to medicine (e.g., GPT-5) achieve high scores on mainstream medical benchmarks despite severe fragility: they rely on spurious input shortcuts (e.g., textual cues that let them infer image content), produce answers that are highly sensitive to clinically irrelevant prompt perturbations, and generate plausible yet hallucinated reasoning. These flaws stem from benchmark designs that over-reward test-taking heuristics and conflate distinct capability dimensions. Method: We propose the first systematic stress-testing framework aligned with clinical practice, built around three critical challenges: zero-image input, minimal prompt perturbations, and factual-reasoning evaluation, augmented by clinician-in-the-loop scoring of robustness and reasoning fidelity. Contribution/Results: Applied to six models across six benchmarks, the framework reveals pervasive brittleness underlying high benchmark scores. It establishes a clinically grounded evaluation paradigm, demonstrating that benchmark performance does not equate to clinical readiness and underscoring the urgent need for more rigorous, domain-informed assessment standards.
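To make the first two stress tests concrete, here is a minimal Python sketch of a zero-image ablation and a trivial prompt perturbation. The `VQACase` fields, the `model_fn` interface, and the choice of perturbation (reversing the option order) are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of two stress tests: zero-image input and a trivial
# prompt perturbation. The data layout and model interface are assumed.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class VQACase:
    question: str
    choices: list[str]
    image: Optional[bytes]  # raw image payload; None simulates zero-image input
    answer: str             # gold answer as option text, e.g. "pneumothorax"

# model_fn(question, choices, image) -> the chosen option text
ModelFn = Callable[[str, list[str], Optional[bytes]], str]

def zero_image_accuracy(model_fn: ModelFn, cases: list[VQACase]) -> float:
    """Accuracy with every image withheld. Scores well above chance suggest
    the model exploits textual shortcuts rather than reading the image."""
    hits = sum(model_fn(c.question, c.choices, None) == c.answer for c in cases)
    return hits / len(cases)

def perturbation_flip_rate(model_fn: ModelFn, cases: list[VQACase]) -> float:
    """Fraction of predictions that change under a semantically irrelevant
    prompt change (here: presenting the answer options in reverse order)."""
    flips = 0
    for c in cases:
        base = model_fn(c.question, c.choices, c.image)
        perturbed = model_fn(c.question, list(reversed(c.choices)), c.image)
        flips += base != perturbed
    return flips / len(cases)
```

Under this framing, a model that genuinely uses the image should score near chance on `zero_image_accuracy` and keep the flip rate low; the summary above reports that leading models violate both expectations.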
📝 Abstract
Large frontier models like GPT-5 now achieve top scores on medical benchmarks. But our stress tests tell a different story. Leading systems often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning. These aren't glitches; they expose how today's benchmarks reward test-taking tricks over medical understanding. We evaluate six flagship models across six widely used benchmarks and find that high leaderboard scores hide brittleness and shortcut learning. Through clinician-guided rubric evaluation, we show that benchmarks vary widely in what they truly measure yet are treated interchangeably, masking failure modes. We caution that medical benchmark scores do not directly reflect real-world readiness. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins, holding systems accountable for robustness, sound reasoning, and alignment with real medical demands.
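As a rough illustration of the clinician-in-the-loop rubric evaluation along the two axes named above (robustness and reasoning fidelity), here is a minimal aggregation sketch. The 0-5 scale, the field names, and the simple per-model mean are assumptions for illustration, not the study's scoring protocol.

```python
# Minimal sketch: aggregate clinician rubric ratings into per-model means
# along two assumed axes, robustness and reasoning fidelity (0-5 scale).
from collections import defaultdict
from statistics import mean

AXES = ("robustness", "reasoning_fidelity")

def aggregate_rubric(ratings: list[dict]) -> dict[str, dict[str, float]]:
    """ratings: one dict per (clinician, model, case) with a 'model' key
    and a 0-5 score for each rubric axis. Returns mean score per model."""
    by_model = defaultdict(lambda: {axis: [] for axis in AXES})
    for r in ratings:
        for axis in AXES:
            by_model[r["model"]][axis].append(r[axis])
    return {m: {axis: mean(scores) for axis, scores in per_axis.items()}
            for m, per_axis in by_model.items()}

# Example: hypothetical ratings from two clinicians on two models.
ratings = [
    {"model": "model_a", "robustness": 4, "reasoning_fidelity": 2},
    {"model": "model_a", "robustness": 3, "reasoning_fidelity": 3},
    {"model": "model_b", "robustness": 1, "reasoning_fidelity": 4},
]
print(aggregate_rubric(ratings))
```

Reporting the axes separately, rather than as one leaderboard number, is what lets this kind of evaluation surface models that score well overall while failing on robustness or reasoning fidelity in isolation.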