🤖 AI Summary
This work addresses a gap in existing medical multimodal large language models: the lack of effective evaluation of clinical intent understanding from the physician’s first-person perspective. To this end, we propose MedGaze-Bench, a novel benchmark that establishes the first three-dimensional (spatial, temporal, and normative) framework for assessing clinical intent understanding grounded in real physician gaze behavior. The benchmark further incorporates a trap-question mechanism to detect model hallucinations and cognitive compliance, i.e., tendencies to fabricate observations or uncritically accept erroneous instructions. Experimental results reveal that current models rely heavily on global features and struggle to accurately capture first-person clinical intent, often generating plausible but unfounded responses or blindly following incorrect prompts. These findings underscore the value of MedGaze-Bench in exposing model limitations and advancing the development of reliable, clinically grounded AI systems.
📝 Abstract
Medical Multimodal Large Language Models (Med-MLLMs) require egocentric clinical intent understanding for real-world deployment, yet existing benchmarks fail to evaluate this critical capability. To address this gap, we introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: the visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical workflows, and implicit adherence to safety protocols. We propose a Three-Dimensional Clinical Intent Framework evaluating: (1) Spatial Intent: discriminating precise targets amid visual noise; (2) Temporal Intent: inferring causal rationale through retrospective and prospective reasoning; and (3) Standard Intent: verifying protocol compliance through safety checks. Beyond accuracy metrics, we introduce Trap QA mechanisms that stress-test clinical reliability by penalizing hallucinations and cognitive sycophancy. Experiments reveal that current MLLMs struggle with egocentric intent due to over-reliance on global features, leading to fabricated observations and uncritical acceptance of invalid instructions.