🤖 AI Summary
This work addresses a critical limitation of current medical vision-language models: they rely heavily on intermediate textual representations and lack an evidence-acquisition mechanism aligned with radiologists’ visual search behavior. To bridge this gap, the study introduces gaze tokens, supervised by temporal eye-tracking trajectories, as a novel training signal that directs the model to attend to diagnostically relevant image regions in a temporally coherent sequence mirroring human visual reasoning during diagnosis. Evaluated on the MIMIC-EYE dataset and multiple external zero-shot benchmarks, the proposed approach achieves state-of-the-art performance, significantly enhancing the model’s visual grounding, in-domain accuracy, and cross-domain robustness.
📝 Abstract
Vision-language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by introducing a small set of dedicated gaze tokens. These tokens are trained to predict gaze-selected image patch indices in temporal order, encouraging the model to follow human-like evidence acquisition and integration. Experiments on MIMIC-EYE and multiple external zero-shot benchmarks show consistent gains over baselines, achieving state-of-the-art in-domain performance and improved out-of-domain robustness. These results highlight temporally ordered gaze as an effective supervision signal for learning visually grounded medical reasoning.
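The core supervision idea, dedicated gaze tokens trained to predict gaze-selected patch indices in temporal order, can be illustrated with a minimal sketch. The code below assumes a ViT-style encoder exposing P patch embeddings and fixations already mapped to patch indices in time order; the names `GazeTokenHead` and `gaze_supervision_loss` are illustrative, not from the paper's released code.

```python
# Minimal sketch of temporally ordered gaze-token supervision.
# Assumptions (not from the paper's code): a ViT-style encoder producing
# P patch embeddings of dimension D, T learnable gaze tokens (one per
# fixation step), and gaze fixations pre-mapped to patch indices.
import torch
import torch.nn as nn

class GazeTokenHead(nn.Module):
    def __init__(self, hidden_dim: int, num_patches: int, num_gaze_tokens: int):
        super().__init__()
        # Dedicated learnable gaze tokens, one per fixation step.
        self.gaze_tokens = nn.Parameter(torch.randn(num_gaze_tokens, hidden_dim) * 0.02)
        # Cross-attention: each gaze token queries the image patch embeddings.
        # hidden_dim must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        # Classifier over the P patch indices.
        self.classifier = nn.Linear(hidden_dim, num_patches)

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (B, P, D) visual tokens from the image encoder.
        B = patch_embeds.size(0)
        queries = self.gaze_tokens.unsqueeze(0).expand(B, -1, -1)  # (B, T, D)
        attended, _ = self.attn(queries, patch_embeds, patch_embeds)
        return self.classifier(attended)  # (B, T, P) logits, one row per time step

def gaze_supervision_loss(logits: torch.Tensor, gaze_patch_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted and observed patch indices at each
    fixation step. gaze_patch_ids is (B, T); padded steps are marked -100."""
    B, T, P = logits.shape
    return nn.functional.cross_entropy(
        logits.reshape(B * T, P), gaze_patch_ids.reshape(B * T), ignore_index=-100
    )
```

Because each gaze token is tied to a specific fixation step, the ordering of the targets is what injects the temporal structure: the same cross-entropy applied to an unordered set of fixated patches would discard the search trajectory that the paper argues is the useful signal.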