A World Model of Radiologist Reading for Medical Image Representation Learning

📅 2026-05-17
🤖 AI Summary
Existing approaches do not effectively model the visual search and evidence accumulation recorded in radiologists' eye-tracking data; they typically reduce it to a static spatial prior or to an auxiliary prediction target decoupled from diagnosis. This work proposes GazeWorld, which reframes medical image interpretation as trajectory learning within a world model: the image is the environment and the gaze sequence is a trajectory through it. The model autoregressively predicts the latent representation of the next fixated patch from the previously visited ones, while a spatial-completion branch infers unvisited regions. At inference it generates a sequence of patch representations from the image alone, without real eye-tracking data. Rather than modeling only diagnostic outcomes, GazeWorld aims to capture how experts read. Frozen GazeWorld features achieve state-of-the-art accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, and the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med model by over 16% in ScanMatch and 22% in SED.
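The trajectory-learning idea summarized above can be illustrated with a toy objective. The following is a minimal sketch under stated assumptions, not the authors' implementation: the running-mean context, the linear heads `w_next` and `w_fill`, and the mean-squared-error losses are all hypothetical stand-ins for whatever encoder, predictor, and loss GazeWorld actually uses.

```python
import numpy as np

def causal_context(seq):
    """Running mean over visited latents: row t summarizes steps 0..t only."""
    counts = np.arange(1, seq.shape[0] + 1)[:, None]
    return np.cumsum(seq, axis=0) / counts

def gaze_trajectory_losses(patch_latents, fixation_order, w_next, w_fill):
    """Toy next-fixation prediction loss plus spatial-completion loss.

    patch_latents:  (N, D) latent per image patch (hypothetical encoder output)
    fixation_order: (T,) patch indices in the order they were fixated
    w_next, w_fill: (D, D) hypothetical linear prediction heads
    """
    visited = patch_latents[fixation_order]            # (T, D)
    ctx = causal_context(visited)                      # (T, D)

    # Autoregressive branch: context up to step t predicts the latent at t + 1.
    pred_next = ctx[:-1] @ w_next
    loss_next = float(np.mean((pred_next - visited[1:]) ** 2))

    # Spatial-completion branch: the full-trajectory context predicts
    # the latents of patches that were never fixated.
    unvisited_mask = np.ones(patch_latents.shape[0], dtype=bool)
    unvisited_mask[fixation_order] = False
    unvisited = patch_latents[unvisited_mask]
    if unvisited.shape[0] == 0:
        loss_fill = 0.0
    else:
        pred_fill = ctx[-1] @ w_fill                   # (D,), broadcast below
        loss_fill = float(np.mean((pred_fill - unvisited) ** 2))
    return loss_next, loss_fill
```

The running mean is chosen only because it makes the causal constraint easy to see: the prediction for step t + 1 never depends on latents visited after step t. A real model would presumably replace it with a learned causal sequence model.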
📝 Abstract
Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.
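For readers unfamiliar with scanpath metrics, the sketch below shows one common way an edit-distance score between two gaze sequences can be computed. It assumes that SED denotes a sequence (string) edit distance over scanpaths, and that fixations are discretized on a regular grid; the abstract states neither detail, and the helper names and the 4x4 grid are illustrative choices, not the GazeSearch protocol.

```python
def fixations_to_cells(fixations, width, height, grid=4):
    """Map (x, y) pixel fixations to cell ids on a grid x grid layout (assumed)."""
    cells = []
    for x, y in fixations:
        col = min(int(x / width * grid), grid - 1)
        row = min(int(y / height * grid), grid - 1)
        cells.append(row * grid + col)
    return cells

def sequence_edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Under this reading, a lower distance means the predicted scanpath is closer to the radiologist's, so an improvement in SED corresponds to a reduction in the score.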
Problem

Research questions and friction points this paper is trying to address.

radiologist eye-tracking
medical image representation learning
expert reading behavior
gaze modeling
world model
Innovation

Methods, ideas, or system contributions that make the work stand out.

world model
gaze modeling
medical image representation learning
autoregressive prediction
zero-shot diagnosis