🤖 AI Summary
This work addresses the limitations of existing medical vision–language pretraining models, which fail to emulate radiologists' diagnostic workflow by neglecting the critical roles of clinical context and gaze behavior in visual reasoning, leading to inadequate disease modeling and weak cross-modal alignment. To overcome these challenges, the authors propose CoGaze, a framework that, for the first time, integrates clinical context into the vision encoder and leverages radiologists' eye-tracking gaze as probabilistic priors during pretraining. CoGaze combines a context-infused vision encoder with a multi-level supervision paradigm comprising hybrid-positive contrastive learning, disease-aware cross-modal representation learning, and gaze-guided attention. Experiments demonstrate that CoGaze significantly outperforms state-of-the-art methods across multiple tasks: structured report generation (CheXbert F1 +2.0%), free-text report generation (BLEU-2 +1.2%), zero-shot classification (AUROC +23.2%), and image–text retrieval (Precision@1 +12.2%).
📝 Abstract
Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze -- a crucial cue for visual reasoning -- remains largely underexplored. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context -- including patient history, symptoms, and diagnostic intent -- to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbert F1 and +1.2% BLEU-2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.
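To make the gaze supervision idea concrete, here is a minimal sketch of how eye-tracking data could serve as a probabilistic prior over image regions: the gaze heatmap is normalized into a probability distribution, and a KL-divergence term pulls the model's spatial attention toward it. This is an illustrative assumption on our part, not the paper's actual implementation -- the function name, the KL direction, and the flattened-patch representation are all hypothetical.

```python
import numpy as np

def gaze_guided_attention_loss(attn_logits, gaze_heatmap, eps=1e-8):
    """Hypothetical gaze-guidance loss: KL(gaze || attention).

    attn_logits  : (N,) unnormalized attention scores over N image patches
    gaze_heatmap : (N,) nonnegative gaze density per patch
                   (e.g., a fixation map smoothed and flattened)
    Returns a scalar that is ~0 when attention matches the gaze prior.
    """
    # Softmax over patches -> the model's attention distribution.
    a = np.exp(attn_logits - attn_logits.max())
    a = a / a.sum()
    # Normalize the gaze heatmap into the probabilistic prior.
    g = gaze_heatmap / (gaze_heatmap.sum() + eps)
    # KL divergence KL(g || a); penalizes attention mass
    # placed away from regions radiologists actually fixated on.
    return float(np.sum(g * (np.log(g + eps) - np.log(a + eps))))
```

In a training loop, this term would be added to the contrastive and disease-aware objectives; when attention already matches the gaze distribution the penalty vanishes, so diagnostically salient regions are encouraged rather than hard-constrained.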