RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze

📅 2025-07-11
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This study addresses the lack of clinical prior knowledge in Large Vision-Language Models (LVLMs) for chest X-ray (CXR) analysis. Methodologically, it models radiologists' eye-tracking scanpaths as temporally ordered "gaze videos" and feeds them, without any domain-specific fine-tuning, into general-purpose LVLMs that support video input (e.g., LLaVA-OneVision). Rendering real clinical reading sequences as video aligns gaze dynamics, the medical image, and the textual report in space and time, preserving both the sequential order and the spatial focus of expert visual reasoning. The key contribution is the first use of human gaze sequences as structured video priors for generic LVLMs, implicitly infusing clinical visual-reasoning patterns. Experiments show substantial improvements: up to a 24.6% gain on key metrics for CXR report generation, and an average 15.2% improvement across the two primary tasks. Notably, the enhanced generalist model surpasses state-of-the-art medical-specialized models, including MAIRA-2 and CheXagent, for the first time.
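The summary does not specify how the paper renders scanpaths into gaze videos; as a minimal sketch of the idea, assuming fixations arrive as (x, y, duration) triples in temporal order, one frame can be emitted per fixation, with earlier fixations fading so each frame encodes both where experts looked and in what order. The function name, the circular-highlight rendering, and the `radius`/`fade` parameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fixations_to_gaze_video(image, fixations, radius=40, fade=0.6):
    """Render a fixation scanpath as a frame sequence (a 'gaze video').

    image     : (H, W) grayscale CXR, float values in [0, 1]
    fixations : list of (x, y, duration_ms) in temporal order
    radius    : pixel radius of the gaze highlight (illustrative choice)
    fade      : per-step decay of earlier fixations, so each frame keeps
                a dimming trace of the preceding gaze path
    Returns an array of shape (T, H, W), one frame per fixation.
    """
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w))
    frames = []
    for x, y, dur in fixations:
        heat *= fade                           # older fixations dim over time
        mask = (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2
        heat[mask] += dur / 1000.0             # weight by dwell time (seconds)
        frame = np.clip(image + heat / heat.max(), 0.0, 1.0)
        frames.append(frame)
    return np.stack(frames)
```

The resulting (T, H, W) stack can then be passed to any video-capable LVLM as an ordinary video input, which is what lets the approach work without fine-tuning.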

📝 Abstract
Large Vision-Language Models (LVLMs) have demonstrated promising performance in chest X-ray (CXR) analysis. To enhance human-computer interaction, several studies have incorporated radiologists' eye gaze, typically through heatmaps or textual prompts. However, these methods often overlook the sequential order of eye movements, which could provide valuable insights by highlighting both the areas of interest and the order in which they are examined. In this work, we propose a novel approach called RadEyeVideo that integrates radiologists' eye-fixation data as a video sequence, capturing both the temporal and spatial dynamics of their gaze. We evaluate this method on CXR report generation and disease diagnosis using three general-domain, open-source LVLMs with video input capabilities. When prompted with eye-gaze videos, model performance improves by up to 24.6% on the report generation task and by 15.2% on average across both tasks using scaled evaluation metrics. Notably, RadEyeVideo enhanced an open-domain LVLM, LLaVA-OneVision, to surpass task-specific medical LVLMs such as MAIRA-2 and CheXagent, which are trained on large chest X-ray datasets. This work highlights that domain experts' knowledge (here, eye-gaze information), when effectively integrated with LVLMs, can significantly enhance general-domain models' capabilities in clinical tasks. RadEyeVideo is a step toward a scalable, human-centered approach to applying LVLMs in medical image analytics.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LVLMs for CXR analysis using eye-gaze video sequences
Capturing temporal and spatial dynamics of radiologists' eye movements
Improving report generation and disease diagnosis with gaze data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates eye-gaze data as video sequences
Enhances LVLMs with temporal-spatial gaze dynamics
Boosts performance in CXR analysis tasks