GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays

📅 2026-04-13

🤖 AI Summary
This study addresses the lack of a multi-observer eye-tracking benchmark for evaluating the clinical realism of AI-generated chest X-rays, a gap that has hindered quantitative assessment of how closely human experts and AI systems agree in image perception and authenticity judgment. To bridge this gap, the authors construct a multimodal benchmark dataset comprising 960 eye-tracking records (fixations, scanpaths, and saliency maps) from 16 radiologists who diagnosed and judged the authenticity of 60 chest X-rays (30 real, 30 generated by a diffusion model), along with structured diagnostic labels. Predictions from six state-of-the-art multimodal large language models on the same tasks are also collected. The benchmark integrates human eye-movement behavior, clinical decisions, and AI outputs, enabling systematic comparison of human–AI alignment in diagnostic accuracy, authenticity detection, and uncertainty quantification, and thereby extends the visual Turing test to clinical imaging.
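As a rough illustration of what comparing observers at the decision and uncertainty levels could look like, the sketch below scores real-versus-synthetic judgments with accuracy and the Brier score. This is not the authors' evaluation code: the label convention (1 = synthetic, 0 = real), the function names, and the toy values are all assumptions made for illustration, and the Brier score is only one common way to assess confidence quality.

```python
def accuracy(y_true, y_pred):
    """Fraction of images whose predicted authenticity label matches the truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def brier_score(y_true, p_synthetic):
    """Mean squared error between predicted P(synthetic) and the 0/1 truth.

    Lower is better; always answering 0.5 scores 0.25.
    """
    return sum((p - t) ** 2 for t, p in zip(y_true, p_synthetic)) / len(y_true)


if __name__ == "__main__":
    # Hypothetical toy values, NOT taken from the dataset.
    truth = [1, 0, 1, 0]               # assumed convention: 1 = synthetic, 0 = real
    confidence = [0.9, 0.2, 0.4, 0.1]  # an observer's stated P(synthetic)
    labels = [int(p >= 0.5) for p in confidence]
    print("accuracy:", accuracy(truth, labels))
    print("Brier score:", brier_score(truth, confidence))
```

The same two functions can be applied unchanged to a radiologist's judgments and to a model's outputs, which is what makes a matched-condition comparison straightforward once both are expressed as labels plus confidences.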

📝 Abstract
We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion-based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density maps, structured diagnostic labels, and authenticity judgments. We extend the protocol to 6 state-of-the-art multimodal LLMs, releasing their predicted diagnoses, authenticity labels, and confidence scores under matched conditions, enabling direct human-AI comparison at both decision and uncertainty levels. We further provide analyses of gaze agreement, inter-observer consistency, and benchmarking of radiologists versus LLMs in diagnostic accuracy and authenticity detection. GazeVaLM supports research in gaze modeling, clinical decision-making, human-AI comparison, generative image realism assessment, and uncertainty quantification. By jointly releasing visual attention data, clinical labels, and model predictions, we aim to facilitate reproducible research on how experts and AI systems perceive, interpret, and evaluate medical images. The dataset is available at https://huggingface.co/datasets/davidcwong/GazeVaLM.
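To make the notions of a saliency density map and gaze agreement concrete, here is a minimal sketch that builds a density map by placing a duration-weighted Gaussian at each fixation and then compares two observers' maps with the Pearson correlation coefficient. The abstract does not state how the released maps or agreement scores were computed, so the `(x, y, duration)` fixation format, the grid size, the `sigma` value, and the choice of correlation as the agreement measure are all assumptions.

```python
import math


def fixation_density_map(fixations, width, height, sigma=2.0):
    """Sum a duration-weighted Gaussian at each (x, y, duration) fixation.

    Returns a height x width grid normalised to sum to 1.
    """
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy, duration in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += duration * math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in grid)
    if total == 0:
        return grid  # no fixations: leave the map empty
    return [[v / total for v in row] for row in grid]


def pearson_cc(map_a, map_b):
    """Pearson correlation between two equally sized maps (1.0 = identical shape)."""
    a = [v for row in map_a for v in row]
    b = [v for row in map_b for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant map
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    # Hypothetical fixations for two observers on a 32 x 32 grid.
    observer_1 = [(10, 12, 0.30), (20, 18, 0.25)]
    observer_2 = [(11, 12, 0.28), (21, 17, 0.20)]
    map_1 = fixation_density_map(observer_1, 32, 32)
    map_2 = fixation_density_map(observer_2, 32, 32)
    print("agreement (CC):", pearson_cc(map_1, map_2))
```

Averaging this pairwise score over all observer pairs for one image gives one simple inter-observer consistency figure; other saliency metrics could be substituted without changing the structure.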
Problem

Research questions and friction points this paper is trying to address.

clinical realism
AI-generated X-rays
eye-tracking
Visual Turing test
medical image authenticity
Innovation

Methods, ideas, or system contributions that make the work stand out.

eye-tracking
multimodal LLMs
generative AI
clinical realism
Visual Turing test
David Wong
Northwestern University
Zeynep Isik
Northwestern University
Bin Wang
Northwestern University
Human-Centered AI
Marouane Tliba
Université Sorbonne Paris Nord
Gorkem Durak
Northwestern University, Department of Radiology
radiology, artificial intelligence
Elif Keles
Northwestern University
pediatrics, neuroscience, neonatology, artificial intelligence, radiology
Halil Ertugrul Aktas
Department of Radiology, Northwestern University
Radiology, MRI, Artificial Intelligence
Aladine Chetouani
Institut Galilée - L2TI - Multimedia Team
Image Quality Assessment, Video Analysis, Deep Learning, Pattern Recognition
Cagdas Topel
Northwestern University, Feinberg School of Medicine, Department of Radiology
Radiology
Nicolo Gennaro
Northwestern University
Camila Lopes Vendrami
Northwestern University
Tugce Agirlar Trabzonlu
Northwestern University
Amir Ali Rahsepar
Northwestern University
Cardiothoracic Imaging
Laetitia Perronne
Northwestern University
Matthew Antalek
Northwestern University
Onural Ozturk
Northwestern University
Gokcan Okur
Loyola University Chicago
Andrew C. Gordon
Northwestern University
Ayis Pyrros
Neuroradiology, DuPage Medical Group
Radiology, machine learning
Frank H. Miller
Northwestern University
Amir Borhani
Associate Professor of Radiology, Northwestern University Feinberg School of Medicine
Abdominal Imaging, Liver and Pancreaticobiliary Imaging
Hatice Savas
Northwestern University
Eric Hart
Northwestern University
Elizabeth Krupinski
Emory University
Ulas Bagci
Northwestern University
artificial intelligence, deep learning, biomedical image analysis, medical image computing