ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

📅 2025-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing chest X-ray visual question answering (VQA) benchmarks suffer from limited clinical diversity and insufficient coverage of diagnostic reasoning dimensions. Method: We introduce ReXVQA—the largest and most comprehensive chest X-ray VQA benchmark to date—comprising 696K questions across 160K radiographic studies, systematically covering five core radiological competencies: existence detection, spatial localization, negation recognition, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models (MLLMs)—including MedGemma-4B-it and Qwen2.5-VL—with structured rationale generation and fine-grained, category-level evaluation metrics. Results: MedGemma achieves an overall accuracy of 83.24%, and on a 200-case reader study it reaches 83.84%, surpassing the best radiology resident (77.27%). ReXVQA’s dataset, public leaderboard, and category-wise analytical tools are fully open-sourced to advance robust, clinically grounded medical VQA research.

📝 Abstract
We present ReXVQA, the largest and most comprehensive benchmark for visual question answering (VQA) in chest radiology, comprising approximately 696,000 questions paired with 160,000 chest X-ray studies across training, validation, and test sets. Unlike prior efforts that rely heavily on template-based queries, ReXVQA introduces a diverse and clinically authentic task suite reflecting five core radiological reasoning skills: presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models, including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge the gap between AI performance and clinical expertise, we conducted a comprehensive human reader study involving 3 radiology residents on 200 randomly sampled cases. Our evaluation demonstrates that MedGemma achieved superior performance (83.84% accuracy) compared to human readers (best radiology resident: 77.27%), representing a significant milestone where AI performance exceeds expert human evaluation on chest X-ray interpretation. The reader study reveals distinct performance patterns between AI models and human experts, with strong inter-reader agreement among radiologists but more variable agreement between human readers and AI models. ReXVQA establishes a new standard for evaluating generalist radiological AI systems, offering public leaderboards, fine-grained evaluation splits, structured explanations, and category-level breakdowns. This benchmark lays the foundation for next-generation AI systems capable of mimicking expert-level clinical reasoning beyond narrow pathology classification. Our dataset will be open-sourced at https://huggingface.co/datasets/rajpurkarlab/ReXVQA
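The abstract highlights fine-grained evaluation splits and category-level breakdowns across the five reasoning skills. A minimal sketch of how such per-category accuracies might be computed from prediction records; the record field names (`category`, `prediction`, `answer`) are assumptions for illustration, not the dataset's actual schema:

```python
from collections import defaultdict

def category_accuracy(records):
    """Compute overall and per-category accuracy from prediction records.

    Each record is a dict with hypothetical keys 'category',
    'prediction', and 'answer' (assumed names, not ReXVQA's schema).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        correct[r["category"]] += int(r["prediction"] == r["answer"])
    per_cat = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_cat

# Toy records spanning two of the five ReXVQA skill categories
records = [
    {"category": "presence", "prediction": "A", "answer": "A"},
    {"category": "presence", "prediction": "B", "answer": "A"},
    {"category": "negation", "prediction": "C", "answer": "C"},
    {"category": "negation", "prediction": "C", "answer": "C"},
]
overall, per_cat = category_accuracy(records)
print(overall)              # 0.75
print(per_cat["presence"])  # 0.5
```

A breakdown like this is what lets a leaderboard report, for example, that a model is strong on presence assessment but weak on negation detection, rather than a single aggregate score.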
Problem

Research questions and friction points this paper is trying to address.

Existing chest X-ray VQA benchmarks offer limited clinical diversity and rely heavily on template-based queries
Prior evaluations cover few of the diagnostic reasoning skills radiologists actually use
The gap between AI performance and expert human interpretation of chest X-rays has not been rigorously quantified
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest chest X-ray VQA benchmark to date: 696K questions over 160K studies spanning five radiological reasoning skills
Evaluation of eight state-of-the-art multimodal large language models with structured rationales
MedGemma (83.84%) surpassed the best radiology resident (77.27%) in a 200-case reader study
👥 Authors
Ankit Pal — Saama AI Research
Jung-Oh Lee — Seoul National University
Xiaoman Zhang — Harvard University
Malaikannan Sankarasubbu — Saama AI Research
Seunghyeon Roh — Seoul National University
Won Jung Kim — Seoul National University
Meesun Lee — Seoul National University
Pranav Rajpurkar — Harvard Medical School

Topics: AI for Medicine · Medical Image Analysis