VinDr-CXR-VQA: A Visual Question Answering Dataset for Explainable Chest X-Ray Analysis with Multi-Task Learning

📅 2025-11-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical visual question answering (Med-VQA) suffers from limited interpretability, poor spatial localization, and insufficient coverage of clinical intent. Method: We introduce VinDr-CXR-VQA, the first explainable, spatially grounded Med-VQA dataset for chest X-ray analysis, comprising 4,394 images and 17,597 QA pairs, each annotated by radiologists with lesion bounding boxes and structured clinical reasoning text. We propose a spatially aware six-category diagnostic question framework that explicitly models lesion localization and multi-step clinical reasoning to mitigate model hallucination. Contribution/Results: Using VinDr-CXR-VQA, we perform joint localization-reasoning training and benchmarking with large multimodal models (e.g., MedGemma-4B-it), achieving an F1 score of 0.624 (11.8% higher than the baseline) while substantially improving lesion localization accuracy and the clinical interpretability of answers.
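
The headline number above is a standard F1 score. As an illustration only, here is a minimal sketch of how such a score is computed over binary finding-present/absent answers; the paper's exact matching and per-question-type aggregation protocol is not described on this page, and the toy labels below are not dataset values.

```python
# Minimal sketch: F1 over binary (finding present / absent) answers.
# Illustrative only; the paper's actual evaluation protocol (answer
# matching rules, per-question-type aggregation) is not specified here.
from sklearn.metrics import f1_score

# 1 = finding present, 0 = normal study; toy labels for demonstration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"F1 = {f1_score(y_true, y_pred):.3f}")
```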

📝 Abstract
We present VinDr-CXR-VQA, a large-scale chest X-ray dataset for explainable Medical Visual Question Answering (Med-VQA) with spatial grounding. The dataset contains 17,597 question-answer pairs across 4,394 images, each annotated with radiologist-verified bounding boxes and clinical reasoning explanations. Our question taxonomy spans six diagnostic types (Where, What, Is there, How many, Which, and Yes/No), capturing diverse clinical intents. To improve reliability, we construct a balanced distribution of 41.7% positive and 58.3% negative samples, mitigating hallucinations in normal cases. Benchmarking with MedGemma-4B-it demonstrates improved performance (F1 = 0.624, +11.8% over baseline) while enabling lesion localization. VinDr-CXR-VQA aims to advance reproducible and clinically grounded Med-VQA research. The dataset and evaluation tools are publicly available at huggingface.co/datasets/Dangindev/VinDR-CXR-VQA.
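
Since the dataset is hosted on the Hugging Face Hub, it can be loaded with the `datasets` library. The split and column names below (`train`, `question`, `answer`, `question_type`, `bbox`) are assumptions about the schema, not confirmed by this page; consult the dataset card for the actual layout.

```python
# Minimal sketch: load VinDr-CXR-VQA from the Hugging Face Hub.
# Requires `pip install datasets`. Split and column names are
# hypothetical; check the dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset("Dangindev/VinDR-CXR-VQA")

sample = ds["train"][0]             # "train" split name is assumed
print(sample.get("question"))       # e.g. a "Is there ...?" question
print(sample.get("answer"))         # radiologist-verified answer text
print(sample.get("question_type"))  # one of the six diagnostic types
print(sample.get("bbox"))           # lesion bounding box(es), if present
```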
Problem

Research questions and friction points this paper is trying to address.

Develops explainable medical VQA for chest X-ray analysis
Addresses dataset imbalance to reduce AI hallucinations
Enables lesion localization via a multi-task learning approach
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task learning for chest X-ray analysis
Spatial grounding with radiologist-verified bounding boxes (see the IoU sketch after this list)
Balanced dataset distribution to mitigate hallucinations
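
Spatial grounding against radiologist-verified boxes is typically scored with intersection-over-union (IoU) between a predicted and an annotated lesion box. A minimal sketch follows, assuming (x_min, y_min, x_max, y_max) pixel coordinates; the dataset's actual box convention may differ.

```python
# Minimal sketch: IoU between a predicted and a radiologist-annotated
# lesion box, the usual way spatial grounding is scored. The
# (x_min, y_min, x_max, y_max) format is an assumption; the dataset
# may use a different box convention.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero area if the boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

pred = (120.0, 80.0, 260.0, 210.0)  # model-predicted lesion box (toy values)
gt   = (130.0, 90.0, 250.0, 220.0)  # radiologist-verified box (toy values)
print(f"IoU = {iou(pred, gt):.3f}") # localization is usually counted
                                    # correct when IoU >= a threshold
```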