🤖 AI Summary
Existing medical visual question answering (VQA) datasets suffer from limited scale, narrow modality coverage (primarily X-ray or biomedical illustrations), and pervasive text-based shortcut biases. To address these limitations, we introduce RadImageNet-VQA, the first large-scale, expert-annotated VQA benchmark for CT and MRI scans, comprising 750K images and 7.5M diverse QA pairs spanning three core tasks: abnormality detection, anatomical identification, and fine-grained pathology classification. Our methodology features collaborative radiologist annotation, multi-round consistency validation, and adversarial linguistic analysis to rigorously eliminate textual shortcuts. RadImageNet-VQA supports open-ended generation, closed-ended answers, and multiple-choice questions, covering eight anatomical regions and 97 pathology classes. Empirical evaluation reveals that state-of-the-art vision-language models exhibit substantial performance deficits in open-set pathology recognition; a text-only ablation confirms that models must rely on visual input, with negligible language-only bias.
📝 Abstract
In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks (abnormality detection, anatomy recognition, and pathology identification) spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further shows that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.