3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Med-VQA research predominantly focuses on 2D medical images and isolated tasks, limiting its utility for 3D clinical diagnostic decision-making. To bridge this gap, we introduce 3D-RAD, the first large-scale, radiology-specific 3D Med-VQA benchmark, built on CT volumes and covering six diagnostic tasks: anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, and longitudinal temporal diagnosis. We propose a novel dual-modal temporal diagnosis paradigm that integrates static and longitudinal reasoning, and we formally define multi-stage inference challenges. Additionally, we release 3D-RAD-T, a high-quality, expert-annotated subset of 136K question-answer pairs aligned with clinical expertise. Our work unifies 3D medical image analysis, vision-language modeling, temporal reasoning, and expert-informed VQA construction. Extensive experiments reveal that state-of-the-art medical vision-language models exhibit severe generalization deficits on multi-phase tasks; however, fine-tuning yields substantial performance gains. The dataset and code are publicly available.

📝 Abstract
Medical Visual Question Answering (Med-VQA) holds significant potential for clinical decision support, yet existing efforts primarily focus on 2D imaging with limited task diversity. This paper presents 3D-RAD, a large-scale dataset designed to advance 3D Med-VQA using radiology CT scans. The 3D-RAD dataset encompasses six diverse VQA tasks: anomaly detection, image observation, medical computation, existence detection, static temporal diagnosis, and longitudinal temporal diagnosis. It supports both open- and closed-ended questions while introducing complex reasoning challenges, including computational tasks and multi-stage temporal analysis, to enable comprehensive benchmarking. Extensive evaluations demonstrate that existing vision-language models (VLMs), especially medical VLMs, exhibit limited generalization, particularly in multi-temporal tasks, underscoring the challenges of real-world 3D diagnostic reasoning. To drive future advancements, we release a high-quality training set, 3D-RAD-T, of 136,195 expert-aligned samples, showing that fine-tuning on this dataset can significantly enhance model performance. Our dataset and code are publicly available at https://github.com/Tang-xiaoxiao/M3D-RAD, aiming to catalyze multimodal medical AI research and establish a robust foundation for 3D medical visual understanding.
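The abstract describes a benchmark mixing open- and closed-ended questions across six task types. As a rough illustration of how such a benchmark might be consumed, the sketch below defines a hypothetical QA record and an exact-match accuracy metric over closed-ended questions. The field names (`volume_path`, `task`, `question_type`) and the record layout are illustrative assumptions, not the actual 3D-RAD schema; consult the GitHub repository for the real format.

```python
# Hypothetical sketch only: the record fields below are assumptions,
# not the published 3D-RAD schema.
from dataclasses import dataclass

@dataclass
class VQASample:
    volume_path: str      # path to a CT volume (e.g. a NIfTI file)
    task: str             # one of the six task types
    question: str
    answer: str           # reference answer
    question_type: str    # "open" or "closed"

def closed_ended_accuracy(samples, predictions):
    """Case-insensitive exact-match accuracy over closed-ended questions only."""
    pairs = [(s, p) for s, p in zip(samples, predictions)
             if s.question_type == "closed"]
    if not pairs:
        return 0.0
    correct = sum(s.answer.strip().lower() == p.strip().lower()
                  for s, p in pairs)
    return correct / len(pairs)

samples = [
    VQASample("ct_001.nii.gz", "existence detection",
              "Is a pleural effusion present?", "yes", "closed"),
    VQASample("ct_002.nii.gz", "anomaly detection",
              "Describe any abnormality.", "nodule in right lobe", "open"),
]
print(closed_ended_accuracy(samples, ["Yes", "mass"]))  # → 1.0
```

Open-ended answers would instead be scored with text-overlap metrics (e.g. BLEU or ROUGE), since exact match is too strict for free-form responses.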
Problem

Research questions and friction points this paper is trying to address.

Addressing limited task diversity in 2D Med-VQA with 3D CT scans
Enhancing model generalization for multi-temporal diagnostic reasoning tasks
Providing a comprehensive dataset to advance 3D medical visual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale 3D CT dataset for Med-VQA
Supports multi-temporal and diverse diagnostic tasks
Fine-tuning enhances model performance significantly