DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

📅 2026-05-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical visual question answering (VQA) benchmarks report only overall accuracy, so they cannot pinpoint where multi-stage tumor diagnosis reasoning fails. This work introduces DeepTumorVQA, a hierarchical 3D CT VQA benchmark that decouples the diagnostic process into four independently scorable stages: recognition, measurement, visual reasoning, and medical reasoning. The benchmark follows the clinical evidence chain of tumor diagnosis and supports both direct reasoning by VLMs and tool-augmented agent evaluation, using tool-interaction environments and ground-truth step-by-step tool-use traces. It comprises 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. An evaluation of over 30 model configurations finds that reliable quantitative measurement is the primary bottleneck, that tool augmentation substantially mitigates it, and that supervising agents with the ground-truth tool-use traces reduces tool-use and reasoning failures.
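To make the notion of a measurement tool concrete, the following is a minimal sketch of one kind of quantity such a tool might compute: lesion volume from a binary segmentation mask and the scan's voxel spacing. The function name, the nested-list mask format, and the interface are assumptions made for illustration; they are not taken from the DeepTumorVQA code.

```python
def lesion_volume_mm3(mask, spacing_mm):
    """Volume of a segmented region in cubic millimetres.

    mask: nested lists indexed [z][y][x] holding 0 or 1 (hypothetical format).
    spacing_mm: (dz, dy, dx) physical voxel size in millimetres.
    """
    dz, dy, dx = spacing_mm
    # Count foreground voxels, then scale by the physical volume of one voxel.
    voxel_count = sum(v for plane in mask for row in plane for v in row)
    return voxel_count * dz * dy * dx


# Toy 2x2x2 mask with three foreground voxels.
toy_mask = [
    [[1, 0], [0, 0]],
    [[1, 1], [0, 0]],
]
toy_volume = lesion_volume_mm3(toy_mask, (2.0, 0.5, 0.5))  # 3 voxels * 0.5 mm^3 = 1.5
```

An agent that delegates this arithmetic to a program only has to pick the right mask and read the result, which may be part of why tool access helps on measurement questions.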
📝 Abstract
Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.
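The contrast between a single accuracy score and stage-wise scoring can be sketched as follows. This is an illustrative example, not the benchmark's evaluation code: the record format (a stage label paired with a correctness flag) and the function names are assumptions, and the toy data is made up.

```python
from collections import defaultdict

# The four stages named in the abstract.
STAGES = ("recognition", "measurement", "visual reasoning", "medical reasoning")


def overall_accuracy(records):
    """Single collapsed score over all (stage, is_correct) records."""
    records = list(records)
    return sum(int(ok) for _, ok in records) / len(records)


def stage_accuracy(records):
    """Accuracy computed separately for each stage that has questions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for stage, ok in records:
        total[stage] += 1
        correct[stage] += int(ok)
    return {s: correct[s] / total[s] for s in STAGES if total[s]}


# Made-up results for a hypothetical model: two questions per stage.
toy = [
    ("recognition", True), ("recognition", True),
    ("measurement", False), ("measurement", False),
    ("visual reasoning", True), ("visual reasoning", False),
    ("medical reasoning", True), ("medical reasoning", False),
]

overall = overall_accuracy(toy)   # 0.5
per_stage = stage_accuracy(toy)   # recognition 1.0, measurement 0.0, others 0.5
```

On this toy data the overall score is 0.5, which hides the fact that every measurement question was answered incorrectly while every recognition question was answered correctly.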
Problem

Research questions and friction points this paper is trying to address.

medical VQA
stage-wise evaluation
3D CT reasoning
tumor diagnosis
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical benchmark
stage-wise evaluation
tool-augmented agents
3D medical VQA
evidence chain reasoning
Yixiong Chen
Johns Hopkins University
Vision Language Models, Computer Vision, Medical Image Analysis
Wenjie Xiao
Johns Hopkins University
Pedro R. A. S. Bassi
Johns Hopkins University, University of Bologna, Center for Biomolecular Nanotechnologies, Istituto Italiano di Tecnologia
Boyan Wang
The First Affiliated Hospital, Sun Yat-Sen University
Liang He
Tongji University
Xinze Zhou
Johns Hopkins University
Sezgin Er
Istanbul Medipol University
Ibrahim Ethem Hamamci
MD-PhD Student at University of Zurich | ETH AI Center
Medical Image Analysis, Machine Learning
Zongwei Zhou
Assistant Research Professor, Johns Hopkins University
Medical Image Analysis, Biomedical Informatics, Imaging Informatics, Computer-aided Diagnosis
Alan Yuille
Professor of Cognitive Science and Computer Science, Johns Hopkins University
Computer Vision, Computational Models of Mind and Brain, Machine Learning