🤖 AI Summary
This work addresses a limitation of existing medical visual question answering (VQA) benchmarks: they report only overall accuracy and therefore cannot pinpoint specific failure modes in multi-stage tumor diagnosis reasoning. To this end, we introduce DeepTumorVQA, the first hierarchical 3D CT VQA benchmark that decouples the diagnostic process into four independently evaluable stages: recognition, measurement, visual reasoning, and medical reasoning. The benchmark incorporates a clinically grounded, evidence-chain-based staged evaluation framework, along with ground-truth tool-use trajectories and an interactive environment that support both direct reasoning and tool-augmented agent evaluation. Built on 9,262 3D CT scans yielding 476K questions, our assessment of over 30 model configurations shows that quantitative measurement is the primary performance bottleneck, that tool augmentation substantially improves accuracy, and that supervising agents with ground-truth tool-use trajectories reduces tool-use and reasoning errors.
📝 Abstract
Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, in which a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. Once tools are available, the new challenge becomes using medical knowledge and tools together to reason about medical images. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.
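To make the contrast between a single accuracy score and stage-wise scoring concrete, the following is a minimal sketch of how per-stage accuracy could be computed alongside the overall figure. It is not the benchmark's actual evaluation code: the record keys `stage` and `correct`, the stage identifiers, and the toy records are all assumptions made for illustration, and the numbers are invented rather than reported results.

```python
from collections import defaultdict

# Stage names follow the four stages named in the abstract;
# the exact string identifiers are hypothetical.
STAGES = ("recognition", "measurement", "visual_reasoning", "medical_reasoning")


def stagewise_accuracy(records):
    """Return (per-stage accuracy dict, overall accuracy).

    `records` is an iterable of dicts with two assumed keys:
    'stage' (one of STAGES) and 'correct' (bool).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        stage = record["stage"]
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage!r}")
        totals[stage] += 1
        hits[stage] += int(record["correct"])

    n_questions = sum(totals.values())
    overall = sum(hits.values()) / n_questions if n_questions else 0.0
    # Only report stages that actually have questions.
    per_stage = {
        stage: hits[stage] / totals[stage]
        for stage in STAGES
        if totals.get(stage, 0) > 0
    }
    return per_stage, overall


# Toy records, invented purely to exercise the function.
toy_records = [
    {"stage": "recognition", "correct": True},
    {"stage": "recognition", "correct": True},
    {"stage": "measurement", "correct": True},
    {"stage": "measurement", "correct": False},
    {"stage": "measurement", "correct": False},
    {"stage": "visual_reasoning", "correct": True},
    {"stage": "visual_reasoning", "correct": False},
    {"stage": "medical_reasoning", "correct": True},
]

per_stage, overall = stagewise_accuracy(toy_records)
for stage, accuracy in per_stage.items():
    print(f"{stage}: {accuracy:.3f}")
print(f"overall: {overall:.3f}")
```

In this toy data the overall score hides the fact that the measurement stage is far weaker than recognition, which is the kind of failure mode a stage-wise breakdown is meant to expose.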