V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Med-VQA models rely predominantly on global image features, lacking lesion-localization capability and neglecting interpretable diagnostic reasoning pathways. To address this, we propose a Vision-to-Text Chain-of-Thought (V2T-CoT) framework that jointly models region-level pixel attention and textual reasoning generation, enabling simultaneous lesion localization and diagnostic-logic learning. We fine-tune vision-language models on our newly constructed R-Med 39K dataset and achieve state-of-the-art performance across four Med-VQA benchmarks. Our key contribution is the first deep integration of visual grounding into the chain-of-thought generation process, explicitly aligning visual regions with diagnostic rationales, which significantly improves both diagnostic accuracy and the clinical interpretability of reasoning paths. This establishes a novel paradigm for AI-assisted medical decision-making that jointly optimizes precision and transparency.

📝 Abstract
Recent advances in multimodal techniques have led to significant progress in Medical Visual Question Answering (Med-VQA). However, most existing models focus on global image features rather than localizing the disease-specific regions crucial for diagnosis. Additionally, current research tends to emphasize answer accuracy at the expense of the reasoning pathway, yet both are crucial for clinical decision-making. To address these challenges, we propose From Vision to Text Chain-of-Thought (V2T-CoT), a novel approach that automates the localization of preference areas within biomedical images and incorporates this localization into region-level pixel attention as knowledge for Vision CoT. By fine-tuning the vision language model on the constructed R-Med 39K dataset, V2T-CoT provides definitive medical reasoning paths. V2T-CoT integrates visual grounding with textual rationale generation to establish precise and explainable diagnostic results. Experimental results across four Med-VQA benchmarks demonstrate state-of-the-art performance, achieving substantial improvements in both accuracy and interpretability.
Problem

Research questions and friction points this paper is trying to address.

Focuses on disease-specific region localization in medical images
Integrates visual and textual reasoning for clinical decision-making
Improves accuracy and interpretability in medical diagnosis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automates disease-specific region localization in images
Integrates region-level pixel attention for Vision CoT
Combines visual grounding with textual rationale generation
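The "region-level pixel attention" idea above can be sketched as masked attention pooling over patch embeddings: patches inside a predicted lesion region dominate the visual feature that conditions rationale generation. This is a minimal illustrative sketch only; the function name, masking scheme, and tensor shapes are assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_attention_pool(patch_feats, region_mask, query):
    """Attend over image patches, biased toward a localized lesion region.

    patch_feats: (P, D) patch embeddings from a vision encoder (hypothetical shapes)
    region_mask: (P,) 1.0 for patches inside the predicted lesion region, else 0.0
    query:       (D,) question embedding
    Returns a (D,) region-conditioned visual feature.
    """
    # Scaled dot-product scores between the question and each patch
    scores = patch_feats @ query / np.sqrt(patch_feats.shape[1])
    # Suppress patches outside the lesion region with a large negative bias
    scores = np.where(region_mask > 0, scores, -1e9)
    weights = softmax(scores)
    # Convex combination of the in-region patch features
    return weights @ patch_feats
```

The resulting feature could then be fed to the text decoder so that the generated chain-of-thought is grounded in the localized region rather than the whole image.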
👥 Authors
Yuan Wang
Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence, Zhejiang University, Hangzhou, China
Jiaxiang Liu
Zhejiang University
Multimodal Fusion, Medical Image Analysis
Shujian Gao
Academy for Engineering and Technology, Fudan University, Shanghai, China
Bin Feng
Department of Oral and Maxillofacial Radiology, Stomatology Hospital, School of Stomatology, Zhejiang University, Hangzhou, China
Zhihang Tang
Zhejiang Lab | Tianjin University
Distributed Computing, Heterogeneous Computing
Xiaotang Gai
Zhejiang University
Jian Wu
Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence, Zhejiang University, Hangzhou, China
Zuozhu Liu
Assistant Professor, Zhejiang University / University of Illinois Urbana-Champaign
Deep Learning, Vision-Language Models, Medical AI