CoTBox-TTT: Grounding Medical VQA with Visual Chain-of-Thought Boxes During Test-time Training

📅 2025-11-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
MedVQA models suffer from low reliability in cross-domain deployment because they rely too little on image evidence and cannot adapt without retraining or additional annotations. To address this, we propose a test-time adaptation (TTA) method that introduces visual chain-of-thought (Visual CoT) reasoning during inference, without updating the frozen vision-language backbone. Our approach iteratively optimizes soft prompts to localize salient image regions; it then constructs a self-supervised signal by enforcing answer consistency between the full image and its localized crops. This enables plug-and-play adaptation with improved interpretability and evidence grounding. On pathVQA, our method improves the closed-ended accuracy of LLaVA by 12.3%, while significantly boosting cross-domain robustness and clinical utility.

📝 Abstract
Medical visual question answering could support clinical decision making, yet current systems often fail under domain shift and produce answers that are weakly grounded in image evidence. This reliability gap arises when models attend to spurious regions and when retraining or additional labels are impractical at deployment time. We address this setting with CoTBox-TTT, an evidence-first test-time training approach that adapts a vision-language model at inference while keeping all backbones frozen. The method updates only a small set of continuous soft prompts. It identifies question-relevant regions through a visual chain-of-thought signal and encourages answer consistency across the original image and a localized crop. The procedure is label-free and plug-and-play with diverse backbones. Experiments on medical VQA show that the approach is practical for real deployments. For instance, adding CoTBox-TTT to LLaVA increases closed-ended accuracy by 12.3% on pathVQA.
Problem

Research questions and friction points this paper is trying to address.

Improving medical VQA reliability under domain shift
Enhancing answer grounding through visual evidence
Enabling test-time adaptation without retraining models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Updates only continuous soft prompts during inference
Uses visual chain-of-thought to identify relevant regions
Ensures answer consistency between original and localized images
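The mechanics of the innovations above can be illustrated with a minimal sketch: a frozen backbone, a small learnable soft prompt, and a label-free consistency loss between the full image and a localized crop. The paper does not publish this code; everything below (the linear stub standing in for the vision-language model, the feature shapes, the optimizer settings) is assumed for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

NUM_ANSWERS = 2   # closed-ended VQA, e.g. yes / no
PROMPT_DIM = 8    # length of the continuous soft prompt
FEAT_DIM = 4      # toy image-feature dimension

# Tiny linear stub standing in for the frozen vision-language backbone.
frozen_backbone = torch.nn.Linear(PROMPT_DIM + FEAT_DIM, NUM_ANSWERS)
for p in frozen_backbone.parameters():
    p.requires_grad_(False)  # the backbone is never updated at test time

def answer_logits(image_feat, soft_prompt):
    """Answer distribution from image features conditioned on the soft prompt."""
    return frozen_backbone(torch.cat([soft_prompt, image_feat]))

# Features for the full image and for a localized crop. In the actual
# method the crop would come from a visual chain-of-thought box; here it
# is simulated as a perturbed copy of the full-image features.
full_feat = torch.randn(FEAT_DIM)
crop_feat = full_feat + 0.1 * torch.randn(FEAT_DIM)

# Only the soft prompt receives gradients during test-time training.
soft_prompt = torch.zeros(PROMPT_DIM, requires_grad=True)
opt = torch.optim.Adam([soft_prompt], lr=0.1)

losses = []
for step in range(20):
    log_p_full = F.log_softmax(answer_logits(full_feat, soft_prompt), dim=-1)
    p_crop = F.softmax(answer_logits(crop_feat, soft_prompt), dim=-1)
    # Label-free self-supervision: the answer predicted on the crop
    # should agree with the answer predicted on the full image.
    loss = F.kl_div(log_p_full, p_crop, reduction="sum")
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

final_answer = answer_logits(full_feat, soft_prompt).argmax().item()
print(f"consistency loss: {losses[0]:.4f} -> {losses[-1]:.4f}, answer={final_answer}")
```

Because the backbone is frozen and only the prompt moves, the adaptation is cheap and plug-and-play: swapping in a different backbone only changes `answer_logits`, not the optimization loop.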
Jiahe Qian
Institute of Automation, Chinese Academy of Sciences
Yuhao Shen
School of Data Science, The Chinese University of Hong Kong, Shenzhen
Zhangtianyi Chen
School of Data Science, The Chinese University of Hong Kong, Shenzhen
Juexiao Zhou
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
AI for Healthcare · Ethical AI · Bioinformatics · Privacy · AGI
Peisong Wang
CASIA
Deep Neural Network Acceleration and Compression