🤖 AI Summary
MedVQA models suffer from low reliability in cross-domain deployment because they rely too little on image evidence and adapt poorly without retraining or additional annotations. To address this, we propose a test-time training (TTT) method that introduces a visual chain-of-thought (Visual CoT) signal during inference, without updating the frozen vision-language backbone. Our approach iteratively optimizes soft prompts to localize salient image regions; it then constructs a self-supervised signal by enforcing answer consistency between the full image and its localized crops. This enables plug-and-play adaptation with enhanced interpretability and evidence grounding. On PathVQA, our method improves the closed-ended accuracy of LLaVA by 12.3%, while significantly boosting cross-domain robustness and clinical utility.
📝 Abstract
Medical visual question answering could support clinical decision making, yet current systems often fail under domain shift and produce answers that are weakly grounded in image evidence. This reliability gap arises when models attend to spurious regions and when retraining or additional labels are impractical at deployment time. We address this setting with CoTBox-TTT, an evidence-first test-time training approach that adapts a vision-language model at inference while keeping all backbones frozen. The method updates only a small set of continuous soft prompts. It identifies question-relevant regions through a visual chain-of-thought signal and encourages answer consistency across the original image and a localized crop. The procedure is label-free and plug-and-play across diverse backbones. Experiments on medical VQA show that the approach is practical for real deployments. For instance, adding CoTBox-TTT to LLaVA increases closed-ended accuracy by 12.3% on PathVQA.
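The adaptation loop described above can be sketched with a toy stand-in for the frozen backbone. Everything below is illustrative, not the paper's implementation: the linear head, the feature vectors for the full image and its crop, the additive soft prompt, and the hyperparameters are all assumptions; the real method would optimize continuous prompts of a vision-language model with backpropagation. The sketch shows the core idea of updating only the prompt to minimize a KL consistency loss between the answer distributions for the full image and a question-relevant crop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "backbone": a fixed linear answer head standing in for the VLM.
W = rng.normal(size=(4, 8))               # 4 answer classes, 8-dim features

# Toy features for the full image and a question-relevant crop.
x_full = rng.normal(size=8)
x_crop = x_full + 0.3 * rng.normal(size=8)  # crop sees mostly the same evidence

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def consistency_loss(p):
    """KL(full || crop) between answer distributions, with soft prompt p
    added to the features. Only p is ever updated; W stays frozen."""
    q_full = softmax(W @ (x_full + p))
    q_crop = softmax(W @ (x_crop + p))
    return float(np.sum(q_full * (np.log(q_full) - np.log(q_crop))))

def adapt(p, lr=0.1, steps=150, eps=1e-5):
    """Label-free test-time update of the soft prompt via central-difference
    numerical gradients (a real system would use autograd)."""
    for _ in range(steps):
        grad = np.zeros_like(p)
        for i in range(len(p)):
            d = np.zeros_like(p)
            d[i] = eps
            grad[i] = (consistency_loss(p + d) - consistency_loss(p - d)) / (2 * eps)
        p = p - lr * grad
    return p

p0 = np.zeros(8)
before = consistency_loss(p0)
after = consistency_loss(adapt(p0))
print(f"consistency loss: {before:.4f} -> {after:.4f}")
```

Because the loss needs no labels, this update can run once per test question, which is what makes the procedure deployable under domain shift without retraining.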