🤖 AI Summary
MedVQA models suffer from low reliability in cross-domain deployment because they rely too little on image evidence and adapt poorly without retraining or additional annotations. To address this, we propose a test-time training (TTT) method that introduces a visual chain-of-thought (Visual CoT) signal during inference, without updating the frozen vision-language backbone. Our approach iteratively optimizes soft prompts to localize salient image regions; it then constructs a self-supervised signal by enforcing answer consistency between the full image and its localized crops. This enables plug-and-play adaptation with enhanced interpretability and evidence grounding. On PathVQA, our method improves the closed-ended accuracy of LLaVA by 12.3%, while significantly boosting cross-domain robustness and clinical utility.
📝 Abstract
Medical visual question answering could support clinical decision making, yet current systems often fail under domain shift and produce answers that are weakly grounded in image evidence. This reliability gap arises when models attend to spurious regions and when retraining or additional labels are impractical at deployment time. We address this setting with CoTBox-TTT, an evidence-first test-time training approach that adapts a vision-language model at inference while keeping all backbones frozen. The method updates only a small set of continuous soft prompts. It identifies question-relevant regions through a visual chain-of-thought signal and encourages answer consistency across the original image and a localized crop. The procedure is label-free and plug-and-play across diverse backbones. Experiments on medical VQA show that the approach is practical for real deployments. For instance, adding CoTBox-TTT to LLaVA increases closed-ended accuracy by 12.3% on PathVQA.
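The adaptation loop described above can be sketched with a toy stand-in for the frozen backbone. Everything below is illustrative, not the paper's implementation: the linear head, the feature vectors for the full image and its crop, the additive soft prompt, and the hyperparameters are all assumptions; the real method would optimize continuous prompts of a vision-language model with backpropagation. The sketch shows the core idea of updating only the prompt to minimize a KL consistency loss between the answer distributions for the full image and a question-relevant crop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "backbone": a fixed linear answer head standing in for the VLM.
W = rng.normal(size=(4, 8))               # 4 answer classes, 8-dim features

# Toy features for the full image and a question-relevant crop.
x_full = rng.normal(size=8)
x_crop = x_full + 0.3 * rng.normal(size=8)  # crop sees mostly the same evidence

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def consistency_loss(p):
    """KL(full || crop) between answer distributions, with soft prompt p
    added to the features. Only p is ever updated; W stays frozen."""
    q_full = softmax(W @ (x_full + p))
    q_crop = softmax(W @ (x_crop + p))
    return float(np.sum(q_full * (np.log(q_full) - np.log(q_crop))))

def adapt(p, lr=0.1, steps=150, eps=1e-5):
    """Label-free test-time update of the soft prompt via central-difference
    numerical gradients (a real system would use autograd)."""
    for _ in range(steps):
        grad = np.zeros_like(p)
        for i in range(len(p)):
            d = np.zeros_like(p)
            d[i] = eps
            grad[i] = (consistency_loss(p + d) - consistency_loss(p - d)) / (2 * eps)
        p = p - lr * grad
    return p

p0 = np.zeros(8)
before = consistency_loss(p0)
after = consistency_loss(adapt(p0))
print(f"consistency loss: {before:.4f} -> {after:.4f}")
```

Because the loss needs no labels, this update can run once per test question, which is what makes the procedure deployable under domain shift without retraining.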