🤖 AI Summary
Medical vision-language models (VLMs) typically rely on single-pass forward inference and fail to emulate human experts’ iterative, multi-round focusing on and refinement of lesion regions, leaving a gap between perceptual capability and clinical diagnostic reasoning. To address this, we propose ViTAR, the first framework that explicitly models expert diagnostic behavior as a learnable “Think–Act–Re-think–Answer” cognitive chain, enabling interactive, fine-grained visual reasoning. Methodologically, we curate a high-quality interactive dataset and adopt a two-stage training paradigm: supervised fine-tuning to guide the cognitive trajectory, followed by reinforcement learning to optimize sequential decision-making. ViTAR jointly localizes lesions and grounds semantic understanding via high-precision VQA annotations and attention-guided alignment. On multiple medical VLM benchmarks, ViTAR significantly outperforms state-of-the-art methods, and attention visualization confirms progressive convergence onto clinically critical regions across iterations, demonstrating simultaneous gains in diagnostic accuracy and interpretability.
📝 Abstract
Medical vision-language models (VLMs) excel at image-text understanding but typically rely on single-pass reasoning that neglects localized visual cues. In clinical practice, however, human experts iteratively scan, focus on, and refine regions of interest before reaching a final diagnosis. To narrow this machine-human perception gap, we introduce ViTAR, a novel VLM framework that emulates the iterative reasoning process of human experts through a "think-act-rethink-answer" cognitive chain. ViTAR treats medical images as interactive objects, enabling models to engage in multi-step visual reasoning. To support this approach, we curate a high-quality instruction dataset comprising 1K interactive examples that encode expert-like diagnostic behaviors, together with a 16K-example visual question answering (VQA) dataset for fine-grained visual diagnosis. We introduce a two-stage training strategy that begins with supervised fine-tuning to guide cognitive trajectories, followed by reinforcement learning to optimize decision-making. Extensive evaluations demonstrate that ViTAR outperforms strong state-of-the-art models. Visual attention analysis reveals that from the "think" to the "rethink" rounds, ViTAR increasingly anchors visual grounding to clinically critical regions and maintains high attention allocation to visual tokens during reasoning, providing mechanistic insight into its improved performance. These findings demonstrate that embedding expert-style iterative thinking chains into VLMs enhances both the performance and the trustworthiness of medical AI.
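The "think-act-rethink-answer" loop described above can be sketched in pseudocode-style Python. This is a hypothetical illustration only: the function names (`propose_region`, `is_confident`, `crop`), the policy, and the toy grid "image" are stand-ins chosen for clarity, not ViTAR's actual components or API.

```python
def crop(view, region):
    """ACT: restrict the current view to the proposed box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    return [row[x0:x1] for row in view[y0:y1]]

def diagnose(image, propose_region, is_confident, answer, max_rounds=3):
    """Run the think-act-rethink loop, then answer from the refined view."""
    view, trace = image, []
    for r in range(max_rounds):
        region = propose_region(view)      # THINK: pick a candidate lesion region
        view = crop(view, region)          # ACT: zoom into that region
        done = is_confident(view)          # RETHINK: is the evidence sufficient?
        trace.append(("round", r, region, done))
        if done:
            break
    return answer(view), trace             # ANSWER: final diagnosis

# Toy usage: a 4x4 "image" whose lesion is the bright value 9.
image = [[0, 0, 0, 0],
         [0, 0, 9, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]

def propose_region(view):
    # Crude focusing policy: a tight box around the brightest pixel.
    y, row = max(enumerate(view), key=lambda p: max(p[1]))
    x = row.index(max(row))
    return (x, y, x + 1, y + 1)

pred, trace = diagnose(
    image,
    propose_region,
    is_confident=lambda v: len(v) == 1 and len(v[0]) == 1,
    answer=lambda v: "lesion" if max(max(r) for r in v) >= 5 else "normal",
)
# pred == "lesion"; trace records one converged focusing round.
```

In ViTAR itself the "think" and "rethink" steps are produced by the VLM and shaped by supervised fine-tuning and reinforcement learning; the sketch only makes the control flow of the cognitive chain concrete.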