ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models

📅 2026-03-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses factual hallucinations in medical vision-language models, which often arise when the model neglects localized pathological evidence. Existing alignment methods optimize only the final output and provide no visual guidance during intermediate reasoning steps. To bridge this gap, the authors propose ClinCoT, a framework that introduces visual chain-of-thought reasoning into medical multimodal inference. ClinCoT generates clinically plausible reasoning chains through hypothesis-driven region proposals and uses multiple Med-LLM evaluators to automatically produce scored preference data. The framework further incorporates score-margin-aware optimization and a dynamic iterative learning mechanism to align the model at the level of region-based reasoning trajectories. Evaluated on three medical VQA and report generation benchmarks, ClinCoT improves factual grounding and outperforms existing preference alignment approaches.
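As a rough illustration of how scored preference data might be assembled from several evaluators, the sketch below averages per-evaluator scores and pairs up candidates whose average scores differ. The function name `build_preference_pairs`, the averaging rule, and the `min_margin` filter are assumptions made for this example; the summary does not specify how ClinCoT aggregates evaluator scores.

```python
from itertools import combinations
from statistics import mean


def build_preference_pairs(candidates, evaluator_scores, min_margin=0.0):
    """Turn scored candidates into (chosen, rejected, margin) records.

    candidates: list of candidate responses (e.g. reasoning chains).
    evaluator_scores: one list of scores per evaluator, aligned with candidates.
    min_margin: hypothetical threshold; pairs whose score gap is not above it are dropped.
    """
    # Assumption: aggregate evaluators by a plain average per candidate.
    avg = [mean(scores[i] for scores in evaluator_scores)
           for i in range(len(candidates))]

    pairs = []
    for i, j in combinations(range(len(candidates)), 2):
        margin = avg[i] - avg[j]
        if abs(margin) <= min_margin:
            continue  # Ties and near-ties carry no preference signal.
        chosen, rejected = (i, j) if margin > 0 else (j, i)
        pairs.append({
            "chosen": candidates[chosen],
            "rejected": candidates[rejected],
            "margin": abs(margin),
        })
    return pairs


if __name__ == "__main__":
    demo = build_preference_pairs(
        ["chain A", "chain B", "chain C"],
        [[8, 5, 5], [6, 5, 5]],  # two hypothetical evaluators
    )
    for pair in demo:
        print(pair)
```

Keeping the score gap alongside each pair is what allows a later training step to weight strongly separated pairs differently from marginal ones.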

📝 Abstract

Medical Vision-Language Models have shown promising potential in clinical decision support, yet they remain prone to factual hallucinations due to insufficient grounding in localized pathological evidence. Existing medical alignment methods primarily operate at the response level through preference optimization, improving output correctness but leaving intermediate reasoning weakly connected to visual regions. Although chain-of-thought (CoT) enhances multimodal reasoning, it remains largely text-centric, limiting effective integration of clinical visual cues. To address this gap, we propose ClinCoT, a clinical-aware visual chain-of-thought framework that transforms preference optimization from response-level correction to visually driven reasoning. We introduce an automatic data generation pipeline that constructs clinically grounded preference pairs through reasoning with hypothesis-driven region proposals. Multiple Med-LLM evaluators rank and score each response, and these rankings serve as supervision to train the target model. We further introduce a scoring-based margin-aware optimization strategy that incorporates both preference ranking and score difference to refine region-level reasoning trajectories. To maintain alignment as the model's policy evolves during training, we adopt an iterative learning scheme that dynamically regenerates preference data. Extensive experiments on three medical VQA and report generation benchmarks demonstrate that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
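To make the idea of combining preference ranking with score difference concrete, here is a minimal sketch of one common way to do it: a DPO-style pairwise loss whose logit is shifted by the evaluator score gap. This is an assumption for illustration, not the objective from the paper, which the abstract does not state. The function name and the `beta` and `gamma` hyperparameters are hypothetical.

```python
import math


def margin_aware_preference_loss(logp_chosen, logp_rejected,
                                 ref_logp_chosen, ref_logp_rejected,
                                 score_margin, beta=0.1, gamma=1.0):
    """DPO-style loss with a score-dependent margin (illustrative assumption).

    logp_*: sequence log-probabilities under the policy being trained.
    ref_logp_*: the same quantities under a frozen reference model.
    score_margin: evaluator score of the chosen response minus the rejected one.
    """
    # Implicit reward gap between chosen and rejected, as in standard DPO.
    reward_gap = beta * ((logp_chosen - ref_logp_chosen)
                         - (logp_rejected - ref_logp_rejected))
    # Larger score gaps demand a larger reward gap before the loss shrinks.
    logits = reward_gap - gamma * score_margin
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    base = margin_aware_preference_loss(-10.0, -12.0, -11.0, -11.0, score_margin=0.0)
    wide = margin_aware_preference_loss(-10.0, -12.0, -11.0, -11.0, score_margin=2.0)
    print(f"no margin: {base:.4f}, margin 2.0: {wide:.4f}")
```

With `gamma` set to zero the function reduces to the ordinary DPO loss, so the margin term can be read as an extra penalty on pairs that the evaluators separated clearly but the policy does not yet separate.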

Problem

Research questions and friction points this paper is trying to address.

medical vision-language models

factual hallucinations

visual grounding

chain-of-thought

clinical reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Chain-of-Thought

Preference Optimization

Medical Vision-Language Models