🤖 AI Summary
Medical vision-language models (VLMs) face a fundamental trade-off between reasoning accuracy and interpretability. To address this, we introduce S-Chain, the first large-scale, fine-grained, vision-language-aligned medical visual chain-of-thought dataset, comprising 12K expert-annotated medical images and over 700k multilingual visual question-answering (VQA) pairs, with structured visual chain-of-thought (SV-CoT) annotations that explicitly align each reasoning step with the visual regions (e.g., lesion bounding boxes) supporting it. Building on this dataset, our method integrates medical knowledge-guided vision-language alignment, retrieval-augmented generation, and autoregressive reasoning. Extensive evaluation across state-of-the-art VLMs demonstrates that SV-CoT supervision substantially improves interpretability, visual grounding accuracy (+12.3%), and robustness in cross-lingual and cross-modal reasoning. This work establishes a new paradigm for trustworthy, clinically grounded AI.
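To make the step-to-region alignment concrete, here is a minimal sketch of what one S-Chain sample might look like. The schema is a hypothetical illustration: field names such as `question`, `steps`, and `bbox`, and all example contents, are assumptions, not the dataset's published format.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    """One step of a structured visual chain of thought (SV-CoT).

    Hypothetical schema: each textual rationale step is paired with the
    image region (x, y, w, h in pixels) that grounds it.
    """
    text: str                        # rationale for this step
    bbox: tuple[int, int, int, int]  # region of the image supporting the step

@dataclass
class SChainSample:
    """Illustrative container for one S-Chain VQA example."""
    image_path: str
    language: str                    # one of the 16 supported languages
    question: str
    steps: list[ReasoningStep] = field(default_factory=list)
    answer: str = ""

# Toy example; all values are invented for illustration.
sample = SChainSample(
    image_path="images/chest_xray_0001.png",
    language="en",
    question="Is there evidence of pneumonia?",
    steps=[
        ReasoningStep("Focal consolidation in the right lower lobe.", (212, 340, 96, 80)),
        ReasoningStep("No pleural effusion at the costophrenic angles.", (40, 420, 420, 60)),
    ],
    answer="Yes, findings are consistent with right lower lobe pneumonia.",
)
```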
📝 Abstract
Faithful reasoning in medical vision-language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise in medical visual question answering (VQA), no large-scale expert-level dataset has captured stepwise reasoning with precise visual grounding. We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT), explicitly linking visual regions to reasoning steps. The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. Using S-Chain, we benchmark state-of-the-art medical VLMs (ExGra-Med, LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5), showing that SV-CoT supervision significantly improves interpretability, grounding fidelity, and robustness. Beyond benchmarking, we study the synergy between SV-CoT supervision and retrieval-augmented generation, revealing how domain knowledge and visual grounding interact during autoregressive reasoning. Finally, we propose a new mechanism that strengthens the alignment between visual evidence and reasoning, improving both reliability and efficiency. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical VLMs.
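For intuition on how such grounded annotations could drive autoregressive training, the following sketch (continuing the hypothetical `SChainSample` above) serializes each reasoning step together with its bounding box into the target sequence, so a VLM is supervised to emit evidence regions alongside its rationale. The `<step>`/`<region>` markers and the serialization format are assumptions for illustration; they are not the paper's specified mechanism.

```python
def serialize_sv_cot(sample: SChainSample) -> str:
    """Render one sample as an autoregressive training target.

    Assumed format: each rationale step is wrapped in <step>/<region>
    markers so the model learns to ground every step in a bounding box.
    """
    parts = [f"Question: {sample.question}"]
    for i, step in enumerate(sample.steps, start=1):
        x, y, w, h = step.bbox
        parts.append(f"<step {i}> {step.text} <region> [{x}, {y}, {w}, {h}] </region>")
    parts.append(f"Answer: {sample.answer}")
    return "\n".join(parts)

print(serialize_sv_cot(sample))
```

Under this kind of serialization, standard next-token supervision penalizes rationale steps whose emitted regions drift from the annotated evidence, which is one plausible way grounding fidelity and reasoning quality can be trained jointly.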