S-Chain: Structured Visual Chain-of-Thought For Medicine

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Medical vision-language models (VLMs) face a fundamental trade-off between reasoning accuracy and interpretability. To address this, we introduce S-Chain, the first large-scale, fine-grained, vision-language-aligned medical visual chain-of-thought dataset, comprising 12K expert-annotated medical images and over 700K multilingual visual question-answering (VQA) pairs, with explicit structured alignment between reasoning steps and visual regions (e.g., lesion bounding boxes). The resulting Structured Visual Chain-of-Thought (SV-CoT) supervision is combined with medical knowledge-guided vision-language alignment, retrieval-augmented generation, and autoregressive reasoning. Extensive evaluation across state-of-the-art VLMs demonstrates that SV-CoT supervision significantly improves interpretability, visual grounding accuracy (+12.3%), and robustness in cross-lingual and cross-modal reasoning. This work establishes a new paradigm for trustworthy, clinically grounded AI.

📝 Abstract
Faithful reasoning in medical vision-language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise in medical visual question answering (VQA), no large-scale expert-level dataset has captured stepwise reasoning with precise visual grounding. We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT), explicitly linking visual regions to reasoning steps. The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. Using S-Chain, we benchmark state-of-the-art medical VLMs (ExGra-Med, LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5), showing that SV-CoT supervision significantly improves interpretability, grounding fidelity, and robustness. Beyond benchmarking, we study its synergy with retrieval-augmented generation, revealing how domain knowledge and visual grounding interact during autoregressive reasoning. Finally, we propose a new mechanism that strengthens the alignment between visual evidence and reasoning, improving both reliability and efficiency. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical VLMs.
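To make the dataset structure described in the abstract concrete, below is a minimal, hypothetical Python sketch of what a single S-Chain record could look like, with each reasoning step linked to the image regions it cites. The class and field names (SChainExample, ReasoningStep, boxes, language) and all values are illustrative assumptions, not the dataset's published schema.

```python
# Hypothetical sketch of an S-Chain-style record: field names and values are
# illustrative assumptions, not the dataset's actual schema. The only point is
# to show reasoning steps explicitly tied to bounding-box regions.

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReasoningStep:
    text: str                                   # one step of the structured visual CoT
    boxes: List[Tuple[int, int, int, int]]      # (x_min, y_min, x_max, y_max) regions cited by this step


@dataclass
class SChainExample:
    image_path: str
    question: str
    answer: str
    language: str                               # e.g., one of the 16 supported languages
    steps: List[ReasoningStep] = field(default_factory=list)


# Invented example values, for illustration only.
example = SChainExample(
    image_path="images/chest_xray_0001.png",
    question="Is there evidence of pneumonia?",
    answer="Yes, right lower lobe consolidation is consistent with pneumonia.",
    language="en",
    steps=[
        ReasoningStep("Identify an opacity in the right lower lung field.", [(412, 508, 596, 690)]),
        ReasoningStep("The opacity contains air bronchograms, suggesting consolidation.", [(430, 540, 580, 660)]),
        ReasoningStep("Consolidation in this distribution is consistent with pneumonia.", []),
    ],
)

if __name__ == "__main__":
    for i, step in enumerate(example.steps, 1):
        print(f"Step {i}: {step.text} -> regions {step.boxes}")
```

Keeping a per-step region list mirrors the abstract's description of explicit links between individual reasoning steps and visual evidence, rather than a single box per answer.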
Problem

Research questions and friction points this paper is trying to address.

Creating a structured visual reasoning dataset for medical image analysis
Improving interpretability and visual grounding in medical VLMs
Establishing a benchmark for trustworthy medical visual question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

A structured visual CoT (SV-CoT) dataset with bounding-box grounding
Multilingual VQA pairs (16 languages) for broad applicability
A mechanism that strengthens alignment between visual evidence and reasoning
👥 Authors
Khai Le-Duc (University of Toronto): Artificial Intelligence, Heal the world
Duy M. H. Nguyen (German Research Centre for Artificial Intelligence)
Phuong T. H. Trinh (Chonnam National University, South Korea)
Tien-Phat Nguyen (unknown affiliation): Computer Vision, Domain Adaptation, Medical Imaging
Nghiem T. Diep (German Research Centre for Artificial Intelligence)
An Ngo (Bucknell University, USA)
Tung Vu (Concordia University, Canada)
Trinh Vuong (Korea University)
Anh-Tien Nguyen (Justus Liebig University Giessen, Germany)
Mau Nguyen (Japan Advanced Institute of Science and Technology)
Van Trung Hoang (Hue University, Vietnam)
Khai-Nguyen Nguyen (College of William & Mary, USA)
Hy Nguyen (Deakin University, Australia)
Chris Ngo (Knovel Engineering)
Anji Liu (Assistant Professor, National University of Singapore): Machine Learning, Generative Models, Probabilistic Circuits
Nhat Ho (Assistant Professor, University of Texas at Austin): Machine Learning, Bayesian Statistics, Optimization, Optimal Transport, Deep Learning
Anne-Christin Hauschild (University Professor, Justus-Liebig University Gießen): Machine Learning, Explainable AI, Bioinformatics, Biomedical Data Science, Biostatistics
Khanh Xuan Nguyen (University of California, Berkeley, USA)
Thanh Nguyen-Tang (Johns Hopkins University): Machine Learning
Pengtao Xie (Associate Professor, UC San Diego; Adjunct Faculty, MBZUAI): Machine Learning
Daniel Sonntag (DFKI and University of Oldenburg): Interactive Machine Learning, Intelligent User Interfaces, Multimodal Interaction
James Zou (Stanford University): Machine Learning, Computational Biology, Computational Health, Statistics, Biotech
Mathias Niepert (University of Stuttgart & NEC Labs Europe): Machine Learning
Anh Totti Nguyen (Associate Professor, Auburn University): Machine Learning, Explainable AI, Computer Vision, NLP