Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

📅 2026-03-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two failure modes of multimodal large language models in gastrointestinal endoscopy diagnosis: they deviate from clinical reasoning pathways, and they are misled by spurious visual correlations. The authors propose the Clinical-Cognitive-Aligned (CogAlign) framework. It first applies supervised fine-tuning on a hierarchically structured clinical cognition dataset so that the model internalizes expert diagnostic logic. It then applies counterfactual reinforcement learning with clinical-cognition-based rewards, steering the model toward causal lesion features rather than superficial correlates. The framework combines structured clinical reasoning with counterfactual causal rectification in a multimodal foundation model, and the authors report state-of-the-art performance across multiple gastrointestinal diagnostic benchmarks, with significantly improved diagnostic accuracy in complex clinical scenarios.
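To make the idea of a hierarchically structured fine-tuning record concrete, here is a minimal sketch of how one training target might be laid out. The three stages (anatomical localization, morphological evaluation, microvascular analysis) come from the abstract; the class name `CognitionSample`, its field names, and the numbered text layout are assumptions for illustration and are not taken from the paper's dataset.

```python
from dataclasses import dataclass


@dataclass
class CognitionSample:
    """One hypothetical SFT record; all field names are illustrative."""
    anatomical_localization: str
    morphological_evaluation: str
    microvascular_analysis: str
    diagnosis: str

    def to_target_text(self) -> str:
        """Render the stages in a fixed order, ending with the diagnosis."""
        steps = [
            ("Anatomical localization", self.anatomical_localization),
            ("Morphological evaluation", self.morphological_evaluation),
            ("Microvascular analysis", self.microvascular_analysis),
            ("Diagnosis", self.diagnosis),
        ]
        return "\n".join(
            f"{i}. {name}: {text}" for i, (name, text) in enumerate(steps, 1)
        )


# Placeholder strings stand in for real annotations.
sample = CognitionSample("site", "shape", "vessels", "label")
target_text = sample.to_target_text()
```

Fixing the stage order in the target text is one simple way to make the model emit its reasoning in the same sequence an expert would follow, with the diagnosis stated only after the supporting observations.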

📝 Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing a hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.
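The counterfactual step can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the abstract says counterfactual normal samples are generated by lesion masking, but it does not say how masked pixels are filled or how the reward is computed. The mean-background fill, the function names `mask_lesion` and `counterfactual_reward`, and the 0.5/0.5 reward split are all hypothetical.

```python
from typing import List

Image = List[List[float]]  # grayscale image as nested lists
Mask = List[List[int]]     # 1 marks lesion pixels, 0 marks background


def mask_lesion(image: Image, lesion_mask: Mask) -> Image:
    """Build a counterfactual 'normal' image by overwriting lesion pixels
    with the mean background intensity (an assumed fill strategy)."""
    background = [
        px
        for row, mask_row in zip(image, lesion_mask)
        for px, m in zip(row, mask_row)
        if m == 0
    ]
    fill = sum(background) / len(background) if background else 0.0
    return [
        [fill if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(image, lesion_mask)
    ]


def counterfactual_reward(
    pred_original: str,
    pred_masked: str,
    true_label: str,
    normal_label: str = "normal",
) -> float:
    """Hypothetical reward: half for the correct diagnosis on the original
    image, half for predicting 'normal' once the lesion is masked out."""
    correct_on_original = 0.5 if pred_original == true_label else 0.0
    flips_to_normal = 0.5 if pred_masked == normal_label else 0.0
    return correct_on_original + flips_to_normal


image = [[0.2, 0.2], [0.9, 0.4]]
lesion_mask = [[0, 0], [1, 0]]
counterfactual = mask_lesion(image, lesion_mask)
```

A model that keeps the same diagnosis after the lesion is removed is relying on background cues, so under this toy reward it earns at most half credit. Full credit requires the prediction to change when the causal feature disappears.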

Problem

Research questions and friction points this paper is trying to address.

Clinical Cognition Alignment

Multimodal LLMs

Gastrointestinal Diagnosis

Causal Association

Visual Bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

Clinical Cognition Alignment

Multimodal LLMs

Counterfactual Reinforcement Learning