Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of advancing multimodal understanding in colonoscopy toward clinically grounded reasoning. To this end, we introduce ColonVQA—the first colonoscopy-specific visual question answering dataset (1.1M+ samples)—and ColonReason, a clinical reasoning benchmark, both annotated via a novel multi-expert debate paradigm. Building upon these resources, we propose ColonR1, an R1-type reasoning model that incorporates task-adaptive reward mechanisms and gradient-stabilized optimization to enable robust decision-making under low-resource conditions. In data-scarce settings, ColonR1 achieves 56.61% accuracy—outperforming supervised fine-tuning by 25.22 percentage points. To our knowledge, this is the first framework to establish a closed-loop pipeline from perceptual multimodal understanding to interpretable, clinically actionable reasoning, thereby substantially improving the clinical trustworthiness and practical utility of colonoscopy AI systems.

📝 Abstract
In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy: evolving from multimodal understanding to clinical reasoning. (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models (MLLMs) and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon-X.
Problem

Research questions and friction points this paper is trying to address.

Advancing multimodal intelligence in colonoscopy analysis
Assessing reliability of multimodal models under clinical perturbations
Developing reasoning-focused models for data-scarce clinical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed comprehensive multimodal dataset ColonVQA
Developed reasoning dataset ColonReason with expert debating pipeline
Created ColonR1 model using adaptive rewarding and stable optimization
Ge-Peng Ji
School of Computing, Australian National University
Jingyi Liu
VCIP, CS, Nankai University
Deng-Ping Fan
VCIP, CS, Nankai University
Nick Barnes
Professor, Australian National University
Computer Vision · 3D Vision · Saliency · Prosthetic Vision · Cognitive Vision