Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of advancing multimodal understanding in colonoscopy toward clinically grounded reasoning. To this end, we introduce ColonVQA—the first colonoscopy-specific visual question answering dataset (1.1M+ samples)—and ColonReason, a clinical reasoning benchmark, both annotated via a novel multi-expert debate paradigm. Building upon these resources, we propose ColonR1, an R1-type reasoning model that incorporates task-adaptive reward mechanisms and gradient-stabilized optimization to enable robust decision-making under low-resource conditions. In data-scarce settings, ColonR1 achieves 56.61% accuracy—outperforming supervised fine-tuning by 25.22 percentage points. To our knowledge, this is the first framework to establish a closed-loop pipeline from perceptual multimodal understanding to interpretable, clinically actionable reasoning, thereby substantially improving the clinical trustworthiness and practical utility of colonoscopy AI systems.

📝 Abstract
In this study, we present Colon-X, an open initiative aimed at advancing multimodal intelligence in colonoscopy. We begin by constructing ColonVQA, the most comprehensive multimodal dataset ever built for colonoscopy, featuring over 1.1M visual question answering entries across 76 clinical findings and 18 multimodal tasks. Beyond serving as a community-wide data foundation, we further investigate a critical yet underexplored transition in colonoscopy: evolving from multimodal understanding to clinical reasoning. (a) To capture the current landscape of multimodal understanding behaviors, we systematically assess the generalizability of 22 multimodal large language models (MLLMs) and examine their reliability under human-induced perturbations. The results reveal that clinical outputs from leading MLLMs remain far from robust and trustworthy. (b) To narrow this gap, we further explore reasoning-centric intelligence tailored for colonoscopy. Specifically, we curate ColonReason, a clinically grounded reasoning dataset annotated through a multi-expert debating pipeline, and develop ColonR1, the first R1-styled model incorporating task-adaptive rewarding and gradient-stable optimization techniques. Under data-scarce conditions, our ColonR1 achieves 56.61% overall accuracy, outperforming supervised fine-tuning by 25.22%, and sets a new reasoning-enabled baseline for multimodal colonoscopy analysis. All data and model resources are publicly available at https://github.com/ai4colonoscopy/Colon-X.
Problem

Research questions and friction points this paper is trying to address.

Advancing multimodal intelligence in colonoscopy analysis
Assessing reliability of multimodal models under clinical perturbations
Developing reasoning-focused models for data-scarce clinical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed comprehensive multimodal dataset ColonVQA
Developed reasoning dataset ColonReason with expert debating pipeline
Created ColonR1 model using adaptive rewarding and stable optimization
Ge-Peng Ji
School of Computing, Australian National University
Jingyi Liu
VCIP, CS, Nankai University
Deng-Ping Fan
VCIP, CS, Nankai University
Nick Barnes
Professor, Australian National University
Computer Vision · 3D Vision · Saliency · Prosthetic Vision · Cognitive Vision