🤖 AI Summary
Current colonoscopy analysis relies predominantly on single-modality methods, which offer limited representational capacity and lack systematic multimodal synergy. Method: We propose the first multimodal intelligent analysis framework for colonoscopy, built upon a large-scale, colonoscopy-specific multimodal instruction dataset (ColonINST), a lightweight vision-language model (ColonGPT), and a unified benchmark covering image classification, object detection, semantic segmentation, and vision-language understanding. Contribution/Results: This work introduces the first colonoscopy-oriented multimodal instruction-tuning paradigm; releases the open-source dataset ColonINST, the model ColonGPT, and the evaluation platform IntelliScope; and advances endoscopic analysis from unimodal perception toward multimodal understanding and clinical decision support, establishing a scalable technical foundation for intelligent colorectal cancer screening.
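The summary above centers on multimodal instruction tuning over paired image-text data. As a concrete illustration, the minimal sketch below shows what an instruction-tuning record might look like for two of the benchmark tasks; all field names, file paths, and the `<bbox>` tag format are hypothetical assumptions for illustration, not the actual ColonINST schema.

```python
import json

# Hypothetical illustration only: the schema below is an assumption,
# not the actual ColonINST format. Each record pairs a colonoscopy
# image with a task-specific instruction and its expected answer.
samples = [
    {
        "image": "images/frame_0001.jpg",   # path to a colonoscopy frame (hypothetical)
        "task": "classification",
        "instruction": "Categorize the object shown in this colonoscopy image.",
        "answer": "polyp",
    },
    {
        "image": "images/frame_0001.jpg",   # the same image can serve multiple tasks
        "task": "region_description",
        "instruction": "Describe the region at <bbox>[120, 88, 256, 210]</bbox>.",
        "answer": "An elevated sessile polyp with an irregular surface texture.",
    },
]

# Instruction-tuned vision-language models are commonly trained from a
# flat JSON Lines file of such records; this writes one for inspection.
with open("colon_instructions.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")
```

Storing one record per (image, instruction) pair lets a single image contribute to several tasks at once, which is one common way such instruction datasets reach large scale.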
📝 Abstract
Colonoscopy is currently one of the most sensitive screening methods for colorectal cancer. This study investigates the frontiers of intelligent colonoscopy techniques and their prospective implications for multimodal medical applications. To this end, we begin by assessing the current data-centric and model-centric landscapes across four tasks for colonoscopic scene perception: classification, detection, segmentation, and vision-language understanding. This assessment enables us to identify domain-specific challenges and reveals that multimodal research in colonoscopy remains open for further exploration. To embrace the coming multimodal era, we establish three foundational initiatives: a large-scale multimodal instruction-tuning dataset, ColonINST; a multimodal language model designed for colonoscopy, ColonGPT; and a multimodal benchmark. To facilitate ongoing monitoring of this rapidly evolving field, we provide a public website for the latest updates: https://github.com/ai4colonoscopy/IntelliScope.