Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

πŸ“… 2026-03-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the scarcity of large-scale, densely annotated, multi-category, long-sequence colonoscopy video datasets, which has hindered the application of multimodal large language models (MLLMs) to lesion recognition and understanding. To overcome this limitation, the authors develop a multi-agent collaborative workflow that integrates temporal proposal generation, object tracking, AI-based visual verification, and human review to produce high-density annotations for 528 full-length colonoscopy videos. They release the first benchmark of its kind, comprising 14 lesion categories, over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. They also introduce a "colon-skill" prompting strategy that improves zero-shot MLLM localization performance by up to 9.7%.

πŸ“ Abstract
Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs. The dataset and the code are available at https://abdullahamdi.com/colon-bench .
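The four annotation stages named in the abstract (temporal proposals, bounding-box tracking, AI-driven visual confirmation, human-in-the-loop review) can be sketched as a simple pipeline. This is a toy illustration under stated assumptions, not the paper's implementation: the stage functions, the frame dictionary format, and the single-frame noise heuristic are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Proposal:
    start: int   # first frame index of the candidate lesion interval
    end: int     # last frame index (inclusive)
    label: str   # e.g. "polyp", "ulcer", "bleeding"

@dataclass
class Track:
    proposal: Proposal
    boxes: dict  # frame index -> (x, y, w, h) bounding box

def propose(frames: list) -> list:
    """Stage 1 (toy): temporal proposals as contiguous runs of flagged frames."""
    proposals, start = [], None
    for i, f in enumerate(frames + [None]):  # trailing None flushes an open run
        if f is not None and f.get("lesion"):
            start = i if start is None else start
        elif start is not None:
            proposals.append(Proposal(start, i - 1, frames[start]["lesion"]))
            start = None
    return proposals

def track(p: Proposal, frames: list) -> Track:
    """Stage 2 (toy): propagate a bounding box across the proposal interval."""
    return Track(p, {i: frames[i].get("box", (0, 0, 1, 1))
                     for i in range(p.start, p.end + 1)})

def ai_verify(t: Track) -> bool:
    """Stage 3 (toy heuristic): drop single-frame tracks as likely noise."""
    return len(t.boxes) >= 2

def annotate(frames: list, review: Callable[[Track], bool]) -> list:
    """Chain the stages; `review` stands in for the human-in-the-loop step."""
    tracks = (track(p, frames) for p in propose(frames))
    return [t for t in tracks if ai_verify(t) and review(t)]
```

For example, a frame sequence with a two-frame polyp run and an isolated one-frame ulcer detection yields a single verified polyp track, with the one-frame detection filtered by the toy verification stage.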
Problem

Research questions and friction points this paper is trying to address.

dense annotation
colonoscopy videos
multimodal large language models
lesion detection
video dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic workflow
dense lesion annotation
multimodal large language models
colon-skill prompting
Open-Vocabulary Video Object Segmentation
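The "colon-skill" prompting idea, as described, prepends domain-specific guidance to an otherwise zero-shot query. A minimal sketch follows; the preamble text and the `with_colon_skill` helper are hypothetical illustrations, since the paper's actual skill prompt is not given here.

```python
# Hypothetical domain-skill preamble; the paper's actual wording is not reproduced here.
COLON_SKILL = (
    "You are assisting with colonoscopy review. Lesions appear as localized "
    "texture or color anomalies on the mucosa; report each location as a "
    "normalized [x, y, w, h] bounding box."
)

def with_colon_skill(question: str) -> str:
    """Wrap a zero-shot VQA question with the domain-skill preamble."""
    return f"{COLON_SKILL}\n\nQuestion: {question}"
```

The wrapped string would then be sent to the MLLM in place of the bare question.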
πŸ”Ž Similar Papers
No similar papers found.