🤖 AI Summary
To address clinical needs for early polyp detection, precise segmentation, fine-grained classification, and cross-frame tracking in colonoscopy videos, this paper proposes the first end-to-end unified framework that requires neither task-specific fine-tuning nor medical-domain pretraining. Methodologically: (1) we introduce a conditional mask loss that accommodates multi-format annotations and enables joint optimization across tasks; (2) we design an object-query-based unsupervised tracking module, eliminating heuristic design and explicit motion modeling; (3) we directly leverage natural-image pretrained Vision Transformers (ViTs) as backbones, enabling effective cross-domain transfer. Evaluated on multiple mainstream polyp benchmarks, our method achieves state-of-the-art performance across all four tasks—detection, segmentation, classification, and tracking—demonstrating superior robustness and generalizability for clinical decision support.
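The conditional mask loss is described here only at a high level: the loss term applied to each training sample depends on which annotation format that sample carries. The paper's actual formulation is not given in this summary, so the following is a toy sketch of the general idea in plain Python; the function names, the Dice/L1 choice of loss terms, and the dictionary-based sample format are all illustrative assumptions.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between a flat list of predicted probabilities
    and a flat binary ground-truth mask (illustrative choice of mask loss)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def box_l1_loss(pred_box, gt_box):
    """Mean absolute error over (x, y, w, h) box coordinates
    (illustrative choice of box loss)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / 4.0

def conditional_loss(pred, annotation):
    """Pick the supervision term that matches the annotation format:
    a mask term when pixel-level labels exist, a box term otherwise.
    This conditional switch is the core idea behind training one model
    on datasets with heterogeneous annotations."""
    if annotation.get("mask") is not None:
        return dice_loss(pred["mask"], annotation["mask"])
    return box_l1_loss(pred["box"], annotation["box"])
```

For example, a sample from a segmentation dataset would contribute a Dice term, while a sample from a detection-only dataset would contribute only a box-regression term, so both can appear in the same training batch.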
📝 Abstract
Early detection, accurate segmentation, classification, and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Existing deep-learning methods for analyzing colonoscopic videos typically require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification, and unsupervised tracking in colonoscopic videos. Our approach leverages a novel conditional mask loss that enables flexible training across datasets annotated with either pixel-level segmentation masks or bounding boxes, bypassing task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We use a robust vision foundation model backbone pre-trained without supervision on natural images, removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.
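The abstract states that tracking associates polyp instances across frames via object queries without heuristics, but does not spell out the matching procedure. One common way to realize query-based association is to compare per-instance query embeddings between consecutive frames and link the most similar pairs. The sketch below illustrates that general pattern in plain Python; the greedy matching, the cosine-similarity metric, and the threshold value are assumptions for illustration, not the paper's specified method.

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv + 1e-12)

def associate(prev_queries, curr_queries, threshold=0.5):
    """Greedily link current-frame object queries to previous-frame
    queries by embedding similarity, highest-similarity pairs first.
    Queries left unmatched would start new tracks."""
    pairs = sorted(
        ((cosine(p, c), i, j)
         for i, p in enumerate(prev_queries)
         for j, c in enumerate(curr_queries)),
        reverse=True,
    )
    matches, used_prev, used_curr = {}, set(), set()
    for sim, i, j in pairs:
        if sim < threshold:
            break  # remaining pairs are too dissimilar to link
        if i in used_prev or j in used_curr:
            continue  # each query participates in at most one match
        matches[j] = i
        used_prev.add(i)
        used_curr.add(j)
    return matches  # maps current-frame index -> previous-frame index
```

Because the association operates purely on learned query embeddings, no explicit motion model or hand-tuned appearance heuristic is needed, which matches the abstract's claim in spirit.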