PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis

📅 2025-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address clinical needs for early polyp detection, precise segmentation, fine-grained classification, and cross-frame tracking in colonoscopy videos, this paper proposes the first end-to-end unified framework that requires neither task-specific fine-tuning nor medical-domain pretraining. Methodologically: (1) we introduce a conditional mask loss that accommodates multi-format annotations and enables joint optimization across tasks; (2) we design an object-query-based unsupervised tracking module, eliminating heuristic design and explicit motion modeling; (3) we directly leverage natural-image pretrained Vision Transformers (ViTs) as backbones, enabling effective cross-domain transfer. Evaluated on multiple mainstream polyp benchmarks, our method achieves state-of-the-art performance across all four tasks—detection, segmentation, classification, and tracking—demonstrating superior robustness and generalizability for clinical decision support.
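The conditional mask loss described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the Dice/L1 loss choices, and the sample format are all assumptions. The key idea it demonstrates is that the mask term is applied only to samples that actually carry pixel-level masks, so box-only datasets can be mixed into the same training run.

```python
# Hypothetical sketch of a conditional mask loss: the mask term is only
# computed for samples with pixel-level annotations, so datasets with
# box-only labels can be jointly optimized. Names are illustrative.

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between two flat mask lists of equal length."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def l1_box_loss(pred_box, gt_box):
    """Mean absolute error over (x1, y1, x2, y2) box coordinates."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / 4.0

def conditional_loss(sample):
    """Box loss is always applied; the mask loss term is added only
    when the sample carries a pixel-level mask annotation."""
    loss = l1_box_loss(sample["pred_box"], sample["gt_box"])
    if sample.get("gt_mask") is not None:  # condition on annotation format
        loss += dice_loss(sample["pred_mask"], sample["gt_mask"])
    return loss
```

In a real detection-segmentation framework this per-sample switch would sit inside the matcher/criterion, but the conditioning logic is the same: the loss adapts to whatever annotation format each dataset provides.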

📝 Abstract
Early detection, accurate segmentation, classification, and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Many existing deep-learning-based methods for analyzing colonoscopic videos either require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification, and unsupervised tracking in colonoscopic videos. Our approach leverages a novel conditional mask loss, enabling flexible training across datasets with either pixel-level segmentation masks or bounding box annotations and allowing us to bypass task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We use a robust vision foundation model backbone pre-trained without supervision on natural images, removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.
Problem

Research questions and friction points this paper is trying to address.

Detect and segment polyps in colonoscopy videos accurately
Classify and track polyps without task-specific fine-tuning
Eliminate need for domain-specific pre-training in polyp analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified model for polyp detection, segmentation, classification, tracking
Conditional mask loss for flexible training across datasets
Unsupervised tracking using object queries without heuristics
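The last innovation, heuristic-free tracking via object queries, can be sketched roughly as follows. This is a hypothetical illustration, not the paper's method: the greedy cosine-similarity matching and the threshold value are assumptions. It shows the core idea that instance identity can be carried by query embeddings alone, with no motion model or hand-crafted rules.

```python
# Illustrative sketch: associate detections across frames by comparing
# their object-query embeddings. Each current-frame query is matched to
# the most similar unmatched query from the previous frame; below a
# similarity threshold, it starts a new track. All details are assumed.
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def associate(prev_queries, curr_queries, threshold=0.5):
    """Greedily match current queries to previous ones by similarity.

    Returns, for each current query, the index of its matched previous
    query, or None if it is treated as a new track."""
    matches, used = [], set()
    for q in curr_queries:
        scores = [(cosine(q, p), i)
                  for i, p in enumerate(prev_queries) if i not in used]
        scores.sort(reverse=True)
        if scores and scores[0][0] >= threshold:
            matches.append(scores[0][1])
            used.add(scores[0][1])
        else:
            matches.append(None)
    return matches
```

A production system would typically use optimal bipartite matching (e.g., the Hungarian algorithm) rather than this greedy loop, but the principle is the same: identity is read off the learned query embeddings rather than from explicit motion or appearance heuristics.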