🤖 AI Summary
To address clinical needs for early polyp detection, precise segmentation, fine-grained classification, and cross-frame tracking in colonoscopy videos, this paper proposes the first end-to-end unified framework that requires neither task-specific fine-tuning nor medical-domain pretraining. Methodologically: (1) we introduce a conditional mask loss that accommodates multi-format annotations and enables joint optimization across tasks; (2) we design an object-query-based unsupervised tracking module, eliminating heuristic design and explicit motion modeling; (3) we directly leverage natural-image pretrained Vision Transformers (ViTs) as backbones, enabling effective cross-domain transfer. Evaluated on multiple mainstream polyp benchmarks, our method achieves state-of-the-art performance across all four tasks—detection, segmentation, classification, and tracking—demonstrating superior robustness and generalizability for clinical decision support.
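The conditional mask loss is described here only at a high level: the loss term applied to each training sample depends on which annotation format that sample carries. The paper's actual formulation is not given in this summary, so the following is a toy sketch of the general idea in plain Python; the function names, the Dice/L1 choice of loss terms, and the dictionary-based sample format are all illustrative assumptions.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between a flat list of predicted probabilities
    and a flat binary ground-truth mask (illustrative choice of mask loss)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def box_l1_loss(pred_box, gt_box):
    """Mean absolute error over (x, y, w, h) box coordinates
    (illustrative choice of box loss)."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / 4.0

def conditional_loss(pred, annotation):
    """Pick the supervision term that matches the annotation format:
    a mask term when pixel-level labels exist, a box term otherwise.
    This conditional switch is the core idea behind training one model
    on datasets with heterogeneous annotations."""
    if annotation.get("mask") is not None:
        return dice_loss(pred["mask"], annotation["mask"])
    return box_l1_loss(pred["box"], annotation["box"])
```

For example, a sample from a segmentation dataset would contribute a Dice term, while a sample from a detection-only dataset would contribute only a box-regression term, so both can appear in the same training batch.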
📝 Abstract
Early detection, accurate segmentation, classification, and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Existing deep-learning methods for analyzing colonoscopic videos typically require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification, and unsupervised tracking in colonoscopic videos. Our approach leverages a novel conditional mask loss that enables flexible training across datasets annotated with either pixel-level segmentation masks or bounding boxes, bypassing task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We use a robust vision foundation model backbone pre-trained without supervision on natural images, removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.
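The abstract states that tracking associates polyp instances across frames via object queries without heuristics, but does not spell out the matching procedure. One common way to realize query-based association is to compare per-instance query embeddings between consecutive frames and link the most similar pairs. The sketch below illustrates that general pattern in plain Python; the greedy matching, the cosine-similarity metric, and the threshold value are assumptions for illustration, not the paper's specified method.

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv + 1e-12)

def associate(prev_queries, curr_queries, threshold=0.5):
    """Greedily link current-frame object queries to previous-frame
    queries by embedding similarity, highest-similarity pairs first.
    Queries left unmatched would start new tracks."""
    pairs = sorted(
        ((cosine(p, c), i, j)
         for i, p in enumerate(prev_queries)
         for j, c in enumerate(curr_queries)),
        reverse=True,
    )
    matches, used_prev, used_curr = {}, set(), set()
    for sim, i, j in pairs:
        if sim < threshold:
            break  # remaining pairs are too dissimilar to link
        if i in used_prev or j in used_curr:
            continue  # each query participates in at most one match
        matches[j] = i
        used_prev.add(i)
        used_curr.add(j)
    return matches  # maps current-frame index -> previous-frame index
```

Because the association operates purely on learned query embeddings, no explicit motion model or hand-tuned appearance heuristic is needed, which matches the abstract's claim in spirit.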