🤖 AI Summary
Existing video polyp segmentation (VPS) methods struggle to balance spatiotemporal modeling with domain generalization, and SAM2 accumulates errors during long-term colonoscopy tracking, which degrades segmentation stability. To address both issues, we propose FreeVPS, a training-free framework that recasts VPS as a track-by-detect task: an image polyp segmentation (IPS) model supplies spatial context, and SAM2 supplies temporal modeling. Two training-free modules stabilize SAM2. Intra-association filtering removes spatial inaccuracies that originate in the detection stage, suppressing false positives. Inter-association refinement adaptively updates the memory bank to prevent error propagation, mitigating the snowball effect and improving temporal coherence. Together, the modules achieve cutting-edge performance in both in-domain and out-of-domain scenarios. FreeVPS also tracks robustly in long, untrimmed colonoscopy videos, which points to its potential for reliable clinical analysis.
📝 Abstract
Existing video polyp segmentation (VPS) paradigms usually struggle to balance spatiotemporal modeling and domain generalization, limiting their applicability in real clinical scenarios. To address this challenge, we recast the VPS task as a track-by-detect paradigm that leverages the spatial contexts captured by an image polyp segmentation (IPS) model while integrating the temporal modeling capabilities of Segment Anything Model 2 (SAM2). However, during long-term polyp tracking in colonoscopy videos, SAM2 suffers from error accumulation, resulting in a snowball effect that compromises segmentation stability. We mitigate this issue with FreeVPS, which repurposes SAM2 as a video polyp segmenter through two training-free modules. The intra-association filtering module eliminates spatial inaccuracies originating from the detection stage, reducing false positives. The inter-association refinement module adaptively updates the memory bank to prevent error propagation over time, enhancing temporal coherence. The two modules work together to stabilize SAM2, achieving cutting-edge performance in both in-domain and out-of-domain scenarios. Furthermore, we demonstrate the robust tracking capabilities of FreeVPS in long, untrimmed colonoscopy videos, underscoring its potential for reliable clinical analysis.
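The abstract does not give the rules used by the two modules, so the following is only a minimal sketch of how a track-by-detect loop with detection filtering and gated memory updates could be organized. Everything in it is an assumption made for illustration: the IoU-based agreement test, the thresholds `filter_iou` and `memory_iou`, the `capacity` limit, and the `detect` and `propagate` callables, which are hypothetical stand-ins for an IPS model and for SAM2 propagation rather than real APIs.

```python
from typing import Callable, List, Set, Tuple

# A mask is represented as the set of its foreground (row, col) pixels.
Mask = Set[Tuple[int, int]]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union; two empty masks count as identical."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def track_by_detect(
    frames: List[object],
    detect: Callable[[object], Mask],                 # hypothetical IPS stand-in
    propagate: Callable[[object, List[Mask]], Mask],  # hypothetical SAM2 stand-in
    filter_iou: float = 0.3,   # assumed threshold, not from the paper
    memory_iou: float = 0.6,   # assumed threshold, not from the paper
    capacity: int = 4,         # assumed memory size (must be positive)
) -> List[Mask]:
    memory: List[Mask] = []
    outputs: List[Mask] = []
    for frame in frames:
        detected = detect(frame)
        if not memory:
            # Bootstrap: the first non-empty detection seeds the memory.
            outputs.append(detected)
            if detected:
                memory.append(detected)
            continue
        tracked = propagate(frame, memory)
        agreement = iou(detected, tracked)
        # Filtering stand-in: keep a detection only if it agrees with the
        # tracker; otherwise treat it as a false positive and fall back.
        output = detected if agreement >= filter_iou else tracked
        outputs.append(output)
        # Memory-refinement stand-in: store only high-agreement frames so
        # that doubtful masks cannot propagate into later frames.
        if output and agreement >= memory_iou:
            memory.append(output)
            del memory[:-capacity]  # keep the most recent `capacity` entries
    return outputs
```

Gating memory writes on agreement is one simple way to limit the snowball effect described above: a frame on which the detector and tracker disagree still produces an output, but it never becomes a reference for later frames.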