Temporal Cluster Assignment for Efficient Real-Time Video Segmentation

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Swin Transformers incur substantial computational overhead in real-time video segmentation, and existing pruning and training-free token clustering methods struggle to simultaneously satisfy windowed attention constraints and exploit temporal redundancy. Method: This paper proposes a fine-tuning-free, temporally aware token clustering method that explicitly models inter-frame temporal consistency within a training-free clustering framework. A novel temporal cluster assignment strategy dynamically aggregates semantically similar tokens across frames while strictly preserving Swin's intrinsic windowed attention structure. The method is fully architecture-compatible, requiring no weight modification or retraining. Results: Evaluated on multiple public video datasets and a private surgical video dataset, the approach achieves 38–52% faster inference than baseline methods with negligible accuracy degradation (mIoU drop < 0.6%), a strong trade-off between efficiency and segmentation accuracy.

📝 Abstract
Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. Among them, the Swin Transformer stands out for its ability to capture hierarchical, multi-scale representations, making it a popular backbone for video segmentation. However, despite its window-attention scheme, it still incurs a high computational cost, especially in the larger variants commonly used for dense prediction in videos. This remains a major bottleneck for real-time, resource-constrained applications. Whilst token reduction methods have been proposed to alleviate this, the window-based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques. Meanwhile, training-free token clustering approaches have shown promise in image segmentation while maintaining window consistency. Nevertheless, they fail to exploit temporal redundancy, missing a key opportunity to further optimize video segmentation performance. We introduce Temporal Cluster Assignment (TCA), a lightweight, effective, and fine-tuning-free strategy that enhances token clustering by leveraging temporal coherence across frames. Instead of indiscriminately dropping redundant tokens, TCA refines token clusters using temporal correlations, thereby retaining fine-grained details while significantly reducing computation. Extensive evaluations on YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and a private surgical video dataset show that TCA consistently boosts the accuracy-speed trade-off of existing clustering-based methods. Our results demonstrate that TCA generalizes effectively across both natural and domain-specific videos.
Problem

Research questions and friction points this paper is trying to address.

High computational cost in Swin Transformer for video segmentation
Limited applicability of token reduction methods in window-based attention
Inefficient exploitation of temporal redundancy in video token clustering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages temporal coherence for token clustering
Refines clusters using temporal correlations
Maintains fine-grained details while reducing computation
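The paper's exact clustering procedure is not reproduced here, so the ideas above can only be illustrated schematically. The following is a minimal NumPy sketch, assuming a k-means-style assignment within each attention window (a fixed cluster count keeps the token count per window constant, as Swin's windowed attention requires) with centroids warm-started from the previous frame to exploit temporal coherence. All function names, parameters, and the choice of cosine similarity are hypothetical, not taken from the paper:

```python
import numpy as np

def cluster_window_tokens(tokens, init_centroids, n_iters=3):
    """Assign each token in one window to its nearest centroid by cosine
    similarity, then update centroids as the mean of their members.
    tokens: (num_tokens, dim); init_centroids: (n_clusters, dim)."""
    centroids = init_centroids.copy()
    for _ in range(n_iters):
        t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        assign = np.argmax(t @ c.T, axis=1)  # nearest centroid per token
        for k in range(centroids.shape[0]):
            members = tokens[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return assign, centroids

def temporal_cluster_assignment(frames, n_clusters=16, seed=0):
    """Cluster the tokens of one spatial window across a video clip,
    warm-starting each frame's centroids from the previous frame so that
    cluster identities stay temporally coherent. frames is a list of
    (num_tokens, dim) arrays, one per frame."""
    rng = np.random.default_rng(seed)
    # Frame 0: a random token subset seeds the centroids.
    idx = rng.choice(frames[0].shape[0], n_clusters, replace=False)
    centroids = frames[0][idx].copy()
    results = []
    for tokens in frames:
        assign, centroids = cluster_window_tokens(tokens, centroids)
        results.append((assign, centroids.copy()))
    return results
```

Because later frames inherit centroids instead of re-initializing them, tokens that are redundant across frames fall into the same clusters with little extra work, which is the intuition behind refining clusters via temporal correlations rather than dropping tokens outright.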