🤖 AI Summary
This work addresses the high computational cost of Vision Transformers (ViTs) in video instance segmentation, where existing pruning methods predominantly focus on token compression in deep layers and overlook the sparsification potential in early stages. To overcome this limitation, the authors propose a Video Patch Pruning (VPP) framework that introduces aggressive patch pruning at the earliest layers of ViTs for the first time. VPP leverages a differentiable temporal mapping module and a foreground-aware token selection mechanism to exploit temporal priors from deep features, guiding the retention of critical patches in early layers. The method surpasses the conventional ~30% sparsity ceiling of image-based pruning, achieving up to 60% patch pruning on YouTube-VIS 2021 with a performance drop of no more than 0.6%, significantly outperforming current state-of-the-art approaches.
📝 Abstract
Vision Transformers (ViTs) have demonstrated state-of-the-art performance on several benchmarks, yet their high computational cost hinders their practical deployment. Patch pruning offers significant savings, but existing approaches restrict token reduction to deeper layers, leaving early-stage compression unexplored and limiting their potential for holistic efficiency. In this work, we present a novel Video Patch Pruning (VPP) framework that integrates temporal prior knowledge to enable efficient sparsity within early ViT layers. Our approach is motivated by the observation that features extracted from deeper layers exhibit strong foreground selectivity. We therefore propose a fully differentiable temporal mapping module that accurately selects the most relevant patches in early network stages. Notably, the proposed method enables a patch reduction of up to 60% in dense prediction tasks, exceeding conventional image-based patch pruning, which typically operates at around 30% patch sparsity. VPP excels in the high-sparsity regime, sustaining strong performance even when patch usage is reduced below 55%. Specifically, it preserves stable results with a maximal performance drop of 0.6% on the YouTube-VIS 2021 dataset.
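The core idea of guiding early-layer patch retention with foreground scores derived from deep features of a previous frame can be sketched as follows. This is a minimal illustration only: the function name, the hard top-k selection, and the per-patch score array are hypothetical stand-ins, whereas the paper's actual temporal mapping module is learned and fully differentiable (a hard `argsort`/top-k, as used here, is not).

```python
import numpy as np

def select_patches(prev_deep_scores, keep_ratio=0.4):
    """Keep the top-`keep_ratio` fraction of patches, ranked by a
    foreground score derived from deep features of the previous frame.
    Returns sorted indices of the retained patches.

    Hypothetical sketch: the real VPP module is differentiable,
    not a hard top-k cut like this one.
    """
    n = prev_deep_scores.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    # Rank patches by foreground score and retain the k highest,
    # biasing early-layer computation toward likely foreground regions.
    keep_idx = np.argsort(prev_deep_scores)[::-1][:k]
    return np.sort(keep_idx)

# Toy example: 16 patches, foreground scores peaked on patches 5-8.
scores = np.zeros(16)
scores[5:9] = [0.9, 0.8, 0.95, 0.7]
kept = select_patches(scores, keep_ratio=0.25)
print(kept)  # [5 6 7 8]
```

With `keep_ratio=0.25`, only 4 of 16 patches survive into the early layers; at the paper's reported operating point, roughly 40-45% of patches would be kept (up to 60% pruned).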