🤖 AI Summary
In video temporal grounding (VTG), existing methods rely on frozen top-layer features of large pre-trained models, limiting generalization; meanwhile, mainstream side-tuning approaches neglect the inherent sparsity of moment retrieval (MR). This paper proposes SDST, the first anchor-free side-tuning framework for VTG, built on a sparse-dense collaborative architecture. SDST introduces reference-based deformable self-attention to strengthen temporal context modeling and achieves, for the first time, efficient adaptation of the InternVideo2 backbone under the side-tuning paradigm. By combining parameter-efficient fine-tuning with sparse-dense feature fusion, SDST attains state-of-the-art or highly competitive performance on QVHighlights, TACoS, and Charades-STA, reducing trainable parameters by up to 73% while preserving computational efficiency and cross-domain adaptability.
📝 Abstract
Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning -- and particularly side-tuning (ST) -- has emerged as an effective alternative. However, prior ST work approaches this problem from a frame-level refinement perspective, overlooking the inherent sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of deformable attention -- a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of the InternVideo2 backbone into an ST framework, showing its profound implications for performance. Overall, our method significantly improves over existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% w.r.t. existing SOTA methods. The code is publicly accessible at https://github.com/davidpujol/SDST.