Sparse-Dense Side-Tuner for Efficient Video Temporal Grounding

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
In video temporal grounding (VTG), existing methods rely on frozen top-layer features of large models, limiting generalization; meanwhile, mainstream side-tuning approaches neglect the inherent sparsity of moment retrieval (MR). This paper proposes SDST, the first anchor-free side-tuning framework for VTG, featuring a novel sparse-dense collaborative architecture. SDST introduces reference-based deformable self-attention to enhance temporal contextual modeling and achieves, for the first time, efficient adaptation of InternVideo2 under the side-tuning paradigm. By combining parameter-efficient fine-tuning with sparse-dense feature fusion, SDST attains state-of-the-art or highly competitive performance on QVHighlights, TACoS, and Charades-STA—reducing trainable parameters by up to 73% while preserving accuracy, computational efficiency, and cross-domain adaptability.

📝 Abstract
Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning -- and particularly side-tuning (ST) -- has emerged as an effective alternative. However, prior ST methods approach this problem from a frame-level refinement perspective, overlooking the inherently sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of deformable attention -- a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of the InternVideo2 backbone into an ST framework, demonstrating its profound impact on performance. Overall, our method significantly improves on existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% w.r.t. existing SOTA methods. The code is publicly accessible at https://github.com/davidpujol/SDST.
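To make the deformable-attention idea behind the paper concrete, here is a minimal temporal (1-D) sketch: each query holds a reference location, predicts a few sampling offsets around it, and aggregates linearly interpolated frame features with learned weights. This is a hypothetical illustration, not the paper's implementation — the function name, shapes, and interpolation scheme are assumptions, and SDST's Reference-based Deformable Self-Attention adds context modeling beyond this basic form.

```python
import numpy as np

def deformable_attention_1d(features, references, offsets, weights):
    """Sketch of 1-D (temporal) deformable attention.

    features:   (T, D) dense per-frame features
    references: (Q,)   reference locations in [0, 1], one per query
    offsets:    (Q, K) learned offsets per query, K sampling points
    weights:    (Q, K) attention weights (assumed normalized per query)
    Returns:    (Q, D) aggregated features
    """
    T, _ = features.shape
    # absolute sampling positions, clipped to the valid temporal range
    pos = np.clip((references[:, None] + offsets) * (T - 1), 0, T - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = pos - lo
    # linear interpolation of features at fractional temporal positions
    sampled = (1 - frac)[..., None] * features[lo] + frac[..., None] * features[hi]
    # weighted sum over the K sampling points
    return (weights[..., None] * sampled).sum(axis=1)
```

With zero offsets and uniform weights, each query simply reads the feature at its reference frame; learning shifts the offsets so sparse queries attend to the most relevant moments rather than every frame.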
Problem

Research questions and friction points this paper is trying to address.

Improves Video Temporal Grounding with sparse-dense side-tuning
Enhances context modeling via deformable self-attention mechanism
Reduces parameter count while maintaining competitive performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse-Dense Side-Tuner for anchor-free VTG
Reference-based Deformable Self-Attention mechanism
Efficient integration of the InternVideo2 backbone