A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the semantic gap between frozen classification-pretrained visual backbones and the temporal sentence localization objective in video grounding. To bridge this gap, the authors propose a fully end-to-end training paradigm that jointly optimizes a learnable visual backbone and a localization head. They introduce a lightweight Sentence Conditioned Adapter (SCADA) module that dynamically modulates visual representations using sentence features while updating only a small portion of backbone parameters. The study provides a systematic validation of end-to-end training efficacy across model scales, achieving state-of-the-art performance on two mainstream benchmarks. Notably, the approach enables deployment of deeper backbone architectures with significantly reduced GPU memory consumption.
📝 Abstract
Temporal sentence grounding in videos (TSGV) aims to localize the temporal segment of an untrimmed video that semantically corresponds to a sentence query. Most current methods adopt pre-trained, query-agnostic visual encoders for offline feature extraction, so the video backbones are frozen and never optimized for TSGV. This creates a task discrepancy: the video backbone is trained for visual classification but used for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and the localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. We then introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to adaptively train a small portion of the video backbone parameters. SCADA enables the deployment of deeper backbones with reduced memory and significantly enhances visual representations by modulating feature maps through precise integration of linguistic embeddings. Experiments on two benchmarks show that our method outperforms state-of-the-art approaches. The code and models will be released.
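The abstract describes SCADA as modulating backbone feature maps with sentence features while training only a small fraction of parameters. The paper's exact design is not given here, but the idea can be sketched as a FiLM-style conditioning layer: the sentence embedding produces a per-channel scale and shift applied to the visual features, followed by a small residual bottleneck. All names, shapes, and the FiLM formulation below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)


class SentenceConditionedAdapterSketch:
    """Hypothetical FiLM-style adapter: the sentence embedding yields
    scale (gamma) and shift (beta) that modulate frozen visual features;
    only the small bottleneck + conditioning weights would be trained."""

    def __init__(self, d_vis, d_txt, d_bottleneck=32):
        # illustrative parameter shapes, not from the paper
        self.w_gamma = rng.normal(0, 0.02, (d_txt, d_vis))
        self.b_gamma = np.ones(d_vis)   # init near identity scale
        self.w_beta = rng.normal(0, 0.02, (d_txt, d_vis))
        self.b_beta = np.zeros(d_vis)   # init near zero shift
        self.w_down = rng.normal(0, 0.02, (d_vis, d_bottleneck))
        self.w_up = rng.normal(0, 0.02, (d_bottleneck, d_vis))

    def __call__(self, vis_feats, sent_emb):
        # vis_feats: (T, d_vis) per-clip features; sent_emb: (d_txt,)
        gamma = sent_emb @ self.w_gamma + self.b_gamma
        beta = sent_emb @ self.w_beta + self.b_beta
        modulated = gamma * vis_feats + beta        # sentence-conditioned FiLM
        h = np.maximum(0, modulated @ self.w_down)  # bottleneck down-proj + ReLU
        return vis_feats + h @ self.w_up            # residual up-projection


adapter = SentenceConditionedAdapterSketch(d_vis=64, d_txt=16)
out = adapter(rng.normal(size=(8, 64)), rng.normal(size=16))
print(out.shape)  # (8, 64)
```

The residual form keeps the adapter cheap: the frozen backbone features pass through unchanged at initialization, and gradients only need to reach the small adapter weights, which is consistent with the abstract's claim of reduced memory for deeper backbones.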
Problem

Research questions and friction points this paper is trying to address.

Temporal Sentence Grounding
Task Discrepancy
Video Backbone
End-to-End Training
Query-Agnostic Encoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-End Training
Temporal Sentence Grounding
Video Backbone Optimization
Sentence Conditioned Adapter
Multimodal Representation