🤖 AI Summary
This work addresses temporal sentence grounding in videos, aiming to balance high performance with low annotation cost. To overcome the limitations of existing fully supervised methods (high annotation burden) and weakly supervised alternatives (suboptimal performance), we propose a *partially supervised* setting—requiring only short segment-level annotations—and a two-stage implicit-to-explicit progressive localization framework. Our key contributions are: (1) a quadruple contrastive learning mechanism, jointly optimizing event-query alignment, event-background separation, intra-cluster compactness, and inter-cluster separability for fine-grained cross-modal alignment; and (2) a synergistic paradigm integrating implicit representation distillation with explicit pseudo-label refinement, enabling end-to-end two-stage training. Experiments on Charades-STA and ActivityNet Captions demonstrate that our method significantly outperforms weakly supervised baselines while approaching fully supervised performance, validating the effectiveness and practicality of partial supervision.
📝 Abstract
Temporal sentence grounding aims to localize the timestamps of the event described by a natural language query in a given untrimmed video. The existing fully-supervised setting achieves strong results but incurs expensive annotation costs, while the weakly-supervised setting adopts cheap labels but performs poorly. To pursue high performance at lower annotation cost, this paper introduces an intermediate partially-supervised setting, i.e., only a short clip of each event is annotated during training. To make full use of partial labels, we design a contrast-unity framework with the two-stage goal of implicit-explicit progressive grounding. In the implicit stage, we align event-query representations at fine granularity via comprehensive quadruple contrastive learning: event-query gathering, event-background separation, intra-cluster compactness, and inter-cluster separability. The resulting high-quality representations then yield reliable grounding pseudo-labels. In the explicit stage, to explicitly optimize the grounding objective, we train a fully-supervised model on the obtained pseudo-labels for grounding refinement and denoising. Extensive experiments and thorough ablations on Charades-STA and ActivityNet Captions demonstrate the significance of partial supervision as well as the superior performance of our method.
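The quadruple contrastive objective combines four terms over event, query, and background representations. The abstract does not give the exact formulation, so the sketch below is purely illustrative: the choice of InfoNCE for event-query gathering, the hinge margins, the temperature, and the equal weighting of the four terms are all assumptions, not the paper's actual losses.

```python
import numpy as np

def l2norm(x):
    """Row-normalize embeddings so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def quadruple_contrastive_loss(events, queries, backgrounds, cluster_ids,
                               tau=0.1, margin=0.2):
    """Illustrative four-term contrastive loss (hypothetical formulation).

    events, queries, backgrounds: (n, d) paired embeddings per video.
    cluster_ids: (n,) integer cluster assignment of each event.
    """
    E, Q, B = l2norm(events), l2norm(queries), l2norm(backgrounds)
    n = len(E)

    # (1) Event-query gathering: InfoNCE pulling each event toward its own
    # query, with the other in-batch queries as negatives.
    sim = E @ Q.T / tau
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    l_gather = -np.mean(np.diag(log_probs))

    # (2) Event-background separation: push the event embedding's cosine
    # similarity to its own video's background below a margin.
    l_sep = np.mean(np.maximum(0.0, np.sum(E * B, axis=1) - margin))

    # (3) Intra-cluster compactness: pull each event toward its cluster centroid.
    uniq = np.unique(cluster_ids)
    idx = {c: k for k, c in enumerate(uniq)}
    centroids = l2norm(np.stack([E[cluster_ids == c].mean(axis=0) for c in uniq]))
    l_intra = np.mean([1.0 - E[i] @ centroids[idx[cluster_ids[i]]] for i in range(n)])

    # (4) Inter-cluster separability: push distinct centroids apart.
    if len(uniq) > 1:
        csim = centroids @ centroids.T
        off_diag = csim[~np.eye(len(uniq), dtype=bool)]
        l_inter = np.mean(np.maximum(0.0, off_diag - margin))
    else:
        l_inter = 0.0

    return float(l_gather + l_sep + l_intra + l_inter)

rng = np.random.default_rng(0)
events = rng.normal(size=(8, 16))
queries = events + 0.1 * rng.normal(size=(8, 16))   # queries roughly aligned with events
backgrounds = rng.normal(size=(8, 16))
cluster_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3])
loss = quadruple_contrastive_loss(events, queries, backgrounds, cluster_ids)
```

In this toy setup the four scalar terms are simply summed; in practice each would carry its own weight, and the representations would come from the video and text encoders rather than random vectors.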