AI Summary
This paper addresses weakly supervised spatio-temporal video grounding (WSTVG): localizing target objects in videos using only natural language queries, without bounding box annotations. To tackle challenges in modeling compositional actions and complex scenes, we propose a dual curriculum learning framework: (i) Sub-action Temporal Curriculum Learning (SA-TCL) to hierarchically model action structure over time, and (ii) Congestion-Guided Spatial Curriculum Learning (CG-SCL) to mitigate spatial ambiguity in crowded scenes. Furthermore, we introduce the Tubelet Referral Grounding (TRG) module, leveraging vision-language foundation models for fine-grained, tubelet-level referring localization. Our method achieves state-of-the-art performance on VidSTG-Declarative (+1.0% mAP) and HCSTVG-v1 (+3.0% mAP), marking the first approach to achieve high-accuracy joint spatio-temporal localization under purely query-only supervision.
Abstract
In this work, we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), the challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scenes. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and (2) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by spatially increasing task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1.
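The idea of "spatially increasing task difficulty" can be illustrated with a generic staged curriculum. The sketch below is not the STPro implementation: it assumes that scene congestion can be approximated by the number of candidate tubelets per video, and the names `Sample`, `num_tubelets`, `curriculum_stages`, and the threshold values are all hypothetical. Each stage admits every sample whose congestion is at or below that stage's limit, so the training pool grows from sparse to crowded scenes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    video_id: str
    # Hypothetical congestion proxy: number of candidate tubelets in the video.
    num_tubelets: int


def curriculum_stages(samples: List[Sample], thresholds: List[int]) -> List[List[Sample]]:
    """Build one training pool per stage.

    Stage k keeps the samples whose congestion is at most the k-th smallest
    threshold, so later stages are supersets of earlier ones.
    """
    stages = []
    for limit in sorted(thresholds):
        stages.append([s for s in samples if s.num_tubelets <= limit])
    return stages


# Toy data: four videos with increasing and mixed congestion levels.
samples = [Sample("a", 2), Sample("b", 5), Sample("c", 9), Sample("d", 3)]
stages = curriculum_stages(samples, thresholds=[3, 6, 10])
stage_ids = [[s.video_id for s in pool] for pool in stages]
```

A training loop would iterate over `stages` in order, training for some number of epochs on each pool before moving to the next. How the actual method measures congestion and schedules the stages is not specified in the abstract.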