STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding

2025-02-28
Citations: 0
Influential: 0

AI Summary
This paper addresses weakly supervised spatio-temporal video grounding (WSTVG): localizing target objects in videos using only natural language queries, without bounding box annotations. To handle compositional actions and complex scenes, it proposes a dual curriculum learning framework: (i) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and (ii) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to crowded scenes by gradually increasing spatial difficulty. It also introduces the Tubelet Referral Grounding (TRG) module, which builds on vision-language foundation models to connect textual queries to tubelets. The method reports state-of-the-art results, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1.

Abstract
In this work we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and (2) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by spatially increasing task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1.
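The abstract describes TRG only as connecting textual queries to tubelets. The following is a minimal sketch of that general idea, not the paper's implementation: it assumes each tubelet already has an embedding in the same space as the query, and it picks the tubelet with the highest cosine similarity. The names `cosine`, `ground_query`, and the dictionary keys `id`, `frames`, and `emb` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground_query(query_emb, tubelets):
    """Return (id, frame span, score) of the tubelet that best matches the query.

    Assumption: `tubelets` is a list of dicts with hypothetical keys
    "id", "frames" (start, end) and "emb" (embedding vector).
    """
    if not tubelets:
        raise ValueError("at least one tubelet is required")
    scored = [(cosine(query_emb, t["emb"]), t) for t in tubelets]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best["id"], best["frames"], best_score


# Toy data: two tubelets with made-up 2-D embeddings.
tubelets = [
    {"id": "person_a", "frames": (0, 10), "emb": [1.0, 0.0]},
    {"id": "person_b", "frames": (5, 20), "emb": [0.0, 1.0]},
]
best_id, best_frames, best_score = ground_query([0.9, 0.1], tubelets)
print(best_id, best_frames)
```

In this sketch the selected tubelet supplies both the spatial track and a frame span, which is one simple way to turn a query-to-tubelet match into a spatio-temporal prediction.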

Problem

Research questions and friction points this paper is trying to address.

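Localizing subjects in videos spatio-temporally from textual queries alone, with no bounding box supervision.
Simple adaptation of vision-language foundation models lacks essential spatio-temporal grounding abilities; Tubelet Referral Grounding (TRG) is introduced to bridge this gap.
TRG alone struggles with compositional action understanding and dense scenes, which the STPro framework targets.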

Innovation

Methods, ideas, or system contributions that make the work stand out.

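Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions.
Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding.
Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by spatially increasing task difficulty.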
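The curriculum idea behind the spatial module can be illustrated with a generic easy-to-hard schedule. This is a sketch under stated assumptions, not the paper's procedure: it assumes scene congestion can be approximated by the number of candidate tubelets per sample, and it exposes cumulatively larger, harder subsets at each stage. The function name `congestion_curriculum` and the key `num_tubelets` are hypothetical.

```python
import math


def congestion_curriculum(samples, num_stages):
    """Split samples into cumulative training stages, least congested first.

    Assumption: each sample is a dict with a hypothetical "num_tubelets"
    field used as a proxy for scene congestion.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # sorted() is stable, so equally congested samples keep their input order.
    ordered = sorted(samples, key=lambda s: s["num_tubelets"])
    stages = []
    for k in range(1, num_stages + 1):
        cutoff = math.ceil(len(ordered) * k / num_stages)
        stages.append(ordered[:cutoff])
    return stages


# Toy data: four videos with made-up tubelet counts.
samples = [
    {"video": "v1", "num_tubelets": 5},
    {"video": "v2", "num_tubelets": 1},
    {"video": "v3", "num_tubelets": 3},
    {"video": "v4", "num_tubelets": 2},
]
stages = congestion_curriculum(samples, num_stages=2)
for i, stage in enumerate(stages):
    print(i, [s["video"] for s in stage])
```

Each stage contains all samples from the previous one, so the model keeps seeing easy scenes while harder ones are introduced. A temporal curriculum over sub-actions could follow the same pattern with a different difficulty key.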