Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing STVG methods employ zero-initialized object queries and rely solely on multimodal interaction to localize targets, which degrades performance under occlusion or clutter because the queries lack target-specific cues. To address this, we propose the Target-Aware Transformer for STVG (TA-STVG), a framework built on two cascaded modules: text-guided temporal sampling (TTS) and attribute-aware spatial activation (ASA). TTS selects target-relevant keyframes conditioned on the holistic textual description, while ASA exploits fine-grained visual attribute responses of the object to enhance query specificity. Together, they generate semantically rich, spatiotemporally grounded object queries directly from the video-text pair. Evaluated on three mainstream benchmarks, TA-STVG achieves state-of-the-art performance, with notable robustness gains in complex scenes involving occlusion and background interference. Extensive ablations validate the effectiveness and generalizability of the target-aware query generation mechanism.

📝 Abstract
The Transformer has attracted increasing interest in STVG, owing to its end-to-end pipeline and promising results. Existing Transformer-based STVG approaches often leverage a set of object queries for spatial and temporal localization; these queries are initialized simply with zeros and gradually learn target position information through iterative interactions with multimodal features. Despite their simplicity, such zero-initialized queries lack target-specific cues and therefore struggle to learn discriminative target information from interactions with multimodal features in complicated scenarios (e.g., with distractors or occlusion), resulting in degradation. To address this, we introduce a novel Target-Aware Transformer for STVG (TA-STVG), which adaptively generates object queries by exploring target-specific cues from the given video-text pair to improve STVG. The key lies in two simple yet effective modules working in a cascade: text-guided temporal sampling (TTS) and attribute-aware spatial activation (ASA). The former selects target-relevant temporal cues from the video using holistic text information, while the latter further exploits fine-grained visual attribute information of the object from these target-aware temporal cues, which is then used for object query initialization. Compared to existing methods that leverage zero-initialized queries, the object queries in TA-STVG, generated directly from the given video-text pair, naturally carry target-specific cues, making them adaptive and better able to interact with multimodal features to learn more discriminative information. In experiments on three benchmarks, TA-STVG achieves state-of-the-art performance and significantly outperforms the baseline, validating its efficacy.
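The abstract describes a two-stage query-generation pipeline: text-guided temporal sampling picks target-relevant frames, then attribute-aware spatial activation distills a target-specific cue that initializes the object queries. A minimal PyTorch sketch of that idea is below; all module names, layer choices, and shapes here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TargetAwareQueryGenerator(nn.Module):
    """Hypothetical sketch of target-aware query generation (TTS + ASA),
    replacing zero-initialized object queries. Illustrative only."""

    def __init__(self, dim=256, num_queries=1, topk=4):
        super().__init__()
        self.topk = topk
        self.num_queries = num_queries
        self.dim = dim
        self.frame_scorer = nn.Linear(dim, 1)   # TTS: score frames against text
        self.attr_proj = nn.Linear(dim, dim)    # ASA: attribute-aware projection
        self.to_query = nn.Linear(dim, dim * num_queries)

    def forward(self, frame_feats, text_feat):
        # frame_feats: (T, dim) per-frame visual features
        # text_feat:   (dim,)  pooled sentence feature
        # --- Text-guided temporal sampling (TTS): keep the frames most
        #     relevant to the holistic text description.
        scores = self.frame_scorer(frame_feats * text_feat).squeeze(-1)   # (T,)
        keep = scores.topk(self.topk).indices
        sampled = frame_feats[keep]                                       # (k, dim)
        # --- Attribute-aware spatial activation (ASA): weight the sampled
        #     features by their attribute response to the text, then pool.
        attn = torch.softmax(self.attr_proj(sampled) @ text_feat, dim=0)  # (k,)
        target_cue = (attn.unsqueeze(-1) * sampled).sum(0)                # (dim,)
        # The target-specific cue initializes the object queries.
        return self.to_query(target_cue).view(self.num_queries, self.dim)
```

The point of the sketch is the contrast with the baseline: instead of `nn.Embedding` queries starting from zeros, the queries are a function of the video-text pair, so they carry target cues before any decoder interaction.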
Problem

Research questions and friction points this paper is trying to address.

Improves Spatio-Temporal Video Grounding
Generates target-aware object queries
Enhances discriminative information learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Target-Aware Transformer for STVG
Text-guided temporal sampling
Attribute-aware spatial activation