Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in spatio-temporal video grounding: entangled spatio-temporal alignment and dual-domain visual token redundancy. It proposes Bridge-STG, an end-to-end framework that decouples temporal and spatial localization so that each subtask can be optimized separately while semantic consistency is preserved. The core innovations are a Spatio-Temporal Semantic Bridging (STSB) mechanism, which closes the semantic gap introduced by decoupling, and a Query-Guided Spatial Localization (QGSL) module, which removes visual token redundancy in both the temporal and spatial domains. Combined with explicit temporal alignment, multi-layer interactive queries, positive/negative frame sampling, and end-to-end multi-task training, the method achieves state-of-the-art performance among MLLM-based methods on VidSTG, improving average m_vIoU from 26.4 to 34.3, and shows strong cross-task transfer.
📝 Abstract
Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Multimodal Large Language Models (MLLMs). We identify two core challenges: entangled spatio-temporal alignment, arising from coupling two heterogeneous sub-tasks within the same autoregressive output space, and dual-domain visual token redundancy, where target objects exhibit simultaneous temporal and spatial sparsity, rendering the overwhelming majority of visual tokens irrelevant to the grounding query. To address these, we propose Bridge-STG, an end-to-end framework that decouples temporal and spatial localization while maintaining semantic coherence. While decoupling is the natural solution to this entanglement, it risks creating a semantic gap between the temporal MLLM and the spatial decoder. Bridge-STG resolves this through two pivotal designs: the Spatio-Temporal Semantic Bridging (STSB) mechanism with Explicit Temporal Alignment (ETA) distills the MLLM's temporal reasoning context into enriched bridging queries as a robust semantic interface; and the Query-Guided Spatial Localization (QGSL) module leverages these queries to drive a purpose-built spatial decoder with multi-layer interactive queries and positive/negative frame sampling, jointly eliminating dual-domain visual token redundancy. Extensive experiments across multiple benchmarks demonstrate that Bridge-STG achieves state-of-the-art performance among MLLM-based methods. Bridge-STG improves average m_vIoU from 26.4 to 34.3 on VidSTG and demonstrates strong cross-task transfer across various fine-grained video understanding tasks under a unified multi-task training regime.
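The abstract reports m_vIoU without defining it. As a rough illustration, the sketch below assumes the definition commonly used for spatio-temporal grounding benchmarks: per-sample vIoU sums the box IoU over frames present in both the predicted and ground-truth tubes and divides by the number of frames in their union, and m_vIoU averages vIoU over samples. The function names and the frame-index-to-box dictionary layout are hypothetical, not taken from the paper.

```python
from typing import Dict, Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout
Tube = Dict[int, Box]                    # frame index -> box, assumed layout


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """Spatial IoU summed over shared frames, normalized by the frame union."""
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in shared_frames)
    return total / len(union_frames)


def m_v_iou(samples: Iterable[Tuple[Tube, Tube]]) -> float:
    """Mean vIoU over (prediction, ground truth) pairs."""
    scores = [v_iou(pred, gt) for pred, gt in samples]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition a prediction is penalized both for loose boxes (lower per-frame IoU) and for a wrong temporal extent (more frames in the union than in the intersection), which is why the metric couples the temporal and spatial sub-tasks that the framework decouples.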
Problem

Research questions and friction points this paper is trying to address.

Spatio-Temporal Video Grounding
Multimodal Large Language Models
Temporal-Spatial Alignment
Visual Token Redundancy
Video Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Spatio-Temporal Alignment
Spatio-Temporal Semantic Bridging
Query-Guided Spatial Localization
Multimodal Large Language Models
Video Grounding
Xuezhen Tu
Shanghai Jiao Tong University
Jingyu Wu
ZTE Corporation
Fangyu Kang
ZTE Corporation
Qingpeng Nong
ZTE Corporation
Kaijin Zhang
ZTE Corporation
Chaoyue Niu
Shanghai Jiao Tong University
Device-Cloud ML, On-Device Intelligence
Fan Wu
Professor, Department of Computer Science and Engineering, Shanghai Jiao Tong University
Wireless Networking, Mobile Computing, Algorithmic Game Theory and Its Applications