Agentic Spatio-Temporal Grounding via Collaborative Reasoning

📅 2026-02-10

🤖 AI Summary

This work addresses the challenge of text-query-driven spatiotemporal object localization in videos, a task often hindered by computational redundancy, reliance on strong supervision, and limited generalization. The authors propose ASTG, a novel framework that introduces, for the first time, a multi-agent collaboration mechanism to achieve end-to-end zero-shot localization in open-world settings without any training. ASTG employs a Spatial Reasoning Agent (SRA) and a Temporal Reasoning Agent (TRA), both built upon a multimodal large language model, which jointly leverage visual memory and dialogue context. Through a “propose-and-evaluate” paradigm, the agents decouple spatial and temporal reasoning to autonomously extract and verify target tubes. Experiments demonstrate that ASTG significantly outperforms existing weakly supervised and zero-shot methods on standard benchmarks, achieving performance comparable to certain fully supervised models.

📝 Abstract

Spatio-Temporal Video Grounding (STVG) aims to retrieve the spatio-temporal tube of a target object or person in a video given a text query. Most existing approaches perform frame-wise spatial localization within a predicted temporal span, resulting in redundant computation, heavy supervision requirements, and limited generalization. Weakly-supervised variants mitigate annotation costs but remain constrained by the dataset-level train-and-fit paradigm and deliver inferior performance. To address these challenges, we propose the Agentic Spatio-Temporal Grounder (ASTG) framework, which targets STVG in an open-world, training-free scenario. Specifically, two specialized agents, SRA (Spatial Reasoning Agent) and TRA (Temporal Reasoning Agent), built on modern Multimodal Large Language Models (MLLMs), work collaboratively to retrieve the target tube in an autonomous and self-guided manner. Following a propose-and-evaluate paradigm, ASTG decouples spatio-temporal reasoning and automates the tube extraction, verification, and temporal localization processes. With a dedicated visual memory and dialogue context, retrieval efficiency is significantly enhanced. Experiments on popular benchmarks demonstrate the superiority of the proposed approach: it outperforms existing weakly-supervised and zero-shot approaches by a margin and is comparable to some fully-supervised methods.
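The abstract gives no implementation details, so the following is only a minimal sketch of how a propose-and-evaluate loop with a shared visual memory and dialogue context might be organized. Every name here (`Tube`, `SharedContext`, `ground`, the `propose` and `evaluate` callables, `max_rounds`, `accept`) is hypothetical and chosen for illustration; the two callables stand in for the SRA and TRA, and in a real system each would wrap a call to an MLLM. None of this is taken from the authors' code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); format is an assumption


@dataclass
class Tube:
    """A spatio-temporal tube: one bounding box per frame index."""
    boxes: Dict[int, Box]

    @property
    def span(self) -> Tuple[int, int]:
        # First and last frame covered by the tube (raises on an empty tube).
        return min(self.boxes), max(self.boxes)


@dataclass
class SharedContext:
    """Hypothetical state shared by both agents across rounds."""
    visual_memory: Dict[int, List[Box]] = field(default_factory=dict)
    dialogue: List[str] = field(default_factory=list)


# Stand-in signatures for the two agents (assumptions, not the paper's API).
Proposer = Callable[[str, int, SharedContext], Optional[Tube]]
Evaluator = Callable[[str, Tube, SharedContext], Tuple[float, Tube]]


def ground(
    query: str,
    num_frames: int,
    propose: Proposer,
    evaluate: Evaluator,
    max_rounds: int = 3,
    accept: float = 0.5,
) -> Optional[Tube]:
    """Alternate proposal and evaluation until a candidate is accepted."""
    ctx = SharedContext()
    best: Optional[Tube] = None
    best_score = float("-inf")
    for round_idx in range(max_rounds):
        # Spatial step: propose a candidate tube for the query.
        candidate = propose(query, num_frames, ctx)
        if candidate is None:
            break
        # Temporal step: score the candidate and return a trimmed tube.
        score, trimmed = evaluate(query, candidate, ctx)
        ctx.dialogue.append(f"round {round_idx}: score={score:.2f}")
        if score > best_score:
            best, best_score = trimmed, score
        if score >= accept:
            break
    return best
```

The loop keeps the best-scoring tube seen so far and stops early once a candidate clears the `accept` threshold. Because the proposer reads `ctx`, it can avoid re-proposing candidates that were already rejected, which is one plausible way a shared memory could reduce redundant computation.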

Problem

Research questions and friction points this paper is trying to address.

Spatio-Temporal Video Grounding

weakly-supervised learning

generalization

annotation cost

redundant computation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Reasoning

Spatio-Temporal Video Grounding

Multimodal Large Language Models