1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video multimodal large language models (MLLMs) model bounding boxes as autoregressive text sequences, leading to verbose outputs, accumulated spatial errors, and localization drift over time. This work proposes a collaborative framework integrating a video LLM with an open-vocabulary detector. Its core innovations are: (1) a Reference-Semantic Token (RST) mechanism, which leverages the user query’s semantics both as a control signal and as a substitute for the detector's textual embeddings, enabling end-to-end referring understanding and grounding; and (2) Tube-mined Temporal Regularization (TTReg), which enforces temporal consistency of object trajectories across frames. By circumventing error-prone autoregressive coordinate generation, the method significantly improves spatiotemporal localization accuracy and enhances complex semantic reasoning—such as causal and sequential inference—on fine-grained video understanding benchmarks including STVG and GroundedVQA. Results validate the efficacy of co-modeling detection priors with large language models.

📝 Abstract
Spatio-temporal grounding and reasoning aims to locate the temporal segment and spatial region of an event in a video given a user query, while also reasoning about semantics such as causality, temporal order, and action relationships. To achieve this, current MLLMs primarily treat bounding boxes as text tokens and generate them autoregressively. However, such autoregressive spatial decoding leads to very long output sequences, causing spatial errors to accumulate over time and the localization results to progressively drift across a video. To address this, we present a Detector-Empowered Video LLM, DEViL for short, which couples a video LLM with an open-vocabulary detector (OVD). Specifically, the MLLM and detector are connected via a reference-semantic token (RST) that distills the user query into a rich semantic representation. Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding, enabling end-to-end learning of both referential understanding and spatial localization. Furthermore, we propose a tube-mined temporal regularization (TTReg) within the OVD, which drives the OVD to generate temporally consistent queries for target objects, thereby ensuring effective temporal association. Experiments demonstrate that DEViL achieves strong performance across various fine-grained video understanding tasks, particularly STVG and GroundedVQA. Code will be released at https://github.com/gaostar123/DeViL.
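The abstract describes the RST as both a control signal and a drop-in replacement for the OVD's text embedding: the hidden state of a special token is projected into the detector's text-embedding space, and regions are scored against it. The sketch below illustrates that coupling in numpy; all shapes, names (`rst_bridge`, `detector_scores`), and the single linear projection are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rst_bridge(llm_hidden, rst_positions, W):
    # Pick each sequence's <RST> token hidden state and project it into the
    # detector's text-embedding space (assumed here to be one linear map).
    rst = llm_hidden[np.arange(llm_hidden.shape[0]), rst_positions]  # (B, llm_dim)
    return rst @ W                                                   # (B, det_dim)

def detector_scores(region_feats, text_embed):
    # Open-vocabulary detectors commonly score region features by cosine
    # similarity against a text embedding; the projected RST slots into
    # that text-embedding role.
    r = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed, axis=-1, keepdims=True)
    return np.einsum("bnd,bd->bn", r, t)                             # (B, N)

# Toy sizes: batch 2, sequence length 10, LLM dim 64, detector dim 16, 5 regions.
B, T, llm_dim, det_dim, N = 2, 10, 64, 16, 5
W = rng.standard_normal((llm_dim, det_dim)) * 0.02
hidden = rng.standard_normal((B, T, llm_dim))     # dummy LLM hidden states
positions = np.array([3, 7])                      # where <RST> sits per sequence
embed = rst_bridge(hidden, positions, W)
scores = detector_scores(rng.standard_normal((B, N, det_dim)), embed)
print(embed.shape, scores.shape)  # (2, 16) (2, 5)
```

Because the projected token feeds the detector directly, gradients from the localization loss can flow back into the LLM, which is what makes the referring-and-grounding pipeline end-to-end trainable.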
Problem

Research questions and friction points this paper is trying to address.

Autoregressive spatial decoding causes error accumulation in video localization
Existing methods struggle with temporal consistency in object tracking
Current models inefficiently handle spatio-temporal grounding and reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Couples Video LLM with open-vocabulary detector via reference-semantic token
Uses reference-semantic token as control signal and text embedding replacement
Introduces tube-mined temporal regularization for consistent object queries
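TTReg, as described above, pushes the detector to produce temporally consistent queries for each target object across frames. One plausible reading is a consecutive-frame cosine-consistency penalty over a mined object tube; the loss below is our assumption for illustration, not the paper's actual formulation.

```python
import numpy as np

def ttreg_loss(tube_queries):
    # tube_queries: (F, D) — the detector query assigned to one object tube
    # across F frames. Penalize frame-to-frame deviation so queries for the
    # same object stay temporally consistent.
    q = tube_queries / np.linalg.norm(tube_queries, axis=-1, keepdims=True)
    sim = np.sum(q[1:] * q[:-1], axis=-1)  # cosine similarity of consecutive frames
    return float(np.mean(1.0 - sim))       # 0 when queries are identical

# Identical queries across 4 frames incur zero loss; perturbed ones are penalized.
stable = np.tile(np.array([1.0, 0.0, 0.0]), (4, 1))
noisy = stable + np.random.default_rng(1).normal(0.0, 0.5, stable.shape)
print(ttreg_loss(stable), ttreg_loss(noisy) > 0)  # 0.0 True
```

Keeping the per-object query stable across frames is what enables reliable temporal association, i.e. linking per-frame boxes into a coherent tube without drift.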
🔎 Similar Papers

Shida Gao (Beijing University of Posts and Telecommunications)
Feng Xue (University of Trento)
Xiangfeng Wang (Beijing University of Posts and Telecommunications)
Anlong Ming (Beijing University of Posts and Telecommunications)
Teng Long (University of Amsterdam) · Hyperbolic Learning, Quantization, Retrieval
Yihua Shao (Institute of Automation, Chinese Academy of Sciences)
Haozhe Wang (Hong Kong University of Science and Technology)
Zhaowen Lin (Beijing University of Posts and Telecommunications)
Wei Wang (ZTE Corporation)
Nicu Sebe (University of Trento) · computer vision, multimedia