Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from suboptimal performance in spatio-temporal video grounding (STVG) due to misaligned training objectives and insufficient fine-grained region–word alignment capability of standard visual encoders. To address this without architectural modification, we propose a refined fine-tuning framework. Our contributions are threefold: (1) We introduce Box Chain-of-Thought—a novel, explicit modeling of the progressive reasoning process for spatio-temporal localization; (2) We design a geometric-aware supervision signal coupled with a multi-dimensional reinforcement learning reward function, jointly optimizing localization accuracy, temporal consistency, and semantic alignment; (3) We enhance the fine-grained region–word alignment capacity of off-the-shelf visual encoders. On HCSTVG-v1, our method achieves a 7.3% absolute gain in mean temporal-IoU over prior state-of-the-art methods and significantly outperforms existing MLLM-based approaches, demonstrating strong open-vocabulary generalization.

📝 Abstract
Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.
Problem

Research questions and friction points this paper is trying to address.

Improving object localization in videos using language descriptions
Addressing misalignment between multimodal models and spatio-temporal tasks
Enhancing fine-grained region-word alignment in video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bounding-box chain-of-thought mechanism for reasoning
Multi-dimensional reinforcement reward function design
Geometry-aware supervision through reinforcement fine-tuning
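The abstract names five reward terms (format, consistency, temporal, spatial, and think) that are combined into a geometry-aware supervision signal. The sketch below illustrates one plausible way such a multi-dimensional reward could be composed from temporal-interval IoU and per-frame box IoU. The component names come from the abstract; the weights, the consistency formulation, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-dimensional STVG reward.
# The five terms (format, consistency, temporal, spatial, think) are named in
# the paper's abstract; the weights and IoU-based formulas are assumptions.

def interval_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """Spatial IoU between two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def stvg_reward(pred_span, gt_span, pred_boxes, gt_boxes,
                format_ok, think_ok, weights=(0.1, 0.1, 0.4, 0.3, 0.1)):
    """Weighted sum of the five reward terms; the weights are illustrative."""
    w_fmt, w_cons, w_t, w_s, w_think = weights
    r_fmt = 1.0 if format_ok else 0.0       # output parses into the expected format
    r_t = interval_iou(pred_span, gt_span)  # temporal overlap
    # Spatial term: mean box IoU over aligned frames.
    ious = [box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    r_s = sum(ious) / len(ious) if ious else 0.0
    # Consistency term: penalize large box jumps between adjacent frames.
    jumps = [1.0 - box_iou(a, b) for a, b in zip(pred_boxes, pred_boxes[1:])]
    r_cons = 1.0 - (sum(jumps) / len(jumps) if jumps else 0.0)
    r_think = 1.0 if think_ok else 0.0      # chain-of-thought step is present
    return (w_fmt * r_fmt + w_cons * r_cons + w_t * r_t
            + w_s * r_s + w_think * r_think)
```

A perfectly localized, well-formatted prediction would score 1.0 under these weights (they sum to one); each degraded component lowers the reward smoothly, which is what makes the signal usable for reinforcement fine-tuning.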
Xin Gu
ByteDance Intelligent Creation
Haoji Zhang
Tsinghua University
Qihang Fan
PhD Student, Institute of Automation, Chinese Academy of Sciences
computer vision, multi-modal large language models, deep learning architecture
Jingxuan Niu
Tsinghua University
Zhipeng Zhang
School of Artificial Intelligence, Shanghai Jiao Tong University
Computer Vision, Object Tracking and Segmentation
Libo Zhang
Institute of Software, Chinese Academy of Sciences
Guang Chen
ByteDance Intelligent Creation
Fan Chen
ByteDance Intelligent Creation
Longyin Wen
Bytedance Inc.
Artificial Intelligence, Computer Vision, Machine Learning
Sijie Zhu
Unknown affiliation