Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

📅 2025-10-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video reasoning models generate only textual reasoning chains without localizing their spatio-temporal evidence (i.e., timestamps and spatial bounding boxes), which limits interpretability and leaves explanations disconnected from visual grounding. To address this, we propose Open-o3 Video, a framework for explicit spatio-temporal evidence-based video reasoning that jointly performs temporal localization and spatial tracking, annotating key frames, target objects, and their bounding boxes alongside its answers. We curate two high-quality datasets with unified spatio-temporal supervision and reasoning traces, STGR-CoT-30k for supervised fine-tuning and STGR-RL-36k for reinforcement learning, and adopt a cold-start RL strategy whose rewards jointly optimize answer accuracy, temporal alignment, and spatial localization precision. On the V-STAR benchmark the method sets a new state of the art, improving mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline, with consistent gains on VideoMME, WorldSense, VideoMMMU, and TVGBench; its reasoning traces further support confidence-aware verification at test time, improving answer reliability.
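To make the reward design above concrete, here is a minimal sketch of how a spatio-temporally aware reward could combine answer correctness, temporal overlap, and box overlap. The metric choices (temporal IoU, box IoU over key frames) and the weights are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical composite reward: answer accuracy + temporal alignment + spatial precision.
# Metrics and weights are assumptions for illustration, not the paper's formulation.

def temporal_iou(pred_span, gt_span):
    """IoU between two (start, end) time spans in seconds."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = (pred_span[1] - pred_span[0]) + (gt_span[1] - gt_span[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def grounded_reward(pred, gt, w_ans=1.0, w_time=0.5, w_space=0.5):
    """Weighted sum of answer correctness, temporal IoU, and mean key-frame box IoU."""
    r_ans = 1.0 if pred["answer"] == gt["answer"] else 0.0
    r_time = temporal_iou(pred["span"], gt["span"])
    ious = [box_iou(pred["boxes"][t], box)
            for t, box in gt["boxes"].items() if t in pred["boxes"]]
    r_space = sum(ious) / len(gt["boxes"]) if gt["boxes"] else 0.0
    return w_ans * r_ans + w_time * r_time + w_space * r_space
```

Dividing by the number of ground-truth boxes penalizes missing key frames as well as imprecise boxes, so a policy cannot earn spatial reward by grounding only the easy frames.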

📝 Abstract
Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% on the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
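As one way to picture the grounded outputs described above (timestamps, objects, and boxes attached to the reasoning trace), the snippet below extracts evidence items from a generated answer. The `<evidence ...>` tag syntax is invented here purely for illustration; the model's actual output schema may differ.

```python
import re
from dataclasses import dataclass

# Hypothetical evidence tag, e.g. <evidence t=12.5 obj="person" box=0.31,0.20,0.58,0.92>.
# The tag format is an assumption for illustration, not the model's real output schema.
EVIDENCE_RE = re.compile(
    r'<evidence\s+t=(?P<t>[\d.]+)\s+obj="(?P<obj>[^"]+)"\s+box=(?P<box>[\d.,]+)>'
)

@dataclass
class Evidence:
    timestamp: float   # seconds into the video
    obj: str           # object the reasoning step refers to
    box: tuple         # normalized (x1, y1, x2, y2)

def parse_evidence(trace: str):
    """Extract all grounded evidence items cited in a reasoning trace."""
    items = []
    for m in EVIDENCE_RE.finditer(trace):
        box = tuple(float(v) for v in m.group("box").split(","))
        items.append(Evidence(float(m.group("t")), m.group("obj"), box))
    return items

trace = ('The person reaches for the bottle at '
         '<evidence t=12.5 obj="person" box=0.31,0.20,0.58,0.92> and drinks at '
         '<evidence t=15.0 obj="bottle" box=0.40,0.35,0.52,0.60>.')
print(parse_evidence(trace))
```

Structured evidence of this kind is also what makes the test-time verification mentioned above possible: sampled answers can be re-checked against the frames and boxes they cite.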
Problem

Research questions and friction points this paper is trying to address.

Video reasoning lacks explicit spatio-temporal evidence localization
Existing datasets lack unified spatio-temporal supervision and reasoning traces
Extending evidence-centered reasoning from images to videos requires joint temporal tracking and spatial localization across dynamic scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates explicit spatio-temporal evidence into video reasoning
Curates STGR-CoT-30k (SFT) and STGR-RL-36k (RL) with unified spatio-temporal annotations and reasoning traces
Adopts a cold-start reinforcement learning strategy with multiple jointly optimized rewards (a sketch of the RL scoring step follows this list)
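Below is a minimal sketch of the reward-normalization step that such a cold-start RL stage might use, assuming GRPO-style group-relative advantages over several sampled reasoning traces per question; the paper's actual optimization algorithm and hyperparameters may differ.

```python
# Hypothetical sketch: normalize composite spatio-temporal rewards within a group
# of sampled rollouts for the same video question. GRPO-style normalization is an
# assumption for illustration, not necessarily the paper's exact algorithm.
import statistics

def group_advantages(rewards):
    """Return per-rollout advantages as z-scores within the sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four sampled traces scored by a composite reward like grounded_reward() above.
print(group_advantages([0.9, 0.4, 0.6, 0.1]))
```

Rollouts whose answers are correct and whose cited spans and boxes overlap the annotations receive higher advantages, pushing the policy toward evidence-grounded traces.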