🤖 AI Summary
This work proposes LongVideo-R1, a reasoning-capable multimodal large language model agent designed to address the inefficiency of long video understanding under constrained computational resources. LongVideo-R1 actively navigates to critical video segments using high-level visual cues and terminates exploration early when sufficient information is acquired, thereby avoiding redundant traversal. Built upon Qwen-3-8B, the model undergoes a two-stage training process involving supervised fine-tuning and reinforcement learning. It leverages CGBench to construct hierarchical descriptions and utilizes GPT-5 to generate 33K chain-of-thought tool trajectories, with a tailored reward function optimizing segment selection strategies. Evaluated across multiple long-video benchmarks, LongVideo-R1 achieves significantly improved inference efficiency while maintaining high accuracy. The code and data are publicly released.
📝 Abstract
This paper addresses the critical and underexplored challenge of long video understanding under low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation that avoids the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, halting exploration immediately upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned from the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to encourage selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which achieves a superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: https://github.com/qiujihao19/LongVideo-R1
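The coarse-to-fine navigation described above (start from top-level summaries, repeatedly zoom into the most relevant clip, stop early once enough is known) can be sketched as a tree traversal. The sketch below is purely illustrative: `CaptionNode`, `score_relevance`, and `navigate` are assumed names, and the keyword-overlap scorer is a crude stand-in for the MLLM reasoning module, not the paper's actual method.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CaptionNode:
    """One node in a hypothetical hierarchical caption tree: a summary
    of a clip plus finer-grained child clips (empty for leaf clips)."""
    caption: str
    children: List["CaptionNode"] = field(default_factory=list)

def score_relevance(caption: str, query: str) -> float:
    """Stand-in for the MLLM reasoning module: crude keyword overlap
    between query and caption, in [0, 1]."""
    q = set(query.lower().split())
    c = set(caption.lower().split())
    return len(q & c) / max(len(q), 1)

def navigate(root: CaptionNode, query: str,
             stop_threshold: float = 0.8) -> CaptionNode:
    """Descend from top-level summaries toward finer clips, always
    following the most query-relevant child, and halt early once a
    node's caption appears sufficient to answer the query."""
    node = root
    while node.children:
        if score_relevance(node.caption, query) >= stop_threshold:
            break  # early termination: enough information acquired
        node = max(node.children,
                   key=lambda c: score_relevance(c.caption, query))
    return node
```

In the actual system, the relevance judgment and the stopping decision are both made by the fine-tuned MLLM via chain-of-thought tool calls, and the RL reward shapes how aggressively the agent stops early versus descending further.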