Scaling RL to Long Videos

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of long-video reasoning—poor temporal modeling, high computational overhead, and performance degradation as video length grows—this paper introduces LongVILA-R1, a full-stack framework. It establishes LongVideo-Reason, a large-scale dataset of 52K long-video QA pairs with high-quality reasoning annotations; proposes a two-stage training paradigm of chain-of-thought supervised fine-tuning followed by reinforcement learning; and builds MR-SP, a training infrastructure that combines sequence parallelism with a vLLM-based engine using cached video embeddings, enabling efficient RL training on hour-long videos. The resulting LongVILA-R1-7B model significantly outperforms existing open-source models across multiple long-video QA benchmarks, approaching the performance of Gemini-1.5-Pro, while achieving up to a 2.1× speedup in single-node training throughput. The core contribution is a scalable, multi-modal, RL-based long-video reasoning system that demonstrates sustained performance gains as video length increases.

📝 Abstract
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
Problem

Research questions and friction points this paper is trying to address.

Scaling reinforcement learning to long videos for vision-language models
Addressing challenges in long video reasoning with a two-stage training pipeline
Improving efficiency in long video RL training with Multi-modal Reinforcement Sequence Parallelism
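The sequence-parallelism idea behind MR-SP—splitting an hour-long token sequence (around 256k tokens) across the 8 GPUs of a node—can be sketched as follows. This is a toy illustration of even contiguous sharding only; the function name is hypothetical, and the real system additionally exchanges attention state across ranks and integrates with a vLLM-based rollout engine.

```python
def shard_sequence(tokens, world_size):
    """Split a long token sequence into contiguous, near-equal shards,
    one per rank (toy sketch of sequence parallelism).

    Real sequence parallelism also communicates attention state between
    ranks; here we only show the partitioning step.
    """
    base, rem = divmod(len(tokens), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        # Early ranks absorb the remainder so shard sizes differ by at most 1.
        size = base + (1 if rank < rem else 0)
        shards.append(tokens[start:start + size])
        start += size
    return shards

# ~256k tokens (an hour-long video) spread over a single 8-GPU node.
shards = shard_sequence(list(range(256_000)), 8)
assert sum(len(s) for s in shards) == 256_000
assert max(len(s) for s in shards) - min(len(s) for s in shards) <= 1
```

Each rank then processes only its shard, which is what keeps per-GPU memory bounded as video length grows.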
Innovation

Methods, ideas, or system contributions that make the work stand out.

LongVideo-Reason: a large-scale dataset of 52K long-video QA pairs with reasoning annotations across diverse domains
Two-stage training pipeline combining CoT-SFT and RL
MR-SP: sequence parallelism with cached video embeddings for efficient long-video RL training
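The cached-embedding part of MR-SP rests on a simple observation: RL generates many rollouts per video, but the expensive video encoding only needs to happen once. A minimal sketch of that caching pattern, with a toy stand-in for the encoder (the class and function names are hypothetical, not the paper's API):

```python
def encode_video(frames):
    # Toy stand-in for an expensive vision encoder producing one
    # embedding per frame; the real encoder runs on GPU.
    return [len(f) / 10.0 for f in frames]

class CachedEmbeddingEngine:
    """Sketch of MR-SP's cached-embedding idea: encode each video once,
    then reuse the stored embeddings for every RL rollout and prefill."""

    def __init__(self):
        self._cache = {}
        self.encoder_calls = 0  # track how often the encoder actually runs

    def get_embeddings(self, video_id, frames):
        if video_id not in self._cache:
            self.encoder_calls += 1
            self._cache[video_id] = encode_video(frames)
        return self._cache[video_id]

engine = CachedEmbeddingEngine()
frames = [f"frame_{i}" for i in range(3600)]  # hour-long video at 1 fps
# Many rollouts per video during RL; the encoder runs only once.
rollouts = [engine.get_embeddings("vid0", frames) for _ in range(16)]
assert engine.encoder_calls == 1
```

Amortizing the encoder over all rollouts of a video is one source of the reported training speedup; the other is the sequence parallelism that shards those cached embeddings across GPUs.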