🤖 AI Summary
To address the high annotation cost and poor scalability of supervised fine-tuning (SFT) and long chain-of-thought labeling in video reasoning, this paper proposes a fully reinforcement learning (RL)-based framework that eliminates the need for SFT. The method introduces two key innovations: (1) an output-reward-driven RL training mechanism that bypasses reliance on human-annotated reasoning traces; and (2) a sparse-to-dense test-time scaling strategy that integrates video-adaptive frame selection with output consistency evaluation to enable dynamic allocation of computational resources. Evaluated on the Video-Holmes and MMVU benchmarks, the approach achieves an average accuracy improvement of 2.4% while using only 3.6% of the training samples, including gains of +4.2% on Video-Holmes and +2.6% on MMVU. This demonstrates substantial improvements in both data efficiency and computational efficiency.
📝 Abstract
Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy while using only 3.6% of the training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
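The abstract's sparse-to-dense TTS idea (start with few frames, re-query the model, and densify sampling only when the model's answers disagree) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function `answer_fn` (a stand-in for querying the video LLM on a frame subset), the frame budgets, and the unanimity-based agreement criterion are all illustrative assumptions.

```python
# Hypothetical sketch of a sparse-to-dense test-time scaling loop.
# Assumptions (not from the paper): `answer_fn` queries the video LLM
# on a list of frames and returns an answer string; frame counts double
# each round; full agreement among samples is the stopping criterion.
from collections import Counter


def sparse_to_dense_tts(video_frames, answer_fn, num_samples=4,
                        start_frames=8, max_frames=32, agreement=1.0):
    """Iteratively densify frame sampling until outputs are consistent."""
    n = start_frames
    while True:
        # Uniformly subsample n frames from the full video.
        step = max(len(video_frames) // n, 1)
        frames = video_frames[::step][:n]
        # Sample several answers from the model on the same sparse input.
        answers = [answer_fn(frames) for _ in range(num_samples)]
        top, count = Counter(answers).most_common(1)[0]
        # Consistent outputs suggest the sparse frames suffice; otherwise
        # allocate more compute by adding frames, up to the budget.
        if count / num_samples >= agreement or n >= max_frames:
            return top
        n *= 2
```

The design intuition mirrors the abstract: easy questions terminate at the sparse stage (cheap), while ambiguous ones trigger denser frame sampling, so compute is allocated per-video rather than uniformly.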