VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak causal and temporal reasoning in video understanding, this paper proposes VideoRFT, a reinforcement fine-tuning framework. First, a fully automatic chain-of-thought (CoT) curation pipeline produces 102K high-quality supervised fine-tuning (SFT) samples and 310K reinforcement learning (RL) samples. Second, a semantic-consistency reward guides the RL stage to keep textual reasoning aligned with visual evidence. By combining cognition-inspired prompting, structured video representations, and visual-consistency revision of generated CoTs, VideoRFT achieves state-of-the-art performance on six video reasoning benchmarks, significantly mitigating visual hallucination while improving reasoning coherence and factual accuracy. The work also releases two new large-scale video CoT datasets, VideoRFT-CoT-102K and VideoRFT-RL-310K, providing a foundational resource for multimodal temporal reasoning research.

📝 Abstract
Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities of Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logical, temporal, and causal structures inherent in video data. To fill this gap, we propose VIDEORFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VIDEORFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge in achieving this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a fully automatic CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by a visual-language model conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets: VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VIDEORFT achieves state-of-the-art performance on six video reasoning benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing video reasoning in MLLMs via reinforced fine-tuning
Addressing scarcity of high-quality video CoT datasets
Improving alignment between textual reasoning and visual evidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement fine-tuning extends to video reasoning MLLMs
Automatic CoT curation pipeline generates video reasoning datasets
Semantic-consistency reward aligns reasoning with visual evidence
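The semantic-consistency reward in the last bullet can be sketched as a similarity score between an embedding of the generated reasoning text and embeddings of sampled video frames. This is a minimal illustration only: the function names, the mean-pooling of frame embeddings, and the combination with accuracy/format rewards are assumptions, not the paper's exact formulation, and the choice of encoder (e.g. a CLIP-style model) is left to the caller.

```python
import numpy as np

def semantic_consistency_reward(text_emb: np.ndarray, frame_embs: np.ndarray) -> float:
    """Cosine similarity between a reasoning-trace embedding and the
    mean video-frame embedding, rescaled from [-1, 1] to [0, 1].

    text_emb:   shape (d,)   embedding of the generated chain-of-thought
    frame_embs: shape (n, d) embeddings of n sampled video frames
    (How these embeddings are produced is an assumption; any shared
    text-image embedding space, e.g. CLIP-style, would fit this sketch.)
    """
    video_emb = frame_embs.mean(axis=0)          # pool frames into one vector
    cos = float(
        text_emb @ video_emb
        / (np.linalg.norm(text_emb) * np.linalg.norm(video_emb) + 1e-8)
    )
    return 0.5 * (cos + 1.0)

def total_reward(accuracy: float, format_ok: bool,
                 text_emb: np.ndarray, frame_embs: np.ndarray,
                 w_sem: float = 0.5) -> float:
    # Hypothetical combination of standard RFT rewards (answer accuracy,
    # output-format check) with the semantic-consistency term.
    fmt = 1.0 if format_ok else 0.0
    return accuracy + fmt + w_sem * semantic_consistency_reward(text_emb, frame_embs)
```

In an RL loop, this scalar would be computed per rollout and fed to the policy-gradient update, so reasoning traces that drift away from the visual evidence earn systematically lower reward.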