🤖 AI Summary
Video multimodal large language models (MLLMs) suffer from weak spatiotemporal perception and struggle to balance specialized understanding with general-purpose capabilities.
Method: This paper proposes a multi-task reinforcement fine-tuning (RFT) framework based on Group Relative Policy Optimization (GRPO), the first systematic application of GRPO to video MLLM training. It integrates rule-guided reward modeling with fine-grained spatiotemporal-aware objectives, enabling joint enhancement of spatiotemporal reasoning and open-ended dialogue proficiency under ultra-low-data regimes.
Results: Experiments demonstrate substantial gains: +31.8 in temporal grounding and +31.2 in object tracking over Qwen2.5-VL-7B. Moreover, the method achieves consistent improvements of 0.9–1.0 points on comprehensive benchmarks including VideoMME and MVBench, validating its effectiveness in eliciting emergent spatiotemporal reasoning capabilities.
📄 Abstract
Recent advancements in reinforcement learning have significantly advanced the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold in tasks like temporal grounding (+31.8) and object tracking (+31.2). Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs. We hope our work offers valuable insights for future RL research in video MLLMs.
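The core mechanism described above, GRPO with rule-based rewards, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the specific reward terms (a format bonus plus a temporal-grounding IoU) and the function names are assumptions chosen to mirror common rule-guided reward designs; the paper's exact reward definitions may differ.

```python
# Hedged sketch of GRPO's group-relative advantage computation with a
# rule-guided reward. Reward terms (format bonus + temporal IoU) are
# illustrative assumptions, not the paper's exact definitions.
from statistics import mean, pstdev

def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rule_reward(response, gt_span):
    """Rule-guided reward: format bonus plus temporal-grounding accuracy."""
    fmt = 1.0 if response.get("has_think_answer_tags") else 0.0
    acc = temporal_iou(response["span"], gt_span)
    return fmt + acc

def group_relative_advantages(responses, gt_span):
    """GRPO step: normalize each sampled response's reward within its group,
    so no learned critic is needed -- the group itself is the baseline."""
    rewards = [rule_reward(r, gt_span) for r in responses]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]
```

Responses whose predicted span overlaps the ground truth (and that follow the required output format) receive positive advantages relative to their sampled group, which is what lets rule-checkable tasks like temporal grounding be optimized without a reward model.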