LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-video understanding faces a fundamental trade-off between temporal coverage and spatial fidelity: uniform frame sampling must compromise between sampling density and frame resolution, impairing either temporal modeling or fine-grained perception. To address this, we propose an adaptive zoom-in, multi-step reasoning framework that decouples spatiotemporal modeling: the model first performs coarse-grained temporal localization over densely sampled low-resolution frames, then dynamically re-samples key segments at higher resolution for fine-grained local perception. The approach integrates slow-fast dual-mode sampling, chain-of-thought finetuning, and decoupled reinforcement learning. Evaluated on four mainstream long-video understanding benchmarks, the method improves over Qwen2.5-VL by an average of 3.1 points, enhancing long-horizon temporal reasoning and high-fidelity spatial perception at the same time.
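To make the loop above concrete, here is a minimal Python sketch of the adaptive zoom-in process: one dense low-resolution pass over the full timeline, followed by model-requested high-resolution re-sampling of clips of interest. All interfaces here (`sample_frames`, `vlm_step`, the `Zoom` action, the resolutions and frame rates) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Zoom:
    start: float  # clip start time in seconds
    end: float    # clip end time in seconds

def answer_with_adaptive_zoom(video, question, vlm_step, sample_frames,
                              low_res=(112, 112), high_res=(448, 448),
                              max_steps=4):
    """Coarse-to-fine loop: one dense low-resolution pass over the full video,
    then model-requested high-resolution zoom-ins until an answer is produced."""
    # "Fast" stream: dense, low-resolution frames cover the whole timeline.
    context = [sample_frames(video, fps=1.0, size=low_res)]
    for _ in range(max_steps):
        # The model reasons over the evidence gathered so far and either
        # answers or asks to zoom in on a clip of interest.
        action = vlm_step(question, context)
        if action.kind == "answer":
            return action.text
        # "Slow" stream: re-sample only the requested clip at high resolution.
        zoom: Zoom = action.zoom
        context.append(sample_frames(video, fps=4.0, size=high_res,
                                     start=zoom.start, end=zoom.end))
    # Step budget exhausted: answer with whatever evidence has been gathered.
    return vlm_step(question, context, force_answer=True).text
```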

📝 Abstract
Long video understanding remains challenging for recent Large Video-Language Models (LVLMs) due to the conflict between long-form temporal understanding and detailed spatial perception. LVLMs with a uniform frame sampling mechanism, which samples frames at a fixed resolution and sampling rate, inevitably sacrifice either temporal clues or spatial details, resulting in suboptimal solutions. To mitigate this dilemma, we propose LOVE-R1, a model that can adaptively zoom in on a video clip. The model is first provided with densely sampled frames at a small resolution; if spatial details are needed, it can zoom in on a clip of interest at a large frame resolution, guided by its reasoning, until the key visual information is obtained. The whole process is implemented as multi-step reasoning. To train the reasoning ability, we first finetune the model on our collected 38k high-quality CoT data and enhance it with decoupled reinforcement finetuning. As outcome rewards cannot provide fine-grained process supervision, we decouple multi-step reasoning into multiple single-step reasoning problems and optimize the internal zoom-in ability explicitly. Experiments on long video understanding benchmarks show that our slow-fast adaptive frame sampling mechanism achieves a favorable trade-off between sampling density and frame resolution, and LOVE-R1 outperforms its baseline Qwen2.5-VL by an average of 3.1 points across four common long video understanding benchmarks.
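The abstract's decoupled reinforcement finetuning scores individual reasoning steps instead of relying on a single outcome reward for the whole trajectory. The sketch below shows one plausible shape for such per-step rewards, using temporal IoU against an annotated key segment for zoom-in steps; this particular reward choice is an assumption for illustration, not the paper's confirmed design.

```python
def temporal_iou(pred, gold):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def step_rewards(trajectory, gold_segment, gold_answer):
    """Score every step individually, so credit assignment does not depend
    solely on the final answer (outcome rewards alone give no fine-grained
    process supervision)."""
    rewards = []
    for step in trajectory:
        if step.kind == "zoom":
            # Process-level reward: did this zoom-in land on the key segment?
            rewards.append(temporal_iou((step.start, step.end), gold_segment))
        else:
            # Outcome-level reward: correctness of the final answer.
            rewards.append(1.0 if step.text == gold_answer else 0.0)
    return rewards
```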
Problem

Research questions and friction points this paper is trying to address.

Uniform frame sampling at a fixed resolution and rate forces a trade-off between temporal coverage and spatial detail (made concrete by the token-budget sketch below this list)
Static sampling cannot adapt frame resolution to the fine-grained local evidence a question actually requires
Outcome rewards alone provide no fine-grained process supervision for multi-step zoom-in reasoning
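To see why the first friction point is unavoidable under uniform sampling, consider a fixed visual-token budget: every token spent on spatial resolution is a token not spent on temporal coverage. The patch size and budget below are illustrative assumptions, not numbers from the paper.

```python
PATCH = 28     # pixels per patch side, typical of ViT-style visual encoders (assumed)
BUDGET = 8192  # total visual tokens the LVLM can attend to (assumed)

def frames_within_budget(width, height, budget=BUDGET):
    """How many frames fit in the token budget at a given frame resolution."""
    tokens_per_frame = (width // PATCH) * (height // PATCH)
    return budget // tokens_per_frame

# High resolution: rich spatial detail, but only sparse temporal coverage.
print(frames_within_budget(448, 448))  # -> 32 frames
# Low resolution: dense temporal coverage, but fine details are lost.
print(frames_within_budget(112, 112))  # -> 512 frames
```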
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive zoom-in on clips of interest at higher frame resolution
Multi-step reasoning with slow-fast dual-mode sampling
Decoupled reinforcement finetuning that optimizes each reasoning step explicitly
Authors

Shenghao Fu
Sun Yat-sen University
computer vision, object detection, large multi-modal models

Qize Yang
Tongyi Lab, Alibaba Group
computer vision, deep learning

Yuan-Ming Li
Sun Yat-sen University
computer vision

Xihan Wei
Tongyi Lab, Alibaba Group

Xiaohua Xie
School of Computer Science and Engineering, Sun Yat-sen University, China; Guangdong Province Key Laboratory of Information Security Technology, China; Pazhou Laboratory (Huangpu), China

Wei-Shi Zheng
Professor, Sun Yat-sen University
computer vision, pattern recognition, machine learning