LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-video understanding faces a fundamental trade-off between temporal coverage and spatial fidelity: uniform frame sampling must compromise between sampling density and frame resolution, impairing either temporal modeling or fine-grained perception. To address this, we propose an adaptive zoom-in, multi-step reasoning framework that decouples spatiotemporal modeling: the model first performs coarse-grained temporal localization over densely sampled low-resolution frames, then dynamically re-samples key segments at higher resolution for fine-grained local perception. The approach integrates slow-fast dual-mode sampling, chain-of-thought finetuning, and decoupled reinforcement learning. Evaluated on four mainstream long-video understanding benchmarks, the method improves over Qwen2.5-VL by an average of 3.1 points, enhancing long-horizon temporal reasoning and high-fidelity spatial perception at the same time.
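To make the loop above concrete, here is a minimal Python sketch of the adaptive zoom-in process: one dense low-resolution pass over the full timeline, followed by model-requested high-resolution re-sampling of clips of interest. All interfaces here (`sample_frames`, `vlm_step`, the `Zoom` action, the resolutions and frame rates) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Zoom:
    start: float  # clip start time in seconds
    end: float    # clip end time in seconds

def answer_with_adaptive_zoom(video, question, vlm_step, sample_frames,
                              low_res=(112, 112), high_res=(448, 448),
                              max_steps=4):
    """Coarse-to-fine loop: one dense low-resolution pass over the full video,
    then model-requested high-resolution zoom-ins until an answer is produced."""
    # "Fast" stream: dense, low-resolution frames cover the whole timeline.
    context = [sample_frames(video, fps=1.0, size=low_res)]
    for _ in range(max_steps):
        # The model reasons over the evidence gathered so far and either
        # answers or asks to zoom in on a clip of interest.
        action = vlm_step(question, context)
        if action.kind == "answer":
            return action.text
        # "Slow" stream: re-sample only the requested clip at high resolution.
        zoom: Zoom = action.zoom
        context.append(sample_frames(video, fps=4.0, size=high_res,
                                     start=zoom.start, end=zoom.end))
    # Step budget exhausted: answer with whatever evidence has been gathered.
    return vlm_step(question, context, force_answer=True).text
```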

📝 Abstract
Long video understanding remains challenging for recent Large Video-Language Models (LVLMs) due to the conflict between long-form temporal understanding and detailed spatial perception. LVLMs with a uniform frame sampling mechanism, which samples frames at a fixed resolution and sampling rate, inevitably sacrifice either temporal clues or spatial details, resulting in suboptimal solutions. To mitigate this dilemma, we propose LOVE-R1, a model that can adaptively zoom in on a video clip. The model is first provided with densely sampled frames at a small resolution; if spatial details are needed, it can zoom in on a clip of interest at a large frame resolution, guided by its reasoning, until the key visual information is obtained. The whole process is implemented as multi-step reasoning. To train the reasoning ability, we first finetune the model on our collected 38k high-quality CoT data and enhance it with decoupled reinforcement finetuning. As outcome rewards cannot provide fine-grained process supervision, we decouple multi-step reasoning into multiple single-step reasoning problems and optimize the internal zoom-in ability explicitly. Experiments on long video understanding benchmarks show that our slow-fast adaptive frame sampling mechanism achieves a favorable trade-off between sampling density and frame resolution, and LOVE-R1 outperforms its baseline Qwen2.5-VL by an average of 3.1 points across four common long video understanding benchmarks.
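The abstract's decoupled reinforcement finetuning scores individual reasoning steps instead of relying on a single outcome reward for the whole trajectory. The sketch below shows one plausible shape for such per-step rewards, using temporal IoU against an annotated key segment for zoom-in steps; this particular reward choice is an assumption for illustration, not the paper's confirmed design.

```python
def temporal_iou(pred, gold):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def step_rewards(trajectory, gold_segment, gold_answer):
    """Score every step individually, so credit assignment does not depend
    solely on the final answer (outcome rewards alone give no fine-grained
    process supervision)."""
    rewards = []
    for step in trajectory:
        if step.kind == "zoom":
            # Process-level reward: did this zoom-in land on the key segment?
            rewards.append(temporal_iou((step.start, step.end), gold_segment))
        else:
            # Outcome-level reward: correctness of the final answer.
            rewards.append(1.0 if step.text == gold_answer else 0.0)
    return rewards
```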
Problem

Research questions and friction points this paper is trying to address.

Uniform frame sampling at a fixed resolution and rate forces a trade-off between temporal coverage and spatial detail (made concrete by the token-budget sketch below this list)
Static sampling cannot adapt frame resolution to the fine-grained local evidence a question actually requires
Outcome rewards alone provide no fine-grained process supervision for multi-step zoom-in reasoning
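To see why the first friction point is unavoidable under uniform sampling, consider a fixed visual-token budget: every token spent on spatial resolution is a token not spent on temporal coverage. The patch size and budget below are illustrative assumptions, not numbers from the paper.

```python
PATCH = 28     # pixels per patch side, typical of ViT-style visual encoders (assumed)
BUDGET = 8192  # total visual tokens the LVLM can attend to (assumed)

def frames_within_budget(width, height, budget=BUDGET):
    """How many frames fit in the token budget at a given frame resolution."""
    tokens_per_frame = (width // PATCH) * (height // PATCH)
    return budget // tokens_per_frame

# High resolution: rich spatial detail, but only sparse temporal coverage.
print(frames_within_budget(448, 448))  # -> 32 frames
# Low resolution: dense temporal coverage, but fine details are lost.
print(frames_within_budget(112, 112))  # -> 512 frames
```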
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive zoom-in on clips of interest at higher frame resolution
Multi-step reasoning with slow-fast dual-mode sampling
Decoupled reinforcement finetuning that optimizes each reasoning step explicitly
Authors

Shenghao Fu
Sun Yat-sen University
computer vision, object detection, large multi-modal models

Qize Yang
Tongyi Lab, Alibaba Group
computer vision, deep learning

Yuan-Ming Li
Sun Yat-sen University
computer vision

Xihan Wei
Tongyi Lab, Alibaba Group

Xiaohua Xie
School of Computer Science and Engineering, Sun Yat-sen University, China; Guangdong Province Key Laboratory of Information Security Technology, China; Pazhou Laboratory (Huangpu), China

Wei-Shi Zheng
Professor, Sun Yat-sen University
computer vision, pattern recognition, machine learning