VideoCuRL: Video Curriculum Reinforcement Learning with Orthogonal Difficulty Decomposition

📅 2025-12-31
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing reinforcement learning approaches in video understanding, which rely on a single scalar difficulty metric and fail to distinguish between orthogonal challenges in visual-temporal perception and cognitive reasoning. To overcome this, the authors propose VideoCuRL, a novel framework that decouples video difficulty into two dimensions: visual complexity—quantified via optical flow and keyframe entropy—and cognitive complexity—measured by calibrated surprisal. These dimensions form a two-dimensional curriculum grid, traversed during training using a diagonal wavefront scheduling strategy. The method incorporates lightweight, training-free proxy metrics, dynamic sparse KL regularization, and a structured replay mechanism to enhance training stability. Experiments demonstrate consistent improvements of 2.5 and 2.9 points on VSI-Bench and VideoMME, respectively, while avoiding the inference overhead associated with generative curriculum methods.
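To make the two-axis idea concrete, here is a minimal sketch of how per-sample scores could be placed on a 2D curriculum grid. It is an illustration under stated assumptions, not the authors' implementation: the function names (`keyframe_entropy`, `quantile_bin`, `build_grid`), the histogram bin count, and the rank-based binning are all hypothetical choices. The visual and cognitive scores are assumed to be precomputed floats, since the summary does not say how optical flow, keyframe entropy, or calibrated surprisal are combined.

```python
import math
from collections import Counter


def keyframe_entropy(pixels, bins=16):
    """Shannon entropy (in bits) of a grayscale histogram; pixels are ints in 0..255."""
    if not pixels:
        return 0.0
    counts = Counter(min(p * bins // 256, bins - 1) for p in pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def quantile_bin(scores, k):
    """Assign each score a bin in 0..k-1 by rank, so bins are roughly balanced."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(scores)
    return bins


def build_grid(visual, cognitive, k=3):
    """Return {(visual_bin, cognitive_bin): [sample indices]} for a k x k grid.

    Assumption: both score lists are precomputed, one float per sample.
    """
    v_bins = quantile_bin(visual, k)
    c_bins = quantile_bin(cognitive, k)
    grid = {}
    for idx, cell in enumerate(zip(v_bins, c_bins)):
        grid.setdefault(cell, []).append(idx)
    return grid
```

Rank-based binning is used here only because it keeps cells roughly balanced regardless of score scale; fixed thresholds would be an equally plausible choice.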

📝 Abstract
Reinforcement Learning (RL) is crucial for empowering VideoLLMs with complex spatiotemporal reasoning. However, current RL paradigms predominantly rely on random data shuffling or naive curriculum strategies based on scalar difficulty metrics. We argue that scalar metrics fail to disentangle two orthogonal challenges in video understanding: Visual Temporal Perception Load and Cognitive Reasoning Depth. To address this, we propose VideoCuRL, a novel framework that decomposes difficulty into these two axes. We employ efficient, training-free proxies (optical flow and keyframe entropy for visual complexity, Calibrated Surprisal for cognitive complexity) to map data onto a 2D curriculum grid. A competence-aware Diagonal Wavefront strategy then schedules training from base alignment to complex reasoning. Furthermore, we introduce Dynamic Sparse KL and Structured Revisiting to stabilize training against reward collapse and catastrophic forgetting. Extensive experiments show that VideoCuRL surpasses strong RL baselines on reasoning (+2.5 on VSI-Bench) and perception (+2.9 on VideoMME) tasks. Notably, VideoCuRL eliminates the prohibitive inference overhead of generation-based curricula, offering a scalable solution for robust video post-training.
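The scheduling idea can be sketched as a traversal of the grid by anti-diagonals, starting at the easiest cell (low visual load, low reasoning depth) and moving toward the hardest. The code below is a hedged illustration only: the class `WavefrontScheduler`, the reward `threshold`, and the rule "advance when mean reward reaches the threshold" are assumptions standing in for the competence criterion, which the abstract does not specify. The `revisit_cells` method is likewise a hypothetical stand-in for the idea of replaying earlier cells.

```python
def diagonal_wavefronts(k):
    """Group the cells of a k x k grid by i + j, from (0, 0) up to (k-1, k-1)."""
    return [
        [(i, d - i) for i in range(k) if 0 <= d - i < k]
        for d in range(2 * k - 1)
    ]


class WavefrontScheduler:
    """Hypothetical competence-gated scheduler over diagonal wavefronts."""

    def __init__(self, k, threshold=0.6):
        self.fronts = diagonal_wavefronts(k)
        self.threshold = threshold  # assumed competence gate on mean reward
        self.stage = 0

    def active_cells(self):
        """Cells on the current wavefront."""
        return self.fronts[self.stage]

    def revisit_cells(self):
        """Cells from earlier wavefronts, available for replay."""
        return [cell for front in self.fronts[: self.stage] for cell in front]

    def update(self, mean_reward):
        """Advance one wavefront when the assumed competence gate is met."""
        if mean_reward >= self.threshold and self.stage < len(self.fronts) - 1:
            self.stage += 1
        return self.stage
```

For a 2 x 2 grid, `diagonal_wavefronts(2)` yields three fronts: `[(0, 0)]`, then `[(0, 1), (1, 0)]`, then `[(1, 1)]`, so the two "hard on one axis only" cells are trained together before the cell that is hard on both.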
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Video Understanding
Curriculum Learning
Difficulty Decomposition
Spatiotemporal Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Orthogonal Difficulty Decomposition
Curriculum Reinforcement Learning
VideoLLM
Competence-aware Scheduling
Dynamic Sparse KL
Hongbo Jin
Peking University
LLM, video LLM, 3D LLM
Kuanwei Lin
School of Electronic and Computer Engineering, Peking University
Wenhao Zhang
School of Electronic and Computer Engineering, Peking University
Yichen Jin
School of Electronic and Computer Engineering, Peking University
Ge Li
Full Professor of Computer Science, Peking University
Program Analysis, Program Generation, Deep Learning