🤖 AI Summary
This work addresses the tendency of current video large language models to answer through shortcuts, such as single-frame cues or linguistic priors, instead of tracking the spatiotemporal dynamics of the video, a tendency that correctness-only rewards in reinforcement learning post-training can reinforce. To mitigate this, the authors propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch reinforcement learning framework built around counterfactual video pairs. CRPO constructs counterfactual videos via horizontal flipping and temporal reversal, trains on both the original and counterfactual branches, and adds a Counterfactual Relation Reward (CRR) that encourages the model to change its answer for dynamic questions and keep it unchanged for static ones, thereby improving spatiotemporal sensitivity. The study also introduces DyBench, a paired counterfactual video benchmark with 3,014 videos, along with a strict pair-accuracy metric. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model while maintaining competitive general video performance.
📝 Abstract
Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/.
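To make the mechanics described in the abstract concrete, the following is a minimal Python sketch of the two counterfactual transforms, a relation-style reward, and a pair-level accuracy. It is not the authors' implementation. The function names, the nested-list video representation, the binary 0/1 form of the reward, and the "both answers in a pair must be correct" reading of pair accuracy are all assumptions made for illustration; the abstract does not specify the exact formulas.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]  # one frame: rows of pixel values (toy representation)
Video = List[Frame]      # a video: frames in temporal order


def horizontal_flip(video: Video) -> Video:
    """Mirror every frame left-to-right, which swaps left/right motion."""
    return [[row[::-1] for row in frame] for frame in video]


def temporal_reverse(video: Video) -> Video:
    """Reverse the frame order, which swaps what happens first and last."""
    return video[::-1]


def relation_reward(ans_orig: str, ans_cf: str, is_dynamic: bool) -> float:
    """Assumed binary relation reward (hypothetical form, not from the paper).

    Dynamic question: reward 1.0 only if the two answers differ.
    Static question: reward 1.0 only if the two answers match.
    """
    changed = ans_orig.strip().lower() != ans_cf.strip().lower()
    return 1.0 if changed == is_dynamic else 0.0


def pair_accuracy(results: Sequence[Tuple[bool, bool]]) -> float:
    """Assumed strict pair metric: a pair counts only if both the original
    and the counterfactual video are answered correctly."""
    if not results:
        return 0.0
    return sum(1 for orig_ok, cf_ok in results if orig_ok and cf_ok) / len(results)


# Toy video: two frames, each 2x2.
video: Video = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
flipped = horizontal_flip(video)    # [[[2, 1], [4, 3]], [[6, 5], [8, 7]]]
reversed_ = temporal_reverse(video)  # [[[5, 6], [7, 8]], [[1, 2], [3, 4]]]

# A policy that always answers "left" is right on one video of each pair only.
fixed_answer_results = [(True, False), (False, True)]
print(pair_accuracy(fixed_answer_results))  # 0.0
```

Under these assumptions, a fixed-answer policy scores 50% when videos are graded individually but 0.0 at the pair level, which illustrates why a strict pair metric resists that shortcut. The same policy also earns no relation reward on dynamic questions, since its answer never changes between branches.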