Video Models Can Reason with Verifiable Rewards

Date: 2026-05-14
Citations: 0
Influential citations: 0

AI Summary
Existing video diffusion models generate photorealistic and temporally coherent videos, but they perform poorly on verifiable reasoning tasks that require adherence to explicit spatial, temporal, or logical constraints. This work proposes VideoRLVR, a method that formulates video reasoning as the generation of verifiable visual trajectories and optimizes the diffusion process with rule-based dense reward signals. It combines these rewards with an Early-Step Focus strategy, which updates the policy only during the early denoising steps and reduces training latency by approximately 40%. Built on the SDE-GRPO reinforcement learning framework, VideoRLVR consistently improves over supervised fine-tuning baselines and outperforms the evaluated open-source and proprietary video generation models on the Maze, FlowFree, and Sokoban benchmarks.
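As a rough illustration of how an Early-Step Focus update could be combined with group-relative advantages, the sketch below normalizes rewards within a group of sampled videos and zeroes the update weight for late denoising steps. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the `early_fraction` cutoff, and the 0.5 default are hypothetical, since the summary does not say where the early phase ends.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def early_step_mask(num_steps, early_fraction):
    """Return 1.0 for denoising steps that receive updates, else 0.0.

    `early_fraction` is a hypothetical knob, not a value from the paper.
    """
    cutoff = max(1, int(num_steps * early_fraction))
    return [1.0 if t < cutoff else 0.0 for t in range(num_steps)]


def per_step_weights(rewards, num_steps, early_fraction=0.5):
    """Per-sample, per-step weights: advantage on early steps, zero later."""
    advantages = group_relative_advantages(rewards)
    mask = early_step_mask(num_steps, early_fraction)
    return [[a * m for m in mask] for a in advantages]


if __name__ == "__main__":
    # Four sampled videos for one prompt, scored by a rule-based verifier.
    weights = per_step_weights([1.0, 0.0, 0.0, 1.0], num_steps=10)
    for row in weights:
        print([round(w, 3) for w in row])
```

Steps with zero weight contribute nothing to the policy loss, so their gradients need not be computed, which is one plausible source of the reported latency saving.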
Abstract
Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
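To make the idea of a dense decomposed, rule-based reward concrete, here is a small sketch for a Maze-like task. It assumes the agent's path has already been extracted from the generated frames as a list of grid cells, and it scores validity, progress, and success separately. The component names, the weights, and the grid encoding (`#` for walls) are illustrative assumptions, not the reward design reported in the paper.

```python
def is_legal_step(grid, a, b):
    """A step is legal if it moves one cell and lands on a free, in-bounds cell."""
    (r1, c1), (r2, c2) = a, b
    in_bounds = 0 <= r2 < len(grid) and 0 <= c2 < len(grid[0])
    adjacent = abs(r1 - r2) + abs(c1 - c2) == 1
    return in_bounds and adjacent and grid[r2][c2] != "#"


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def maze_reward(grid, path, start, goal, weights=(0.3, 0.3, 0.4)):
    """Score a path with three components; `weights` are hypothetical."""
    zero = {"validity": 0.0, "progress": 0.0, "success": 0.0, "total": 0.0}
    if not path or path[0] != start:
        return zero
    legal = [is_legal_step(grid, a, b) for a, b in zip(path, path[1:])]
    # Fraction of moves that obey the maze rules (0.0 if the agent never moves).
    validity = sum(legal) / len(legal) if legal else 0.0
    # How much closer to the goal the final cell is, relative to the start.
    d0 = manhattan(start, goal)
    progress = max(0.0, (d0 - manhattan(path[-1], goal)) / d0) if d0 else 1.0
    # Sparse term: every move legal and the goal reached.
    success = 1.0 if legal and all(legal) and path[-1] == goal else 0.0
    w_valid, w_prog, w_succ = weights
    total = w_valid * validity + w_prog * progress + w_succ * success
    return {"validity": validity, "progress": progress,
            "success": success, "total": total}


if __name__ == "__main__":
    grid = ["..#",
            ".##",
            "..."]
    solved = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
    failed = [(0, 0), (0, 1), (0, 2)]  # second move walks into a wall
    print(maze_reward(grid, solved, (0, 0), (2, 2)))
    print(maze_reward(grid, failed, (0, 0), (2, 2)))
```

The failed path still earns partial credit for its legal move and its progress toward the goal. That kind of graded signal is what a decomposed reward offers over a pure success flag when successful samples are rare.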
Problem

Research questions and friction points this paper is trying to address.

video diffusion models
verifiable reasoning
spatial constraints
temporal constraints
logical constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

VideoRLVR
verifiable reasoning
diffusion models
reinforcement learning
dense decomposed rewards