AI Summary
Current video world models struggle with precise human action following and suffer from a lack of large-scale preference annotations and verifiable, rule-based reward signals. To address these challenges, this paper proposes RLIR (Reinforcement Learning with Inverse Rewards), an inverse-reward reinforcement learning framework. RLIR introduces, for the first time, an inverse dynamics model to reconstruct action sequences from generated videos, yielding low-dimensional, annotation-free, and verifiable reward signals. Integrated with Group Relative Policy Optimization (GRPO), RLIR enables efficient post-training on both autoregressive and diffusion-based video generation architectures. On multiple benchmarks, RLIR achieves 5-10% improvements in action-following accuracy and up to 10% gains in video visual quality, while attaining significantly higher human preference scores than state-of-the-art methods. This work establishes a novel unsupervised paradigm for video-action alignment.
Abstract
World models simulate dynamic environments, enabling agents to interact with diverse input modalities. Although recent advances have improved the visual quality and temporal consistency of video world models, their ability to accurately model human-specified actions remains under-explored. Reinforcement learning presents a promising approach for directly improving the suboptimal action-following capability of pre-trained models, assuming that an appropriate reward function can be defined. However, transferring reinforcement learning post-training methods to world models is impractical due to the prohibitive cost of large-scale preference annotations and the infeasibility of constructing rule-based video verifiers. To address this gap, we propose Reinforcement Learning with Inverse Rewards (RLIR), a post-training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model. By mapping the high-dimensional video modality to a low-dimensional action space, RLIR provides an objective and verifiable reward for optimization via Group Relative Policy Optimization. Experiments across autoregressive and diffusion paradigms demonstrate 5-10% gains in action-following, up to 10% improvements in visual quality, and higher human preference scores, establishing RLIR as the first post-training method specifically designed to enhance action-following in video world models.
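The core mechanism described above can be sketched in a few lines: an Inverse Dynamics Model recovers the action sequence from a generated video, the reward is the (negative) discrepancy between the recovered and the user-specified actions, and GRPO turns a group of such rewards into relative advantages without a learned value function. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes continuous action vectors and uses mean squared error as the reconstruction discrepancy, and `inverse_reward` / `grpo_advantages` are hypothetical helper names.

```python
import numpy as np

def inverse_reward(recovered_actions, input_actions):
    # Hypothetical verifiable reward: negative mean squared error between
    # the actions the Inverse Dynamics Model recovers from a generated
    # video and the actions the user originally specified. A perfect
    # action-following rollout scores 0; worse rollouts score lower.
    recovered = np.asarray(recovered_actions, dtype=float)
    target = np.asarray(input_actions, dtype=float)
    return -float(np.mean((recovered - target) ** 2))

def grpo_advantages(group_rewards):
    # GRPO-style group-relative advantage: each sampled rollout's reward
    # is standardized against the mean and std of its own group, so no
    # separate value network is needed.
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: three rollouts for the same action prompt, scored by the IDM.
target = np.array([1.0, 0.0])
rollouts = [np.array([1.0, 0.0]),   # follows the action exactly
            np.array([0.5, 0.2]),   # partially follows
            np.array([-1.0, 1.0])]  # ignores the action
rewards = [inverse_reward(a, target) for a in rollouts]
advantages = grpo_advantages(rewards)
```

The faithful rollout receives the highest advantage and the off-action rollout the lowest, which is the signal the policy update amplifies.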