AI Summary
Current video world models struggle with precise human action following and suffer from a lack of large-scale preference annotations and verifiable, rule-based reward signals. To address these challenges, this paper proposes RLIR (Reinforcement Learning with Inverse Rewards), an inverse-reward reinforcement learning framework. RLIR introduces, for the first time, an inverse dynamics model to reconstruct action sequences from generated videos, yielding low-dimensional, annotation-free, and verifiable reward signals. Integrated with Group Relative Policy Optimization (GRPO), RLIR enables efficient post-training on both autoregressive and diffusion-based video generation architectures. On multiple benchmarks, RLIR achieves 5-10% improvements in action-following accuracy and up to 10% gains in video visual quality, while attaining significantly higher human preference scores than state-of-the-art methods. This work establishes a novel unsupervised paradigm for video-action alignment.
Abstract
World models simulate dynamic environments, enabling agents to interact with diverse input modalities. Although recent advances have improved the visual quality and temporal consistency of video world models, their ability to accurately model human-specified actions remains under-explored. Reinforcement learning presents a promising approach for directly improving the suboptimal action-following capability of pre-trained models, assuming that an appropriate reward function can be defined. However, transferring reinforcement learning post-training methods to world models is impractical due to the prohibitive cost of large-scale preference annotations and the infeasibility of constructing rule-based video verifiers. To address this gap, we propose Reinforcement Learning with Inverse Rewards (RLIR), a post-training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model. By mapping the high-dimensional video modality to a low-dimensional action space, RLIR provides an objective and verifiable reward for optimization via Group Relative Policy Optimization. Experiments across autoregressive and diffusion paradigms demonstrate 5-10% gains in action-following, up to 10% improvements in visual quality, and higher human preference scores, establishing RLIR as the first post-training method specifically designed to enhance action-following in video world models.
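The core mechanism described above can be sketched in a few lines: an Inverse Dynamics Model recovers the action sequence from a generated video, the reward is the (negative) discrepancy between the recovered and the user-specified actions, and GRPO turns a group of such rewards into relative advantages without a learned value function. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes continuous action vectors and uses mean squared error as the reconstruction discrepancy, and `inverse_reward` / `grpo_advantages` are hypothetical helper names.

```python
import numpy as np

def inverse_reward(recovered_actions, input_actions):
    # Hypothetical verifiable reward: negative mean squared error between
    # the actions the Inverse Dynamics Model recovers from a generated
    # video and the actions the user originally specified. A perfect
    # action-following rollout scores 0; worse rollouts score lower.
    recovered = np.asarray(recovered_actions, dtype=float)
    target = np.asarray(input_actions, dtype=float)
    return -float(np.mean((recovered - target) ** 2))

def grpo_advantages(group_rewards):
    # GRPO-style group-relative advantage: each sampled rollout's reward
    # is standardized against the mean and std of its own group, so no
    # separate value network is needed.
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: three rollouts for the same action prompt, scored by the IDM.
target = np.array([1.0, 0.0])
rollouts = [np.array([1.0, 0.0]),   # follows the action exactly
            np.array([0.5, 0.2]),   # partially follows
            np.array([-1.0, 1.0])]  # ignores the action
rewards = [inverse_reward(a, target) for a in rollouts]
advantages = grpo_advantages(rewards)
```

The faithful rollout receives the highest advantage and the off-action rollout the lowest, which is the signal the policy update amplifies.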