RLVR-World: Training World Models with Reinforcement Learning

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard training objectives for world models, such as maximum likelihood estimation (MLE), often misalign with the metrics that matter downstream, such as transition prediction accuracy or perceptual quality. RLVR-World is a unified framework that uses reinforcement learning with verifiable rewards (RLVR) to optimize world models directly for those metrics. World modeling is still formulated as autoregressive prediction over tokenized sequences, but the reward is a task metric computed on the decoded prediction rather than a token-level likelihood. The paper reports substantial gains for both language- and video-based world models across text games, web navigation, and robot manipulation, suggesting that RLVR is a promising post-training paradigm for generative models beyond reasoning language models.

📝 Abstract

World models predict state transitions in response to actions and are increasingly developed across diverse modalities. However, standard training objectives such as maximum likelihood estimation (MLE) often misalign with task-specific goals of world models, i.e., transition prediction metrics like accuracy or perceptual quality. In this paper, we present RLVR-World, a unified framework that leverages reinforcement learning with verifiable rewards (RLVR) to directly optimize world models for such metrics. Despite formulating world modeling as autoregressive prediction of tokenized sequences, RLVR-World evaluates metrics of decoded predictions as verifiable rewards. We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation. Our work indicates that, beyond recent advances in reasoning language models, RLVR offers a promising post-training paradigm for enhancing the utility of generative models more broadly.
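To make the idea of "metrics of decoded predictions as verifiable rewards" concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the exact-match metric for text states, and the negative mean squared error for frame-like states are illustrative assumptions standing in for whatever accuracy or perceptual metric a given task uses.

```python
from typing import Callable, Sequence, TypeVar

State = TypeVar("State")


def exact_match_reward(predicted: str, target: str) -> float:
    """Hypothetical text metric: 1.0 if the decoded next state matches the target."""
    return 1.0 if predicted.strip() == target.strip() else 0.0


def negative_mse_reward(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Hypothetical frame metric: negative mean squared error (higher is better)."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return -sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def verifiable_reward(
    token_ids: Sequence[int],
    decode: Callable[[Sequence[int]], State],
    metric: Callable[[State, State], float],
    target: State,
) -> float:
    """Score a sampled token sequence by decoding it first, then applying the task metric.

    The reward depends on the decoded prediction, not on per-token likelihood.
    """
    return metric(decode(token_ids), target)


# Toy usage with an assumed three-word vocabulary.
VOCAB = {0: "door", 1: "open", 2: "closed"}


def toy_decode(ids: Sequence[int]) -> str:
    return " ".join(VOCAB[i] for i in ids)


reward_hit = verifiable_reward([0, 1], toy_decode, exact_match_reward, "door open")
reward_miss = verifiable_reward([0, 2], toy_decode, exact_match_reward, "door open")
```

The point of the design is that `metric` can be any checkable function of the decoded output, so the same loop could in principle serve text and video predictions by swapping `decode` and `metric`.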

Problem

Research questions and friction points this paper is trying to address.

Misalignment between standard training objectives and task-specific world model goals

Need for direct optimization of world models for transition prediction metrics

Enhancing utility of generative models via reinforcement learning with verifiable rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reinforcement learning with verifiable rewards

Optimizes world models for task-specific metrics

Applies to both language- and video-based world models across domains such as text games, web navigation, and robot manipulation
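As a rough illustration of how per-sample rewards can become a training signal, the sketch below normalizes rewards within a group of predictions sampled for the same input. This page does not say which RL algorithm is used, so group-relative normalization (in the style of GRPO-like methods) is an assumption here, and `group_relative_advantages` is a hypothetical helper name.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of sampled predictions.

    Samples that beat the group average get a positive advantage, the rest a
    negative one. `eps` avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled next-state predictions for one (state, action) pair,
# scored by some verifiable metric (values assumed for illustration).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In a policy-gradient update, each advantage would weight the log-probability of the corresponding sampled token sequence, so predictions that score well on the task metric are made more likely.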

🔎 Similar Papers

PWM: Policy Learning with Multi-Task World Models