Reinforcement World Model Learning for LLM-based Agents

πŸ“… 2026-02-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of current large language model (LLM)-based agents in predicting action outcomes and adapting to dynamic environments, which stem from an inadequate capacity for world modeling. The authors propose a self-supervised approach that constructs an action-conditional world model by aligning the semantic representations of simulated and real environment states within the pretrained language model’s embedding space. Departing from conventional token-prediction paradigms, the method introduces a reward mechanism based on the sim-to-real gap, effectively mitigating issues of model collapse and reward hacking. Experimental results on ALFWorld and τ² Bench demonstrate substantial improvements over baseline methods; when combined with task-success rewards, the approach achieves performance gains of 6.9 and 5.7 percentage points, respectively, matching the level of models trained on expert demonstrations.

πŸ“ Abstract
Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and $\tau^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and $\tau^2$ Bench respectively, while matching the performance of expert-data training.
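The core reward described above can be sketched in a few lines: embed the model's simulated next state and the realized next state, and use their similarity as the reinforcement signal. The sketch below is illustrative only; it substitutes a toy bag-of-words embedding for the paper's pretrained LM embedding space, and the function names (`embed`, `sim_to_real_reward`) are hypothetical, not from the paper.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    # RWML instead embeds states in the pretrained language model's
    # representation space, so paraphrases score as semantically close.
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sim_to_real_reward(simulated_next_state: str, real_next_state: str) -> float:
    # Reward = similarity between the agent's internally simulated next
    # state and the next state actually observed from the environment.
    # High similarity -> the world model tracks real dynamics.
    return cosine(embed(simulated_next_state), embed(real_next_state))

r = sim_to_real_reward(
    "You open the drawer and see a key inside.",
    "The drawer is now open. A key is inside the drawer.",
)
```

Because the reward compares states in an embedding space rather than token by token, a correct prediction phrased differently still earns a high reward, which is what the authors argue makes this signal more robust than next-state token prediction.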
Problem

Research questions and friction points this paper is trying to address.

world model
LLM-based agents
action consequences
environment dynamics
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement World Model Learning
sim-to-real gap rewards
action-conditioned world models
self-supervised learning
LLM-based agents