Reinforcement World Model Learning for LLM-based Agents

πŸ“… 2026-02-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of current large language model (LLM)-based agents in predicting action outcomes and adapting to dynamic environments, which stem from an inadequate capacity for world modeling. The authors propose a self-supervised approach that constructs an action-conditional world model by aligning the semantic representations of simulated and real environment states within the pretrained language model’s embedding space. Departing from conventional token-prediction paradigms, the method introduces a reward mechanism based on the sim-to-real gap, effectively mitigating issues of model collapse and reward hacking. Experimental results on ALFWorld and τ² Bench demonstrate substantial improvements over baseline methods; when combined with task-success rewards, the approach achieves performance gains of 6.9 and 5.7 percentage points, respectively, matching the level of models trained on expert demonstrations.

πŸ“ Abstract
Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and $\tau^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and $\tau^2$ Bench respectively, while matching the performance of expert-data training.
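The core reward described above can be sketched in a few lines: embed the model's simulated next state and the realized next state, and use their similarity as the reinforcement signal. The sketch below is illustrative only; it substitutes a toy bag-of-words embedding for the paper's pretrained LM embedding space, and the function names (`embed`, `sim_to_real_reward`) are hypothetical, not from the paper.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    # RWML instead embeds states in the pretrained language model's
    # representation space, so paraphrases score as semantically close.
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sim_to_real_reward(simulated_next_state: str, real_next_state: str) -> float:
    # Reward = similarity between the agent's internally simulated next
    # state and the next state actually observed from the environment.
    # High similarity -> the world model tracks real dynamics.
    return cosine(embed(simulated_next_state), embed(real_next_state))

r = sim_to_real_reward(
    "You open the drawer and see a key inside.",
    "The drawer is now open. A key is inside the drawer.",
)
```

Because the reward compares states in an embedding space rather than token by token, a correct prediction phrased differently still earns a high reward, which is what the authors argue makes this signal more robust than next-state token prediction.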
Problem

Research questions and friction points this paper is trying to address.

world model
LLM-based agents
action consequences
environment dynamics
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement World Model Learning
sim-to-real gap rewards
action-conditioned world models
self-supervised learning
LLM-based agents