🤖 AI Summary
To address the rigid reasoning processes and suppressed active learning in existing LLM-based agent world models, this paper proposes a dynamic reasoning framework grounded in multi-turn interaction. Methodologically, it introduces: (1) a reward rescaling mechanism that adaptively modulates environmental feedback to guide exploratory policy learning; (2) an interaction-frequency annealing strategy that progressively reduces the number of interaction rounds, encouraging a shift from external interaction to internal mental simulation; and (3) adaptive optimization of the reasoning structure driven by action effectiveness. Rather than imposing a fixed reasoning paradigm, the framework enables LLMs to autonomously construct and internalize environmental dynamics through embodied action. Empirical evaluation on Sokoban, Maze, and Taxi tasks demonstrates substantial improvements in single-step planning capability, and the method exhibits strong generalization and sample efficiency in complex scenarios and on cross-task reasoning benchmarks.
📝 Abstract
Developing robust world model reasoning is crucial for large language model (LLM) agents to plan and interact in complex environments. While multi-turn interaction offers a superior understanding of environmental dynamics via authentic feedback, current approaches often impose a rigid reasoning process that constrains the model's active learning and ultimately hinders efficient world model reasoning. To address these issues, we explore world-model internalization through efficient interaction and active reasoning (WMAct), which liberates the model from structured reasoning, allowing it to shape its thinking directly through its actions. WMAct achieves effective and efficient world model reasoning with two key mechanisms: (1) a reward rescaling mechanism that adjusts the outcome reward based on action efficacy, incentivizing redundancy reduction and purposeful interaction; and (2) an interaction-frequency annealing strategy that progressively reduces the maximum number of allowed interaction turns, compelling the model to condense its learning and internalize environmental dynamics rather than over-rely on environmental cues. Our experiments on Sokoban, Maze, and Taxi show that WMAct yields effective world model reasoning, resolving in a single turn tasks that previously required multiple interactions, and fosters strong transferability to complex environments, improving performance on a suite of reasoning benchmarks.
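To make the two mechanisms concrete, here is a minimal sketch of how reward rescaling by action efficacy and interaction-frequency annealing might look. The abstract gives no equations, so the multiplicative rescaling form, the linear annealing schedule, and all names and parameters below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of WMAct's two mechanisms; formulas are assumed,
# not taken from the paper.

def rescale_reward(outcome_reward: float,
                   effective_actions: int,
                   total_actions: int) -> float:
    """Scale the outcome reward by action efficacy (fraction of actions
    that changed the environment), penalizing redundant interaction.
    Assumes a simple multiplicative form."""
    if total_actions == 0:
        return outcome_reward
    efficacy = effective_actions / total_actions
    return outcome_reward * efficacy


def max_allowed_turns(step: int,
                      start_turns: int = 8,
                      min_turns: int = 1,
                      total_steps: int = 1000) -> int:
    """Anneal the maximum number of interaction turns over training,
    forcing the model to rely on internal simulation. Assumes a
    linear schedule from start_turns down to min_turns."""
    frac = min(step / total_steps, 1.0)
    turns = round(start_turns - frac * (start_turns - min_turns))
    return max(min_turns, turns)
```

For example, an episode that succeeds (reward 1.0) but wastes half its actions would keep only half the reward under this assumed scheme, while the turn budget shrinks from 8 toward 1 as training progresses.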