VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

📅 2025-10-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) exhibit weak world modeling in partially observable environments because state estimation and dynamics learning remain implicit and entangled. Method: We propose an explicit visual state reasoning framework grounded in the POMDP formalism. It decouples state estimation from state transition modeling, designs task-adaptive internal belief representations, and integrates a world-model reward with bi-level generalized advantage estimation (Bi-Level GAE) for fine-grained credit assignment. The approach combines reinforcement learning, natural-language reasoning, and structured state representations to train a multi-turn interactive VLM agent. Contribution/Results: Evaluated on five benchmarks, our 3B-parameter model achieves a mean score of 0.82—roughly tripling baseline performance—and significantly outperforms closed-source models including GPT-5, Gemini 2.5 Pro, and Claude 4.5. This work provides the first empirical validation that explicit visual state reasoning is both effective and scalable for enhancing VLM world modeling.

📝 Abstract
A key challenge in training Vision-Language Model (VLM) agents, compared to Large Language Model (LLM) agents, lies in the shift from textual states to complex visual observations. This transition introduces partial observability and demands robust world modeling. We ask: Can VLM agents construct internal world models through explicit visual state reasoning? To address this question, we architecturally enforce and reward the agent's reasoning process via reinforcement learning (RL), formulating it as a Partially Observable Markov Decision Process (POMDP). We find that decomposing the agent's reasoning into State Estimation ("what is the current state?") and Transition Modeling ("what comes next?") is critical for success, as demonstrated through five reasoning strategies. Our investigation into how agents represent internal beliefs reveals that the optimal representation is task-dependent: Natural Language excels at capturing semantic relationships in general tasks, while Structured formats are indispensable for precise manipulation and control. Building on these insights, we design a World Modeling Reward that provides dense, turn-level supervision for accurate state prediction, and introduce Bi-Level Generalized Advantage Estimation (Bi-Level GAE) for turn-aware credit assignment. Through this form of visual state reasoning, a 3B-parameter model achieves a score of 0.82 across five diverse agent benchmarks, representing a 3× improvement over its untrained counterpart (0.21) and outperforming proprietary reasoning models such as GPT-5 (0.75), Gemini 2.5 Pro (0.67), and Claude 4.5 (0.62). All experiments are conducted within our VAGEN framework, a scalable system for training and analyzing multi-turn VLM agents in diverse visual environments. Code and data are publicly available at https://vagen-ai.github.io.
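The abstract's Bi-Level GAE can be pictured as standard generalized advantage estimation applied at two granularities: once across turns and once across the tokens inside each turn. The sketch below is a hypothetical illustration, not the paper's exact algorithm; the function names, the use of zero intermediate token rewards, and the choice of bootstrapping each turn's token-level pass from the turn-level return are all assumptions.

```python
# Hypothetical sketch of bi-level (turn-aware) GAE.
# High level: one reward and one value estimate per turn.
# Low level: token-level advantages inside each turn, bootstrapped
# from the turn-level return. These coupling details are assumptions.

def gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    """Standard generalized advantage estimation over one sequence."""
    advantages = []
    adv = 0.0
    v_next = next_value
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * v_next - v   # TD residual
        adv = delta + gamma * lam * adv  # GAE recursion
        advantages.append(adv)
        v_next = v
    return advantages[::-1]

def bi_level_gae(turn_rewards, turn_values, token_values_per_turn,
                 gamma_turn=1.0, gamma_tok=1.0, lam=0.95):
    # Turn level: each turn's reward could combine the task reward with
    # a dense world-modeling bonus (as the paper's reward design suggests).
    turn_adv = gae(turn_rewards, turn_values, next_value=0.0,
                   gamma=gamma_turn, lam=lam)
    # Token level: tokens earn no intermediate reward; the turn's
    # estimated return (value + advantage) seeds the bootstrap target
    # after the final token of that turn (an assumption of this sketch).
    token_adv = []
    for adv_t, v_t, tok_vals in zip(turn_adv, turn_values,
                                    token_values_per_turn):
        rewards = [0.0] * len(tok_vals)
        token_adv.append(gae(rewards, tok_vals, next_value=v_t + adv_t,
                             gamma=gamma_tok, lam=lam))
    return turn_adv, token_adv
```

The separation lets credit for a good turn flow to every token that produced it, while turns are still compared against each other at the episode level.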
Problem

Research questions and friction points this paper is trying to address.

Addressing partial observability in Vision-Language Model agents
Enhancing world model reasoning through reinforcement learning
Developing optimal belief representations for visual state reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning enforces visual state reasoning process
World Modeling Reward provides dense turn-level supervision
Bi-Level GAE enables turn-aware credit assignment
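The World Modeling Reward listed above can be thought of as a per-turn bonus that scores the agent's predicted next-state description against the environment's ground-truth state. The matching function below (token-level F1) and the scaling coefficient are assumptions for illustration, not the paper's actual reward.

```python
# Hypothetical sketch of a dense world-modeling reward: compare the
# agent's natural-language state prediction with the ground-truth state
# each turn. Token F1 is a stand-in metric chosen for this sketch.

def token_f1(pred: str, ref: str) -> float:
    """F1 overlap between whitespace tokens of prediction and reference."""
    pred_toks, ref_toks = pred.lower().split(), ref.lower().split()
    if not pred_toks or not ref_toks:
        return 0.0
    common = 0
    ref_pool = list(ref_toks)
    for t in pred_toks:
        if t in ref_pool:      # count each reference token at most once
            ref_pool.remove(t)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_toks)
    recall = common / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

def world_model_reward(pred_state: str, true_state: str,
                       coef: float = 0.5) -> float:
    """Turn-level bonus added to the task reward (coef is an assumption)."""
    return coef * token_f1(pred_state, true_state)
```

Because the bonus is available at every turn, it gives the RL objective a dense signal for accurate state prediction even when the sparse task reward arrives only at episode end.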