GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

📅 2026-05-18

🤖 AI Summary
This work addresses the challenge of applying Group Relative Policy Optimization (GRPO) to open-world vision-language model (VLM) agents in multi-turn reinforcement learning, where GRPO's reliance on complete trajectories leads to excessively long contexts and noisy training samples. The authors propose GROW, a framework that decomposes trajectories into state-action samples and computes advantage estimates across these samples. According to the authors' surrogate analysis, this preserves GRPO's core relative optimization signal under simplifying assumptions while avoiding the context-length and noise problems of full-trajectory training. Evaluated on more than 800 Minecraft tasks, GROW is reported to achieve state-of-the-art performance.
📝 Abstract
Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations. Advanced reinforcement learning (RL) algorithms, specifically Group Relative Policy Optimization (GRPO), have not been effectively employed for multi-turn RL in these tasks, because standard GRPO requires full trajectories as training samples, which leads to excessively long contexts and noise. To address this issue, we propose GROW, an RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate analysis indicating that, even though the grouped samples are conditioned on different local states rather than an identical prompt context, the objective can preserve the core relative policy optimization signal of GRPO under simplifying assumptions. Experiments on more than 800 Minecraft tasks show that our method achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of our proposed RL framework for open-world VLM agents.
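The decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how rewards are assigned to individual steps or how the group is normalized, so the sketch assumes that every state-action sample inherits its trajectory's scalar return and that advantages use the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation) over the pooled samples. The names `Sample`, `decompose`, and `group_relative_advantages` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Sample:
    state: str            # placeholder for a local observation (e.g. a frame id)
    action: str           # placeholder for the action taken in that state
    reward: float         # ASSUMPTION: each step inherits its trajectory's return
    advantage: float = 0.0


def decompose(trajectories):
    """Flatten (steps, return) trajectories into independent state-action samples."""
    samples = []
    for steps, trajectory_return in trajectories:
        for state, action in steps:
            samples.append(Sample(state, action, trajectory_return))
    return samples


def group_relative_advantages(samples, eps=1e-8):
    """GRPO-style normalization, applied across samples instead of whole trajectories."""
    rewards = [s.reward for s in samples]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    for s in samples:
        s.advantage = (s.reward - mu) / (sigma + eps)
    return samples


# Two hypothetical rollouts of the same task: one succeeds (return 1.0), one fails (0.0).
rollouts = [
    ([("s0", "a0"), ("s1", "a1")], 1.0),
    ([("s0", "b0")], 0.0),
]
for sample in group_relative_advantages(decompose(rollouts)):
    print(sample.state, sample.action, round(sample.advantage, 3))
```

Under these assumptions each training example is a single short state-action pair rather than a full multi-turn context, which is the property the abstract highlights. Samples from the successful rollout receive positive advantages and the sample from the failed rollout receives a negative one.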
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Model
Open-World Tasks
Reinforcement Learning
GRPO
Multi-turn Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Relative Policy Optimization
State-Action Modeling
Open-World VLM Agents
Trajectory Decomposition
Reinforcement Learning
Xiongbin Wu
Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
Zhihao Luo
Shanghai Artificial Intelligence Laboratory; East China Normal University
Shanzhe Lei
Shanghai Artificial Intelligence Laboratory
Lechao Zhang
Shanghai Artificial Intelligence Laboratory; East China Normal University
Xuhong Wang
Shanghai Artificial Intelligence Laboratory
LLM, Knowledge System, AI Simulation
Jie Yang
Shanghai Jiao Tong University
Image Processing, Medical Image Processing, Pattern Recognition
Zhonglong Zheng
Zhejiang Normal University
Yuanjie Zheng
Shandong Normal University
Xin Tan
Research Professor, East China Normal University & Shanghai AI Laboratory
3D Vision, Trustworthy Embodied AI
Wei Liu
Shanghai Jiao Tong University