🤖 AI Summary
This work addresses the challenge of applying Group Relative Policy Optimization (GRPO) to open-world vision-language model (VLM) agents in multi-turn reinforcement learning, where GRPO’s reliance on complete trajectories as training samples leads to excessively long contexts and noise. To overcome these limitations, the authors propose GROW, a framework that adapts GRPO to open-world VLM agents by decomposing trajectories into state-action samples and computing advantage estimates across these samples. A surrogate analysis indicates that, under simplifying assumptions, this approach preserves GRPO’s core relative optimization signal even though the grouped samples are conditioned on different local states. Evaluated on more than 800 Minecraft tasks, GROW achieves state-of-the-art performance, demonstrating its effectiveness in complex, open-ended environments.
📝 Abstract
Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations. Advanced reinforcement learning (RL) algorithms, specifically Group Relative Policy Optimization (GRPO), have not been effectively employed for multi-turn RL in these tasks, because standard GRPO requires full trajectories as training samples, which leads to excessively long contexts and noise. To address this issue, we propose GROW, an RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate analysis indicating that, even though the grouped samples are conditioned on different local states rather than an identical prompt context, the objective can preserve the core relative policy optimization signal of GRPO under simplifying assumptions. Experiments on more than 800 Minecraft tasks show that our method achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of the proposed RL framework for open-world VLM agents.
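The decomposition described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Sample` class, the `decompose` and `group_relative_advantages` helpers, and the string stand-ins for visual observations are all hypothetical. In particular, the abstract does not say how rewards are assigned to individual state-action samples, so the sketch assumes each sample simply inherits its trajectory's final return, and it assumes the usual GRPO-style normalization (subtract the group mean, divide by the group standard deviation).

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# A trajectory here is (list of (state, action) steps, final return).
Trajectory = Tuple[Sequence[Tuple[str, str]], float]


@dataclass
class Sample:
    """One state-action training sample (hypothetical structure)."""
    state: str       # stand-in for a local observation, e.g. a game frame
    action: str
    reward: float    # ASSUMPTION: copied from the trajectory's final return
    advantage: float = 0.0


def decompose(trajectories: Sequence[Trajectory]) -> List[Sample]:
    """Split each trajectory into independent state-action samples."""
    samples: List[Sample] = []
    for steps, final_return in trajectories:
        for state, action in steps:
            samples.append(Sample(state, action, final_return))
    return samples


def group_relative_advantages(samples: List[Sample], eps: float = 1e-8) -> List[Sample]:
    """Normalize rewards across the group of samples (GRPO-style, assumed)."""
    rewards = [s.reward for s in samples]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    for s in samples:
        s.advantage = (s.reward - mu) / (sigma + eps)
    return samples


# Two rollouts of the same task: one succeeds (return 1.0), one fails (0.0).
rollouts: List[Trajectory] = [
    ([("frame_0", "move_forward"), ("frame_1", "attack")], 1.0),
    ([("frame_0", "turn_left"), ("frame_1b", "jump")], 0.0),
]
group = group_relative_advantages(decompose(rollouts))
for s in group:
    print(s.state, s.action, round(s.advantage, 3))
```

Under these assumptions, each training sample carries only its local state and a single action, so the policy update never needs the full multi-turn trajectory in context, while the advantage still reflects how the sample's rollout compared with the others in its group.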