🤖 AI Summary
This work addresses long-horizon, multi-turn interactive tasks in dynamic web environments. Methodologically, it introduces an end-to-end multi-turn reinforcement learning (RL) framework featuring asynchronous online sampling of diverse trajectories and binary sparse reward signals tied to task success; a thinking-based, chain-of-thought (CoT) prompting strategy coupled with test-time scaling through increased interactions; and an empirical study of RL initialization policies, via the WebAgent-R1-Zero and WebAgent-R1-CoT variants, showing that warm-up behavior cloning and long-CoT initialization are important components. Evaluated on WebArena-Lite, the approach boosts success rates from 6.1% to 33.9% for Qwen-2.5-3B and from 8.5% to 44.8% for Llama-3.1-8B, substantially outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. The core contribution is a simple yet effective end-to-end multi-turn RL paradigm that lets web agents learn long-horizon decision-making directly from online interaction with web environments.
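To make the training signal concrete, here is a minimal, hypothetical sketch of a multi-turn rollout with a binary sparse reward: the agent interacts with the environment for several turns, and only the final task outcome yields a reward of 1 or 0, with no per-step shaping. The `ToyWebEnv` interface, the `fill`/`submit` actions, and `scripted_policy` are illustrative inventions, not the paper's actual environment or API.

```python
class ToyWebEnv:
    """Toy stand-in for a web environment (hypothetical interface):
    the task succeeds if the agent issues 'submit' after three 'fill' actions."""

    def reset(self):
        self.filled = 0
        return {"filled": self.filled}

    def step(self, action):
        done, success = False, False
        if action == "fill":
            self.filled += 1
        elif action == "submit":
            done = True
            success = (self.filled == 3)
        return {"filled": self.filled}, done, success


def rollout(env, policy, max_turns=10):
    """One multi-turn episode. Only a final binary reward is observed:
    1.0 on task success, 0.0 otherwise -- no intermediate shaping."""
    obs = env.reset()
    trajectory = []
    success = False
    for _ in range(max_turns):
        action = policy(obs)
        next_obs, done, success = env.step(action)
        trajectory.append((obs, action))
        obs = next_obs
        if done:
            break
    return trajectory, 1.0 if success else 0.0


def scripted_policy(obs):
    # Fills three fields, then submits; succeeds on the toy task.
    return "fill" if obs["filled"] < 3 else "submit"


traj, reward = rollout(ToyWebEnv(), scripted_policy)
```

In the actual framework, many such trajectories are sampled asynchronously from live web environments and the binary outcome rewards drive the policy update.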
📝 Abstract
While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, guided entirely by binary rewards that depend on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and of Llama-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and of test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.