WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

📅 2024-11-04
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Open large language models (LLMs) face three key challenges in web-based agent tasks: scarcity of training tasks, sparse feedback signals, and online policy drift. To address these, we propose an end-to-end trainable framework comprising: (1) a novel self-evolving online curriculum generation mechanism that dynamically constructs progressively challenging tasks from failure trajectories; (2) a robust outcome supervision reward model (ORM) to mitigate annotation noise and alleviate reward sparsity; and (3) an adaptive PPO algorithm resilient to distributional shift, enabling stable online learning. Experiments on WebArena-Lite using Llama-3.1-8B and GLM-4-9B achieve success rates of 42.4% and 43.0%, respectively—substantially outperforming GPT-4-Turbo (17.6%) and the open-source SOTA AutoWebGLM (18.2%). Our approach demonstrates significant advances in generalization, robustness, and practical deployability for web agents.
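The three components above form an iterative loop: roll out the policy, score full trajectories with the ORM, grow the curriculum from failures, and update the policy. The sketch below is a minimal, runnable illustration of that loop only; the environment rollout, the ORM, and the policy update are stubbed, and all function names and the scalar "skill" proxy are hypothetical rather than the paper's implementation.

```python
import random

def run_episode(policy, task):
    # Hypothetical rollout of the web agent on one task; stubbed with a
    # random success draw so the sketch runs without a browser environment.
    success = random.random() < policy["skill"]
    return {"task": task, "success": success}

def orm_score(trajectory):
    # Stand-in for the outcome-supervised reward model (ORM): maps a full
    # trajectory to a scalar reward, mitigating sparse per-step feedback.
    return 1.0 if trajectory["success"] else 0.0

def evolve_tasks(failed_tasks, n_variants=2):
    # Self-evolving curriculum: derive new task variants from failures.
    # The paper generates these with an LLM; here we just tag copies.
    return [f"{t} / variant {i}" for t in failed_tasks for i in range(n_variants)]

def train_webrl(seed_tasks, iterations=3):
    policy = {"skill": 0.3}          # toy scalar proxy for policy quality
    tasks = list(seed_tasks)
    for _ in range(iterations):
        failures = []
        for task in tasks:
            traj = run_episode(policy, task)
            reward = orm_score(traj)
            if reward < 0.5:
                failures.append(task)
            # Placeholder for the adaptive (KL-constrained) policy update.
            policy["skill"] = min(1.0, policy["skill"] + 0.05 * reward)
        tasks = evolve_tasks(failures) or tasks   # refresh the curriculum
    return policy, tasks

policy, curriculum = train_webrl(["book a one-way flight on the demo site"])
```

In the real system each stub is heavyweight: rollouts run in a browser environment (WebArena-Lite), the ORM is a trained model rather than a success flag, and the update is a constrained actor-critic step that limits drift from the previous policy.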

📝 Abstract
Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.
Problem

Research questions and friction points this paper addresses.

Cost Efficiency · Decision-making Capabilities · Online Learning Stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

WebRL · Large Language Models · Self-improvement Learning
👥 Authors
Zehan Qi (Tsinghua University)
Xiao Liu (Tsinghua University and Zhipu AI)
Iat Long Iong (Tsinghua University)
Hanyu Lai (Tsinghua University)
Xueqiao Sun (Tsinghua University)
Xinyue Yang (Zhipu AI)
Jiadai Sun (Zhipu AI)
Yu Yang (Zhipu AI)
Shuntian Yao (Zhipu AI)
Wei Xu (Tsinghua University)
Jie Tang (Tsinghua University)
Yuxiao Dong (Tsinghua University)