🤖 AI Summary
LLM-based agents frequently fail on real-world web navigation tasks because they struggle to parse complex, noisy HTML structures. To address this, the authors propose LCoW, a framework that decouples *web contextualization* from *decision making* via a lightweight, trainable contextualization module. This module transforms raw HTML into semantically enriched, structurally simplified observations that preserve essential layout and functional cues, while high-level reasoning is delegated to an off-the-shelf LLM agent. The module is trained separately from the agent and is plug-and-play compatible with diverse open- and closed-source LLMs. Experiments show that LCoW raises task success rates on WorkArena by an average of 15.6% for closed-source LLMs and 23.7% for open-source LLMs. On WebShop, a Gemini-1.5-flash agent with LCoW achieves state-of-the-art performance, marking the first instance of an automated agent surpassing human experts on that benchmark.
📝 Abstract
Recent advances in large language models (LLMs) have led to growing interest in developing LLM-based agents for automating web tasks. However, these agents often struggle with even simple tasks on real-world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module that transforms complex web pages into a comprehensible format, which is then utilized by the decision-making agent. We demonstrate that our contextualization module integrates effectively with LLM agents of various scales to significantly enhance their decision-making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed-source LLMs (e.g., Gemini-1.5-flash, GPT-4o, Claude-3.5-Sonnet) by an average of 15.6%, and yields a 23.7% average improvement in success rates for open-source LMs (e.g., Llama-3.1-8B, Llama-3.1-70B) on the WorkArena benchmark. Moreover, the Gemini-1.5-flash agent with LCoW achieves state-of-the-art results on the WebShop benchmark, outperforming human experts. The relevant code materials are available at our project page: https://lcowiclr2025.github.io.
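To make the decoupled design concrete, here is a minimal sketch of the interface the abstract describes: a contextualization step that condenses raw HTML into a compact observation, followed by a separate decision-making step. This is an illustrative toy only; the names `contextualize` and `agent_step` are hypothetical, and in LCoW the contextualizer is a fine-tuned language model rather than the rule-based HTML filter used here, while `agent_step` would be a call to an off-the-shelf LLM.

```python
from html.parser import HTMLParser


class ToyContextualizer(HTMLParser):
    """Rule-based stand-in for LCoW's trained contextualization module.

    It reduces raw HTML to a numbered list of interactive elements,
    discarding surrounding clutter. The real module is a fine-tuned LM;
    this only illustrates the decoupled interface."""

    INTERACTIVE = {"a", "button", "select"}

    def __init__(self):
        super().__init__()
        self.elements = []      # simplified observation lines
        self._capture = None    # (tag, label) currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.INTERACTIVE:
            attrs = dict(attrs)
            label = attrs.get("aria-label") or ""
            self._capture = [tag, label]

    def handle_data(self, data):
        # Use inner text as the label if no aria-label was present.
        if self._capture and data.strip() and not self._capture[1]:
            self._capture[1] = data.strip()

    def handle_endtag(self, tag):
        if self._capture and tag == self._capture[0]:
            idx = len(self.elements)
            self.elements.append(f"[{idx}] <{tag}> {self._capture[1]}")
            self._capture = None


def contextualize(raw_html: str) -> str:
    """Contextualization step: raw HTML -> compact observation."""
    parser = ToyContextualizer()
    parser.feed(raw_html)
    return "\n".join(parser.elements)


def agent_step(observation: str, goal: str) -> str:
    """Decision-making step: stand-in for any off-the-shelf LLM agent.

    In LCoW this would prompt e.g. GPT-4o or Gemini-1.5-flash with the
    simplified observation; here we just assemble the prompt."""
    return f"GOAL: {goal}\nOBSERVATION:\n{observation}"


if __name__ == "__main__":
    html_page = """
    <div class="nav"><p>banner, ads, and other clutter</p></div>
    <button aria-label="Add to cart">Add to cart</button>
    <a href="/checkout">Checkout</a>
    """
    prompt = agent_step(contextualize(html_page), "buy the item")
    print(prompt)
```

Because the contextualizer and the agent communicate only through this plain-text observation, either side can be swapped independently, which is the plug-and-play property the paper reports across closed- and open-source LLMs.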