🤖 AI Summary
LLM-based agents frequently fail on real-world web navigation tasks because they struggle to parse complex, noisy HTML structures. To address this, the authors propose LCoW, a framework that decouples *web contextualization* from *decision making* via a lightweight, trainable contextualization module. This module transforms raw HTML into semantically enriched, structurally simplified observations that preserve essential layout and functional cues, while high-level reasoning is delegated to an off-the-shelf LLM agent. The module is trained separately from the agent and is plug-and-play compatible with diverse open- and closed-source LLMs. Experiments show that LCoW raises task success rates on WorkArena by an average of 15.6% for closed-source LLMs and 23.7% for open-source LLMs. On WebShop, a Gemini-1.5-flash agent with LCoW achieves state-of-the-art performance, marking the first instance of an automated agent surpassing human experts on that benchmark.
📝 Abstract
Recent advances in large language models (LLMs) have led to growing interest in developing LLM-based agents for automating web tasks. However, these agents often struggle with even simple tasks on real-world websites due to their limited capability to understand and process complex web page structures. In this work, we introduce LCoW, a framework for Learning language models to Contextualize complex Web pages into a more comprehensible form, thereby enhancing decision making by LLM agents. LCoW decouples web page understanding from decision making by training a separate contextualization module that transforms complex web pages into a comprehensible format, which is then utilized by the decision-making agent. We demonstrate that our contextualization module integrates effectively with LLM agents of various scales to significantly enhance their decision-making capabilities in web automation tasks. Notably, LCoW improves the success rates of closed-source LLMs (e.g., Gemini-1.5-flash, GPT-4o, Claude-3.5-Sonnet) by an average of 15.6%, and yields a 23.7% average improvement in success rates for open-source LMs (e.g., Llama-3.1-8B, Llama-3.1-70B) on the WorkArena benchmark. Moreover, the Gemini-1.5-flash agent with LCoW achieves state-of-the-art results on the WebShop benchmark, outperforming human experts. The relevant code materials are available at our project page: https://lcowiclr2025.github.io.
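To make the decoupled design concrete, here is a minimal sketch of the interface the abstract describes: a contextualization step that condenses raw HTML into a compact observation, followed by a separate decision-making step. This is an illustrative toy only; the names `contextualize` and `agent_step` are hypothetical, and in LCoW the contextualizer is a fine-tuned language model rather than the rule-based HTML filter used here, while `agent_step` would be a call to an off-the-shelf LLM.

```python
from html.parser import HTMLParser


class ToyContextualizer(HTMLParser):
    """Rule-based stand-in for LCoW's trained contextualization module.

    It reduces raw HTML to a numbered list of interactive elements,
    discarding surrounding clutter. The real module is a fine-tuned LM;
    this only illustrates the decoupled interface."""

    INTERACTIVE = {"a", "button", "select"}

    def __init__(self):
        super().__init__()
        self.elements = []      # simplified observation lines
        self._capture = None    # (tag, label) currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.INTERACTIVE:
            attrs = dict(attrs)
            label = attrs.get("aria-label") or ""
            self._capture = [tag, label]

    def handle_data(self, data):
        # Use inner text as the label if no aria-label was present.
        if self._capture and data.strip() and not self._capture[1]:
            self._capture[1] = data.strip()

    def handle_endtag(self, tag):
        if self._capture and tag == self._capture[0]:
            idx = len(self.elements)
            self.elements.append(f"[{idx}] <{tag}> {self._capture[1]}")
            self._capture = None


def contextualize(raw_html: str) -> str:
    """Contextualization step: raw HTML -> compact observation."""
    parser = ToyContextualizer()
    parser.feed(raw_html)
    return "\n".join(parser.elements)


def agent_step(observation: str, goal: str) -> str:
    """Decision-making step: stand-in for any off-the-shelf LLM agent.

    In LCoW this would prompt e.g. GPT-4o or Gemini-1.5-flash with the
    simplified observation; here we just assemble the prompt."""
    return f"GOAL: {goal}\nOBSERVATION:\n{observation}"


if __name__ == "__main__":
    html_page = """
    <div class="nav"><p>banner, ads, and other clutter</p></div>
    <button aria-label="Add to cart">Add to cart</button>
    <a href="/checkout">Checkout</a>
    """
    prompt = agent_step(contextualize(html_page), "buy the item")
    print(prompt)
```

Because the contextualizer and the agent communicate only through this plain-text observation, either side can be swapped independently, which is the plug-and-play property the paper reports across closed- and open-source LLMs.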