WALT: Web Agents that Learn Tools

📅 2025-10-01

🤖 AI Summary

Current web agents rely on low-level UI interactions and extensive stepwise LLM reasoning, which makes them fragile under dynamic webpage layouts and on long-horizon tasks. To address this, WALT takes a website-centric reverse-engineering approach that automatically abstracts a site's latent semantic operations (such as search, filtering, sorting, and posting) into reusable, function-level tools. This lets web agents move from pixel- or DOM-element-level manipulation to high-level functional orchestration. The approach combines LLM-driven intent understanding, structural DOM analysis, and a tool discovery mechanism to identify and invoke website functionality without manual annotation. On the VisualWebArena and WebArena benchmarks, the method improves task success rate while reducing execution steps by 57% and LLM calls by 63%, pointing to gains in efficiency, robustness, and cross-site generalization.

📝 Abstract

Web agents promise to automate complex browser tasks, but current methods remain brittle -- relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites -- spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.
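
To make the tool abstraction concrete, here is a minimal Python sketch of how a low-level click-and-type sequence could be wrapped behind a single search(query) call. It is an illustration under stated assumptions, not WALT's implementation: FakeBrowser, Tool, ToolRegistry, make_search_tool, and the CSS selectors are all hypothetical names introduced here, and the browser only records actions instead of driving a real page.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a browser driver; it only logs low-level actions."""
    log: List[str] = field(default_factory=list)

    def click(self, selector: str) -> None:
        self.log.append(f"click {selector}")

    def type(self, selector: str, text: str) -> None:
        self.log.append(f"type {selector} {text!r}")


@dataclass
class Tool:
    """A named, reusable operation that hides its UI steps from the agent."""
    name: str
    run: Callable[..., None]


def make_search_tool(browser: FakeBrowser) -> Tool:
    # The selectors below are assumptions for illustration; a real site
    # would need its own, discovered per website.
    def search(query: str) -> None:
        browser.click("#search-box")
        browser.type("#search-box", query)
        browser.click("#search-submit")

    return Tool(name="search", run=search)


class ToolRegistry:
    """Maps tool names to tools so an agent can invoke them by name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def names(self) -> List[str]:
        return sorted(self._tools)

    def call(self, name: str, **kwargs) -> None:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        self._tools[name].run(**kwargs)


# Usage: the agent issues one high-level call; the three UI steps stay hidden.
browser = FakeBrowser()
registry = ToolRegistry()
registry.register(make_search_tool(browser))
registry.call("search", query="red bicycle")
print(browser.log)
```

In this sketch the agent decides only which tool to call and with what arguments, so a layout change would be absorbed by updating the tool body instead of the agent's reasoning at each step.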

Problem

Research questions and friction points this paper is trying to address.

Automating complex browser tasks with robust tools built on website-provided functionality

Reducing reliance on brittle step-by-step UI interactions and LLM reasoning

Reverse-engineering latent website operations into reusable automation tools

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reverse-engineers website functionality into reusable tools

Abstracts low-level execution through tool invocation

Shifts burden from step-by-step reasoning to tool calls
