🤖 AI Summary
This work addresses the “grounding gap” in skill representations that hampers autonomous web agents on long-horizon tasks: textual skills cannot be executed directly, while code-based skills are executable but opaque to the agent, which impedes error recovery and adaptation. To bridge this gap, the authors propose WebXSkill, a framework in which each skill pairs a parameterized action program with step-level natural language guidance, so that skills are both executable and interpretable. The framework has three stages: skill extraction, which mines reusable action subsequences from synthetic agent trajectories and abstracts them into parameterized skills; skill organization, which indexes skills in a URL-based graph for context-aware retrieval; and skill deployment, which offers a grounded mode for automated multi-step execution and a guided mode in which skills serve as step-by-step instructions. On the WebArena and WebVoyager benchmarks, WebXSkill improves task success rate by up to 9.8 and 12.9 points over the baseline, respectively.
📝 Abstract
Autonomous web agents powered by large language models (LLMs) have shown promise in completing complex browser tasks, yet they still struggle with long-horizon workflows. A key bottleneck is the grounding gap in existing skill formulations: textual workflow skills provide natural language guidance but cannot be directly executed, while code-based skills are executable but opaque to the agent, offering no step-level understanding for error recovery or adaptation. We introduce WebXSkill, a framework that bridges this gap with executable skills, each pairing a parameterized action program with step-level natural language guidance, enabling both direct execution and agent-driven adaptation. WebXSkill operates in three stages. Skill extraction mines reusable action subsequences from readily available synthetic agent trajectories and abstracts them into parameterized skills. Skill organization indexes skills into a URL-based graph for context-aware retrieval. Skill deployment exposes two complementary modes: a grounded mode for fully automated multi-step execution, and a guided mode in which skills serve as step-by-step instructions that the agent follows with its native planning. On WebArena and WebVoyager, WebXSkill improves task success rate by up to 9.8 and 12.9 points over the baseline, respectively, demonstrating the effectiveness of executable skills for web agents. The code is publicly available at https://github.com/aiming-lab/WebXSkill.
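To make the skill formulation described in the abstract more concrete, the following is a minimal illustrative sketch and not the authors' implementation. All names (`Step`, `Skill`, `UrlSkillIndex`, `grounded`, `guided`, the example URL and skill) are hypothetical. It assumes a skill is a list of steps, each carrying both a parameterized action and a natural language instruction, and it substitutes a simple URL-prefix lookup for the URL-based graph, whose actual structure the abstract does not specify.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Step:
    action: str         # hypothetical action name, e.g. "click" or "type"
    target: str         # hypothetical element description
    value: str = ""     # may contain {param} placeholders
    guidance: str = ""  # step-level natural language instruction


@dataclass
class Skill:
    name: str
    params: list[str]
    steps: list[Step]

    def _check(self, kwargs: dict) -> None:
        missing = [p for p in self.params if p not in kwargs]
        if missing:
            raise ValueError(f"missing parameters: {missing}")

    def grounded(self, **kwargs) -> list[tuple[str, str, str]]:
        """Grounded mode (assumed): bind parameters, return concrete actions."""
        self._check(kwargs)
        return [(s.action, s.target, s.value.format(**kwargs)) for s in self.steps]

    def guided(self, **kwargs) -> list[str]:
        """Guided mode (assumed): return numbered instructions for the agent."""
        self._check(kwargs)
        return [
            f"{i}. {s.guidance.format(**kwargs)}"
            for i, s in enumerate(self.steps, start=1)
        ]


class UrlSkillIndex:
    """Stand-in for the URL-based graph: maps URL prefixes to skills."""

    def __init__(self) -> None:
        self._by_prefix: dict[str, list[Skill]] = {}

    @staticmethod
    def _key(url: str) -> str:
        parsed = urlparse(url)
        return parsed.netloc + parsed.path.rstrip("/")

    def add(self, url_prefix: str, skill: Skill) -> None:
        self._by_prefix.setdefault(self._key(url_prefix), []).append(skill)

    def retrieve(self, url: str) -> list[Skill]:
        """Return skills registered at the page's URL or any parent path."""
        key = self._key(url)
        hits: list[Skill] = []
        for prefix, skills in self._by_prefix.items():
            if key == prefix or key.startswith(prefix + "/"):
                hits.extend(skills)
        return hits


# Hypothetical skill and site, for illustration only.
search = Skill(
    name="search_product",
    params=["query"],
    steps=[
        Step("type", "search box", "{query}", "Type '{query}' into the search box."),
        Step("click", "search button", "", "Click the search button to submit."),
    ],
)

index = UrlSkillIndex()
index.add("https://shop.example.com/catalog", search)

skills = index.retrieve("https://shop.example.com/catalog/shoes?page=2")
actions = skills[0].grounded(query="red sneakers")
hints = skills[0].guided(query="red sneakers")
```

In this sketch the same `Skill` object backs both modes: `grounded` yields an action list that an executor could run without further planning, while `guided` yields text the agent can read and adapt when a step fails. That shared representation is the property the abstract attributes to executable skills; everything else here is an assumption.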