DocOS: Towards Proactive Document-Guided Actions in GUI Agents

📅 2026-05-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current GUI agents handle long-tail tasks poorly because they rely on static parametric knowledge and fall back on inefficient trial-and-error exploration when explicit procedural knowledge is missing. This work proposes a "proactive document-guided action" paradigm that brings the human habit of consulting documentation into GUI agents: the agent autonomously retrieves operational guides from the open web, interprets the instructions, and translates them into precise interface actions. To support this, the work describes an end-to-end setup that combines web navigation, document retrieval, language understanding, and action generation, and introduces DocOS, which it presents as the first interactive benchmark for evaluating these capabilities. Experiments expose two bottlenecks in existing agents, locating relevant information and executing retrieved instructions faithfully, and point to documentation guidance as a key step toward self-evolving GUI agents.
📝 Abstract
While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.
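
The search, comprehend, and ground stages described in the abstract can be illustrated with a toy sketch. This is not the DocOS implementation or the authors' method: the in-memory `TOY_WEB` dictionary, the word-overlap retrieval, the `Action` type, and the "Click"/"Type" step format are all assumptions made up for illustration. A real agent would drive a live browser and use a model to interpret the documentation.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for the open web: page title -> numbered how-to text.
TOY_WEB = {
    "Enable dark mode": "1. Click Settings\n2. Click Appearance\n3. Click Dark mode",
    "Export a report": "1. Click File\n2. Click Export\n3. Type report.pdf",
}


@dataclass
class Action:
    kind: str    # "click" or "type"
    target: str  # widget label to click, or text to enter


def tokens(text):
    """Lowercased alphabetic words of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search_docs(task):
    """Proactive search (toy): pick the page whose title overlaps most with the task."""
    best_title, best_score = None, 0
    for title in TOY_WEB:
        score = len(tokens(task) & tokens(title))
        if score > best_score:
            best_title, best_score = title, score
    # No overlapping page means retrieval failed, the first bottleneck in the abstract.
    return TOY_WEB[best_title] if best_title else None


def ground_steps(doc):
    """Grounding (toy): turn each numbered 'Click X' / 'Type X' line into an Action."""
    actions = []
    for line in doc.splitlines():
        match = re.match(r"\d+\.\s+(Click|Type)\s+(.+)", line)
        if match:
            actions.append(Action(match.group(1).lower(), match.group(2).strip()))
    return actions


def run_task(task):
    """Search for documentation, then ground it; return [] if nothing was found."""
    doc = search_docs(task)
    return ground_steps(doc) if doc is not None else []


if __name__ == "__main__":
    for action in run_task("enable dark mode"):
        print(action.kind, action.target)
```

The two functions mirror the two bottlenecks named in the abstract: `search_docs` can fail to find a relevant page, and `ground_steps` can fail to map a retrieved instruction onto a concrete action.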
Problem

Research questions and friction points this paper is trying to address.

GUI agents
long-tailed tasks
procedural knowledge
document-guided action
open-web environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proactive Document-Guided Action
GUI Agents
DocOS
Long-tailed Tasks
Document-grounded Interaction