🤖 AI Summary
Ambiguous user instructions in GUI automation frequently lead to task failures. Method: This paper proposes a self-correction GUI navigation paradigm with interactive information completion, wherein an agent proactively poses in-situ follow-up questions during execution to clarify user intent. Contribution/Results: We formally define the "Self-Correction GUI Navigation" task, introduce Navi-plus, the first navigation-oriented dataset featuring interface-aware question-answer pairs, and design a Dual-Stream Trajectory Evaluation framework that jointly models visual states, action sequences, and multi-turn dialogues. Experiments demonstrate that agents equipped with follow-up questioning restore task success rates under ambiguous instructions to levels comparable with those achieved under unambiguous instructions, significantly outperforming conventional one-shot execution paradigms. This work establishes a novel human-AI collaborative decision-making framework for GUI agents.
📝 Abstract
Graphical user interface (GUI) automation agents are emerging as powerful tools, enabling humans to accomplish increasingly complex tasks on smart devices. However, users often inadvertently omit key information when conveying tasks, which hinders agent performance under the current agent paradigm, which does not support immediate user intervention. To address this issue, we introduce a **Self-Correction GUI Navigation** task that incorporates interactive information completion capabilities within GUI agents. We developed the **Navi-plus** dataset with GUI follow-up question-answer pairs, alongside a **Dual-Stream Trajectory Evaluation** method to benchmark this new capability. Our results show that agents equipped with the ability to ask GUI follow-up questions can fully recover their performance when faced with ambiguous user tasks.