🤖 AI Summary
GUI agents exhibit limited robustness when they encounter novel UI elements, long-horizon operations, and personalized interaction paths. To address this, we propose Instruction Agent, an expert-demonstration-driven framework that extracts step-by-step instructions from a single demonstration and executes them while strictly following the trajectory the user intended. Our method integrates a verifier and a backtracker that check the outcome of each action, detect execution deviations (such as unexpected pop-ups) in real time, and recover the intended trajectory. It combines action trajectory tracking, multi-granularity validation, and conditional backtracking to improve reliability on complex GUI tasks. On a set of OSWorld tasks that all top-ranked agents failed to complete, our approach achieves a 60% success rate, including several high-difficulty, cross-application, long-duration tasks.
📝 Abstract
Graphical user interface (GUI) agents have advanced rapidly but still struggle with complex tasks involving novel UI elements, long-horizon actions, and personalized trajectories. In this work, we introduce Instruction Agent, a GUI agent that leverages expert demonstrations to solve such tasks, enabling completion of otherwise difficult workflows. Given a single demonstration, the agent extracts step-by-step instructions and executes them by strictly following the trajectory intended by the user, which avoids mistakes during execution. The agent further leverages verifier and backtracker modules to improve robustness. Both modules are critical for understanding the outcome of each action and for handling unexpected interruptions (such as pop-up windows) during execution. Our experiments show that Instruction Agent achieves a 60% success rate on a set of OSWorld tasks that all top-ranked agents failed to complete. Instruction Agent offers a practical and extensible framework, bridging the gap between current GUI agents and reliable real-world GUI task automation.
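The execute, verify, and backtrack loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `Step`, `ToyEnv`, `verify`, `backtrack`, and `run` are hypothetical names, the "screen" is reduced to a plain string, and the pop-up is simulated, purely to show how a verifier and a backtracker might interact while the agent follows demonstrated steps in order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    instruction: str  # instruction extracted from the demonstration (assumed format)
    expected: str     # hypothetical label for the screen state the step should produce


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a GUI; the screen is just a string."""
    screen: str = "start"
    popup_at: int = 1          # step index at which a pop-up interrupts, once
    popup_shown: bool = False
    log: List[str] = field(default_factory=list)

    def act(self, index: int, step: Step) -> None:
        self.log.append(step.instruction)
        if index == self.popup_at and not self.popup_shown:
            # Simulated interruption: the action lands on a pop-up instead.
            self.popup_shown = True
            self.screen = "popup"
        else:
            self.screen = step.expected

    def dismiss_popup(self) -> None:
        self.log.append("dismiss popup")
        self.screen = "recovered"


def verify(env: ToyEnv, step: Step) -> bool:
    # Verifier (assumed form): compare the observed state with the expected one.
    return env.screen == step.expected


def backtrack(env: ToyEnv) -> None:
    # Backtracker (assumed form): undo the interruption so the step can be retried.
    if env.screen == "popup":
        env.dismiss_popup()


def run(steps: List[Step], env: ToyEnv, max_retries: int = 2) -> bool:
    """Follow the demonstrated steps in order, retrying a step after recovery."""
    for index, step in enumerate(steps):
        for _ in range(max_retries + 1):
            env.act(index, step)
            if verify(env, step):
                break
            backtrack(env)
        else:
            return False  # step never verified within the retry budget
    return True
```

Calling `run` on a three-step list with the default `ToyEnv` retries the interrupted second step after the pop-up is dismissed; with `max_retries=0` the same run reports failure, which is the situation the verifier and backtracker are meant to prevent.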