Instruction Agent: Enhancing Agent with Expert Demonstration

📅 2025-09-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

GUI agents exhibit limited robustness when they encounter novel UI elements, long-horizon operations, and personalized interaction paths. To address this, we propose Instruction Agent, an expert-demonstration-driven framework that extracts stepwise instructions from a single demonstration and executes them along the trajectory the user intended. A verifier and a backtracker check the outcome of each action, detect execution deviations (such as unexpected pop-up windows), and recover the intended trajectory. Combining trajectory following, outcome verification, and backtracking is intended to improve reliability on complex GUI tasks. On a set of OSWorld tasks that all top-ranked agents failed to complete, our approach achieves a 60% success rate.

📝 Abstract

Graphical user interface (GUI) agents have advanced rapidly but still struggle with complex tasks involving novel UI elements, long-horizon actions, and personalized trajectories. In this work, we introduce Instruction Agent, a GUI agent that leverages expert demonstrations to solve such tasks, enabling completion of otherwise difficult workflows. Given a single demonstration, the agent extracts step-by-step instructions and executes them by strictly following the trajectory intended by the user, which helps it avoid mistakes during execution. The agent further leverages verifier and backtracker modules to improve robustness. Both modules are critical for understanding the outcome of each action and for handling unexpected interruptions (such as pop-up windows) during execution. Our experiments show that Instruction Agent achieves a 60% success rate on a set of tasks in OSWorld that all top-ranked agents failed to complete. Instruction Agent offers a practical and extensible framework, bridging the gap between current GUI agents and reliable real-world GUI task automation.
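The execute, verify, and backtrack cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: `run_instructions`, `ToyGUI`, the `max_retries` parameter, and the rule that a pop-up appears once on "click Save" are all hypothetical, chosen only to show how a verifier and a backtracker might wrap each step.

```python
from typing import Callable, Dict, List, Tuple


def run_instructions(
    steps: List[str],
    act: Callable[[str], None],
    observe: Callable[[], Dict],
    verify: Callable[[str, Dict], bool],
    recover: Callable[[Dict], None],
    max_retries: int = 2,
) -> Tuple[bool, List[Tuple[int, int]]]:
    """Run steps in order; after each action, verify the outcome and
    recover (backtrack) before retrying when verification fails."""
    log = []  # (step index, number of retries that step needed)
    for i, step in enumerate(steps):
        for attempt in range(max_retries + 1):
            act(step)
            state = observe()
            if verify(step, state):
                log.append((i, attempt))
                break
            recover(state)  # e.g. dismiss an unexpected pop-up
        else:
            return False, log  # retries exhausted for this step
    return True, log


class ToyGUI:
    """Hypothetical stand-in for a real GUI: a pop-up interrupts the
    first attempt at 'click Save' and swallows that action."""

    def __init__(self) -> None:
        self.done: List[str] = []
        self.popup = False
        self._popup_shown = False

    def act(self, instruction: str) -> None:
        if instruction == "click Save" and not self._popup_shown:
            self.popup = True
            self._popup_shown = True
            return
        if not self.popup:
            self.done.append(instruction)

    def observe(self) -> Dict:
        return {"popup": self.popup, "last": self.done[-1] if self.done else None}

    def dismiss(self, state: Dict) -> None:
        self.popup = False


def step_succeeded(step: str, state: Dict) -> bool:
    # Toy verifier: no pop-up is open and the step was actually applied.
    return not state["popup"] and state["last"] == step


steps = ["open File menu", "click Save", "type report.txt"]
gui = ToyGUI()
ok, log = run_instructions(steps, gui.act, gui.observe, step_succeeded, gui.dismiss)
print(ok, log)
```

In this toy run the second step needs one retry because the simulated pop-up must be dismissed first. In a real agent the verifier and backtracker would presumably inspect screenshots or accessibility data rather than a dictionary; the source does not specify those details.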

Problem

Research questions and friction points this paper is trying to address.

GUI agents struggle with complex tasks and novel elements

Difficult workflows are hard to complete without guidance such as expert demonstrations

Unexpected interruptions, such as pop-up windows, undermine execution robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages expert demonstrations for GUI tasks

Uses verifier and backtracker for robustness

Extracts step-by-step instructions from demonstrations
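To make the instruction-extraction idea concrete, here is a small Python sketch that turns a recorded demonstration into step-by-step text instructions. The event format (`type`, `target`, `char`) and the function `extract_instructions` are assumptions made for illustration. The summary does not say how the paper performs extraction, and this rule-based version is only a stand-in for whatever method the authors use.

```python
from typing import Dict, List


def extract_instructions(events: List[Dict]) -> List[str]:
    """Convert a hypothetical log of demonstration events into ordered
    natural-language steps, merging consecutive key presses."""
    steps: List[str] = []
    typed: List[str] = []

    def flush_typed() -> None:
        # Emit one "Type" step for a run of consecutive key events.
        if typed:
            text = "".join(typed)
            steps.append(f'Type "{text}"')
            typed.clear()

    for ev in events:
        if ev["type"] == "key":
            typed.append(ev["char"])
        elif ev["type"] == "click":
            flush_typed()
            target = ev["target"]
            steps.append(f'Click "{target}"')
    flush_typed()
    return steps


demo = [
    {"type": "click", "target": "File"},
    {"type": "click", "target": "Save As"},
    {"type": "key", "char": "a"},
    {"type": "key", "char": "b"},
    {"type": "click", "target": "OK"},
]
instructions = extract_instructions(demo)
for number, text in enumerate(instructions, start=1):
    print(number, text)
```

Each resulting instruction could then be executed one at a time, keeping the agent on the trajectory shown in the demonstration.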

🔎 Similar Papers

Mutual Enhancement of Large Language and Reinforcement Learning Models through Bi-Directional Feedback Mechanisms: A Case Study