🤖 AI Summary
To address challenges in PC GUI automation—including inaccurate screen perception, strong cross-application workflow coupling, and difficulty decomposing long-horizon instructions—this paper proposes a hierarchical multi-agent collaboration framework. Methodologically: (1) it introduces an Active Perception module to enhance fine-grained screenshot understanding; (2) it establishes a three-level decision architecture (instruction → subtask → action) to decouple complex workflows; and (3) it incorporates a Reflection agent for error-driven dynamic correction. The framework comprises four specialized agents—Manager, Progress, Decision, and Reflection—and is evaluated on PC-Eval, a new benchmark comprising 25 realistic, complex GUI instructions. Experiments demonstrate a 32-percentage-point improvement in task success rate over state-of-the-art methods, significantly advancing the practical deployment of multimodal large language models (MLLMs) in PC GUI agents.
📝 Abstract
In the field of MLLM-based GUI agents, the PC scenario not only features a more complex interactive environment than smartphones, but also involves more intricate intra- and inter-app workflows. To address these issues, we propose a hierarchical agent framework named PC-Agent. From the perception perspective, we devise an Active Perception Module (APM) to overcome the limited ability of current MLLMs to perceive screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture that decomposes the decision-making process into Instruction-Subtask-Action levels. Within this architecture, three agents (i.e., Manager, Progress and Decision) handle instruction decomposition, progress tracking and step-by-step decision-making, respectively. Additionally, a Reflection agent provides timely bottom-up error feedback and adjustment. We also introduce PC-Eval, a new benchmark of 25 real-world complex instructions. Empirical results on PC-Eval show that PC-Agent achieves a 32% absolute improvement in task success rate over previous state-of-the-art methods. The code will be publicly available.
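The Instruction-Subtask-Action hierarchy described above can be sketched as a simple control loop. This is a minimal illustrative sketch, not the paper's actual implementation: all class and method names (`Manager.decompose`, `Decision.next_action`, etc.) are assumptions, and where the real agents would each prompt an MLLM, the stubs below use trivial string logic purely to show how the four roles interact.

```python
# Hypothetical sketch of PC-Agent's hierarchical collaboration.
# Assumption: each agent role is a class whose MLLM call is replaced
# by a trivial placeholder; names are illustrative, not the paper's API.
from dataclasses import dataclass, field


@dataclass
class Manager:
    """Instruction level: decompose a user instruction into subtasks."""
    def decompose(self, instruction: str) -> list[str]:
        # A real Manager would prompt an MLLM; here we split on ';'.
        return [s.strip() for s in instruction.split(";") if s.strip()]


@dataclass
class Progress:
    """Subtask level: track which subtasks have completed."""
    done: list[str] = field(default_factory=list)

    def mark_done(self, subtask: str) -> None:
        self.done.append(subtask)


@dataclass
class Decision:
    """Action level: choose the next GUI action for the current subtask."""
    def next_action(self, subtask: str) -> str:
        # A real Decision agent would ground the action in a screenshot
        # (with APM-enhanced perception); here we emit a labeled stub.
        return f"action: {subtask}"


@dataclass
class Reflection:
    """Bottom-up check: flag failed actions so upper levels can adjust."""
    def ok(self, action: str) -> bool:
        return "fail" not in action


def run(instruction: str) -> list[str]:
    manager, progress = Manager(), Progress()
    decision, reflection = Decision(), Reflection()
    trace: list[str] = []
    for subtask in manager.decompose(instruction):
        action = decision.next_action(subtask)
        if reflection.ok(action):       # error feedback gate
            progress.mark_done(subtask)  # progress tracking
            trace.append(action)
    return trace
```

Running `run("open browser; search weather")` walks both subtasks through the Decision and Reflection agents and records them in the Progress agent, mirroring the top-down decomposition and bottom-up feedback flow of the framework.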