PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address challenges in PC GUI automation—including inaccurate screen perception, strong cross-application workflow coupling, and difficulty decomposing long-horizon instructions—this paper proposes a hierarchical multi-agent collaboration framework. Methodologically: (1) it introduces an Active Perception module to enhance fine-grained screenshot understanding; (2) it establishes a three-level decision architecture (instruction → subtask → action) to decouple complex workflows; and (3) it incorporates a Reflection agent for error-driven dynamic correction. The framework comprises four specialized agents—Manager, Progress, Decision, and Reflection—and is evaluated on PC-Eval, a new benchmark comprising 25 realistic, complex GUI instructions. Experiments demonstrate a 32-percentage-point improvement in task success rate over state-of-the-art methods, significantly advancing the practical deployment of multimodal large language models (MLLMs) in PC GUI agents.

📝 Abstract
In the field of MLLM-based GUI agents, compared to smartphones, the PC scenario not only features a more complex interactive environment, but also involves more intricate intra- and inter-app workflows. To address these issues, we propose a hierarchical agent framework named PC-Agent. Specifically, from the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture that decomposes decision-making processes into Instruction-Subtask-Action levels. Within this architecture, three agents (i.e., Manager, Progress and Decision) are set up for instruction decomposition, progress tracking and step-by-step decision-making respectively. Additionally, a Reflection agent is adopted to enable timely bottom-up error feedback and adjustment. We also introduce a new benchmark PC-Eval with 25 real-world complex instructions. Empirical results on PC-Eval show that our PC-Agent achieves a 32% absolute improvement of task success rate over previous state-of-the-art methods. The code will be publicly available.
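The Instruction-Subtask-Action hierarchy described in the abstract can be sketched as a simple control loop. This is a toy illustration only: in the paper each agent is backed by an MLLM with screenshot input, whereas here all agent "reasoning" is stubbed with trivial rules, and names such as `run_instruction` are invented for illustration.

```python
# Toy sketch of PC-Agent's four-agent hierarchy (Manager, Progress,
# Decision, Reflection). All agent logic is stubbed; in the real system
# each agent queries a multimodal LLM over screenshots.
from dataclasses import dataclass, field


@dataclass
class Manager:
    """Decomposes a user instruction into ordered subtasks (stubbed)."""
    def decompose(self, instruction: str) -> list[str]:
        # A real Manager would prompt an MLLM; here we split on ", then ".
        return [s.strip() for s in instruction.split(", then ")]


@dataclass
class Progress:
    """Tracks which subtasks have been completed."""
    done: list[str] = field(default_factory=list)

    def mark_done(self, subtask: str) -> None:
        self.done.append(subtask)


@dataclass
class Decision:
    """Chooses the next low-level GUI action for a subtask (stubbed)."""
    def next_action(self, subtask: str) -> str:
        return f"click/type for: {subtask}"


@dataclass
class Reflection:
    """Checks an executed action, enabling bottom-up error feedback."""
    def check(self, action: str) -> bool:
        return True  # pretend every action succeeds in this toy run


def run_instruction(instruction: str) -> list[str]:
    manager, progress = Manager(), Progress()
    decision, reflection = Decision(), Reflection()
    trace = []
    for subtask in manager.decompose(instruction):
        action = decision.next_action(subtask)
        trace.append(action)
        if reflection.check(action):  # on failure, a real agent would retry
            progress.mark_done(subtask)
    return trace


trace = run_instruction("open Chrome, then search for PC-Agent, then save the file")
print(trace)
```

The point of the decomposition is that each level sees only its own scope: the Manager never reasons about pixel-level clicks, and the Decision agent never re-plans the whole instruction, while the Reflection agent feeds errors back up the chain.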
Problem

Research questions and friction points this paper is trying to address.

Current MLLMs perceive fine-grained screenshot content inaccurately
Intra- and inter-app PC workflows are tightly coupled and hard to manage
Long-horizon user instructions are difficult to decompose into executable steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical multi-agent collaboration
Active Perception Module
Instruction-Subtask-Action decomposition
Haowei Liu
TongYi Lab, Alibaba Group
Multimodal Learning
Xi Zhang
Alibaba Group
Haiyang Xu
Alibaba Group
Yuyang Wanyan
MAIS, Institute of Automation, Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China
Junyang Wang
Beijing Jiaotong University
Ming Yan
Alibaba Group
Ji Zhang
Alibaba Group
Chunfeng Yuan
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Computer Vision · Pattern Recognition · Machine Learning · Human Action Recognition · Sparse Representation
Changsheng Xu
Professor, Institute of Automation, Chinese Academy of Sciences
Multimedia · Computer Vision
Weiming Hu
MAIS, Institute of Automation, Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China; School of Information Science and Technology, ShanghaiTech University, China
Fei Huang
Alibaba Group