UFO2: The Desktop AgentOS

📅 2025-04-20
🤖 AI Summary
Current computer-using agents (CUAs) suffer from shallow OS integration, fragile screenshot-based perception, and disruptive execution that interferes with user activity, all of which hinder real-world deployment. This paper introduces UFO2, a multi-agent AgentOS for Windows built on a HostAgent–AppAgent collaborative architecture. It fuses UI Automation (UIA) with vision-based parsing for hybrid control detection, uses a Picture-in-Picture (PiP) virtual desktop so the agent can run concurrently with the user without interference, and applies speculative multi-action planning to reduce per-step LLM overhead. By unifying GUI and API actions in a single action layer driven by multimodal LLMs, UFO2 supports natural-language-driven automation across more than 20 real-world Windows applications. The evaluation reports substantial improvements in robustness and execution accuracy over prior CUAs, moving them from proof-of-concept prototypes toward practical deployment.
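The HostAgent–AppAgent split described in the summary can be pictured with a small routing sketch. This is an illustration only, not UFO2's code: the class fields, the `toy_planner` function, and the `app: instruction then app: instruction` request format are all hypothetical stand-ins for what would be LLM-driven decomposition and real application control.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubTask:
    """Hypothetical unit of work handed from the host to one application agent."""
    app: str          # target application, e.g. "excel"
    instruction: str  # natural-language instruction for that application


@dataclass
class AppAgent:
    """Hypothetical application-specialized agent; it only records what it was asked."""
    app: str
    log: List[str] = field(default_factory=list)

    def run(self, subtask: SubTask) -> str:
        # A real agent would query an LLM and drive the application here.
        self.log.append(subtask.instruction)
        return f"[{self.app}] done: {subtask.instruction}"


class HostAgent:
    """Decomposes a request and routes each subtask to a per-application agent."""

    def __init__(self, planner: Callable[[str], List[SubTask]]):
        self.planner = planner
        self.agents: Dict[str, AppAgent] = {}

    def execute(self, request: str) -> List[str]:
        results = []
        for subtask in self.planner(request):
            # Create the agent for an application on first use, then reuse it.
            agent = self.agents.setdefault(subtask.app, AppAgent(subtask.app))
            results.append(agent.run(subtask))
        return results


def toy_planner(request: str) -> List[SubTask]:
    """Stand-in for LLM decomposition: parses 'app: instruction then app: instruction'."""
    subtasks = []
    for part in request.split(" then "):
        app, instruction = part.split(":", 1)
        subtasks.append(SubTask(app.strip().lower(), instruction.strip()))
    return subtasks
```

Calling `HostAgent(toy_planner).execute("Excel: export table then Outlook: email it")` routes one subtask to each of two agents, which mirrors the centralized-coordinator idea while keeping per-application logic isolated.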

📝 Abstract
Recent Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot-based interaction, and disruptive execution. We present UFO2, a multiagent AgentOS for Windows desktops that elevates CUAs into practical, system-level automation. UFO2 features a centralized HostAgent for task decomposition and coordination, alongside a collection of application-specialized AppAgents equipped with native APIs, domain-specific knowledge, and a unified GUI-API action layer. This architecture enables robust task execution while preserving modularity and extensibility. A hybrid control detection pipeline fuses Windows UI Automation (UIA) with vision-based parsing to support diverse interface styles. Runtime efficiency is further enhanced through speculative multi-action planning, which reduces per-step LLM overhead. Finally, a Picture-in-Picture (PiP) interface enables automation within an isolated virtual desktop, allowing agents and users to operate concurrently without interference. We evaluate UFO2 across more than 20 real-world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration unlocks a scalable path toward reliable, user-aligned desktop automation.
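To make the idea of speculative multi-action planning concrete, here is a minimal sketch under stated assumptions. The abstract says only that several actions are planned per LLM call to cut per-step overhead; the validation rule used here (check that the target control is still visible before each action, and re-plan if not) and every name in the code (`run_speculative`, `plan_batch`, `visible_controls`, `apply`) are assumptions for illustration, not the paper's interface.

```python
from typing import Callable, List, Set, Tuple

Action = Tuple[str, str]  # (control name, operation), a hypothetical action format


def run_speculative(
    plan_batch: Callable[[Set[str]], List[Action]],
    visible_controls: Callable[[], Set[str]],
    apply: Callable[[Action], None],
    max_llm_calls: int = 5,
) -> int:
    """Execute planned batches of actions and return how many planner calls were made.

    plan_batch stands in for one LLM invocation that proposes several actions at once.
    """
    calls = 0
    while calls < max_llm_calls:
        batch = plan_batch(visible_controls())
        calls += 1
        if not batch:
            break  # the planner reports that nothing is left to do
        for action in batch:
            # Validate each speculated action against the live UI state.
            if action[0] not in visible_controls():
                break  # the UI diverged from the plan: drop the rest and re-plan
            apply(action)
    return calls
```

With a step-by-step agent, a three-action task would need three planner calls; in this sketch it needs one call for the batch plus one more to confirm completion, provided the UI evolves as predicted.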
Problem

Research questions and friction points this paper is trying to address.

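Existing CUAs integrate only shallowly with the OS, which limits desktop workflow automation
Screenshot-based interaction is fragile; a unified GUI-API action layer is meant to overcome it
Agent execution disrupts the user; an isolated virtual desktop lets user and agent operate concurrently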
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiagent AgentOS with centralized HostAgent and AppAgents
Hybrid control fusing UIA and vision-based parsing
Picture-in-Picture interface for concurrent user-agent operation
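One plausible way to fuse UIA results with vision-based detections is sketched below. It assumes that controls are axis-aligned boxes and that a vision detection is redundant when it overlaps a UIA control by intersection-over-union (IoU) at or above a threshold; the 0.5 threshold and the function names are assumptions for illustration rather than details taken from the abstract.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def fuse_controls(uia: List[Box], vision: List[Box], threshold: float = 0.5) -> List[Box]:
    """Keep every UIA control, then add vision detections that UIA did not already cover."""
    fused = list(uia)
    for box in vision:
        if all(iou(box, known) < threshold for known in uia):
            fused.append(box)
    return fused
```

In this sketch the UIA controls take priority because they come with structured metadata from the OS, while vision detections fill in custom-drawn widgets that the accessibility tree misses.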
Chaoyun Zhang (Microsoft): GUI Agent, LLM, Causal Inference, AIOps, Spatio-temporal Modelling
He Huang (Microsoft)
Chiming Ni (ZJU-UIUC Institute)
Jian Mu (Nanjing University)
Si Qin (Microsoft): LLM Agent, Cloud Intelligence, AIOps, Array Signal Processing, Radar Signal Processing
Shilin He (Microsoft Research): LLM, Software Engineering, NLP
Lu Wang (Microsoft)
Fangkai Yang (Microsoft)
Pu Zhao (Microsoft)
Chao Du (Microsoft)
Liqun Li (Microsoft)
Yu Kang (Microsoft)
Zhao Jiang (Stony Brook University): Stroke, MRI
Suzhen Zheng (Microsoft)
Rujia Wang (Microsoft)
Jiaxu Qian (Peking University)
Minghua Ma (Microsoft): AIOps, Cloud Intelligence
Jian-Guang Lou (Microsoft)
Qingwei Lin (Microsoft)
Saravan Rajmohan (Microsoft)
Dongmei Zhang (Microsoft Research): Software Engineering, Machine Learning, Information Visualization