OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This survey covers "OS Agents": (multimodal) large language model-based agents that use computing devices such as computers and mobile phones by operating within the environments and interfaces (for example, GUIs) that operating systems provide. It lays out the fundamentals of OS Agents (environment, observation space, and action space) and their core capabilities of understanding, planning, and grounding. It then reviews construction methods, focusing on domain-specific foundation models and agent frameworks, along with evaluation protocols and benchmarks. It closes with open challenges and future directions, including safety and privacy, personalization, and self-evolution. The authors maintain an open-source GitHub repository as a dynamic resource, and a 9-page version of the work was accepted by ACL 2025.

📝 Abstract
The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have significantly advanced. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview to the domain.
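
The components named in the abstract (environment, observation space, action space) and the capabilities of planning and grounding can be illustrated with a toy loop. This is only a minimal sketch and is not taken from the paper: `ToyEnvironment`, `Element`, `Action`, `plan`, `ground`, and `run_agent` are all hypothetical names, and the fixed plan stands in for what an (M)LLM would generate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A GUI element as it might appear in the observation space (hypothetical)."""
    label: str
    x: int
    y: int


@dataclass(frozen=True)
class Action:
    """One entry in a toy action space: click at (x, y) or type text."""
    kind: str  # "click" or "type"
    x: int = 0
    y: int = 0
    text: str = ""


class ToyEnvironment:
    """Hypothetical stand-in for an OS GUI: a search box and a button."""

    def __init__(self):
        self.elements = [Element("search_box", 100, 20), Element("go_button", 200, 20)]
        self.focused = None
        self.typed = ""
        self.submitted = None

    def observe(self):
        # Observation: the visible elements (a real agent might receive a screenshot).
        return list(self.elements)

    def step(self, action):
        if action.kind == "click":
            hit = next((e for e in self.elements if (e.x, e.y) == (action.x, action.y)), None)
            self.focused = hit.label if hit else None
            if self.focused == "go_button":
                self.submitted = self.typed
        elif action.kind == "type" and self.focused == "search_box":
            self.typed += action.text


def plan(query):
    # Planning: a fixed plan standing in for model-generated steps (assumption).
    return [("click", "search_box"), ("type", query), ("click", "go_button")]


def ground(step, observation):
    # Grounding: map a symbolic step onto a concrete action with coordinates.
    kind, arg = step
    if kind == "type":
        return Action("type", text=arg)
    target = next(e for e in observation if e.label == arg)
    return Action("click", x=target.x, y=target.y)


def run_agent(env, query):
    for step in plan(query):
        observation = env.observe()  # understanding of the screen would happen here
        env.step(ground(step, observation))
    return env.submitted
```

Calling `run_agent(ToyEnvironment(), "weather")` walks the three planned steps and returns the text submitted through the toy search box. Separating `plan` from `ground` mirrors the abstract's distinction between deciding what to do and locating where on the interface to do it.
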
Problem

Research questions and friction points this paper is trying to address.

Surveying MLLM-based agents for OS task automation
Exploring agent components like understanding and planning
Addressing challenges in safety, privacy, and personalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-based agents automate OS tasks
Focus on GUI interaction understanding
Domain-specific models enhance agent capabilities
Authors
Xueyu Hu
Zhejiang University
Tao Xiong
Zhejiang University
Biao Yi
Nankai University
LLM Security, Trustworthy LLM, Steganography
Zishu Wei
Zhejiang University
Ruixuan Xiao
Zhejiang University
Machine Learning, Natural Language Processing, LLM
Yurun Chen
Master Student of Science, Tsinghua University
3D vision
Jiasheng Ye
Fudan University
Large Language Models, Generative Models, AI Scientists
Meiling Tao
University of Electronic Science and Technology of China
NLP, LLM
Xiangxin Zhou
Unknown affiliation
Ziyu Zhao
University of South Carolina
Computer vision, 2D/3D segmentation, Generative 3D reconstruction
Yuhuai Li
Zhejiang University
Shengze Xu
The Chinese University of Hong Kong
Shenzhi Wang
Tsinghua University
Xinchen Xu
Zhejiang University
Shuofei Qiao
Zhejiang University
AI Agent, Large Language Models, Natural Language Processing, Knowledge Graphs
Zhaokai Wang
Shanghai Jiao Tong University; Shanghai AI Laboratory
Computer Vision, AI Music, MLLMs
Kun Kuang
Zhejiang University
Causal Inference, Data Mining, Machine Learning
Tieyong Zeng
Professor, Director of CMAI, Department of Mathematics, The Chinese University of Hong Kong
Data science
Liang Wang
University of Chinese Academy of Sciences, Institute of Automation, Chinese Academy of Sciences
Jiwei Li
Zhejiang University
Yuchen Eleanor Jiang
OPPO
natural language processing, machine learning
Wangchunshu Zhou
OPPO & M-A-P
artificial general intelligence, language agents, large language models, natural language processing
Guoyin Wang
01.AI
Keting Yin
Zhejiang University
Zhou Zhao
Zhejiang University
Machine Learning, Data Mining, Multimedia Computing
Hongxia Yang
Professor, HK Polytechnic University
Machine Learning, Generative AI, Cognitive Intelligence, Statistical Modeling
Fan Wu
Shanghai Jiao Tong University
Shengyu Zhang
Zhejiang University
Fei Wu
Zhejiang University