OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

📅 2025-08-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper surveys “OS Agents”: (multimodal) large language model ((M)LLM)-based agents that use computing devices such as computers and mobile phones to automate tasks, operating within the environments and interfaces (e.g., GUIs) that operating systems provide. It covers (1) the fundamentals of OS Agents, namely their key components (environment, observation space, action space) and essential capabilities (understanding, planning, grounding); (2) methodologies for constructing OS Agents, with a focus on domain-specific foundation models and agent frameworks; (3) a review of the evaluation protocols and benchmarks used to assess OS Agents across tasks; and (4) open challenges and future directions, including safety and privacy, personalization, and self-evolution. The authors maintain an open-source GitHub repository as a dynamic resource, and a 9-page version of the survey was accepted by ACL 2025.

📝 Abstract

The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality: (M)LLM-based agents that use computing devices (e.g., computers and mobile phones) to automate tasks, operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS), have advanced significantly. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization, and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview of the domain.
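
To make the abstract's vocabulary concrete, the following is a minimal sketch of how an environment, an observation space, an action space, and the planning and grounding capabilities might fit together in one loop. It is not taken from the paper: every name here (`ToyEnvironment`, `Element`, `Action`, `plan`, `ground`, `run_agent`) is hypothetical, the "OS" is a two-element toy GUI, and the planner is hard-coded where a real agent would presumably query an (M)LLM.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Element:
    """One GUI element in an observation (hypothetical schema)."""
    name: str
    center: Tuple[int, int]


@dataclass(frozen=True)
class Action:
    """One entry of a toy action space: click at a point, or type text."""
    kind: str  # "click" or "type"
    point: Optional[Tuple[int, int]] = None
    text: str = ""


@dataclass
class ToyEnvironment:
    """Stand-in for an OS environment; not a real device interface."""
    elements: List[Element]
    focused: Optional[str] = None
    typed: Dict[str, str] = field(default_factory=dict)
    clicked: List[str] = field(default_factory=list)

    def observe(self) -> List[Element]:
        # Observation space: here, a flat list of visible elements.
        return list(self.elements)

    def step(self, action: Action) -> None:
        # Apply an action; a click focuses the element at that point.
        if action.kind == "click":
            for el in self.elements:
                if el.center == action.point:
                    self.focused = el.name
                    self.clicked.append(el.name)
        elif action.kind == "type" and self.focused is not None:
            self.typed[self.focused] = action.text


def plan(goal: str) -> List[Tuple[str, str, str]]:
    """Planning: split the goal into (kind, target element, text) steps.

    Hard-coded for illustration; an assumption, not the paper's method.
    """
    return [
        ("click", "search_box", ""),
        ("type", "search_box", goal),
        ("click", "search_button", ""),
    ]


def ground(step: Tuple[str, str, str], observation: List[Element]) -> Action:
    """Grounding: map a symbolic step to a concrete action on the screen."""
    kind, target, text = step
    if kind == "type":
        return Action("type", text=text)
    for el in observation:
        if el.name == target:
            return Action("click", point=el.center)
    raise LookupError(f"element {target!r} not found in observation")


def run_agent(goal: str, env: ToyEnvironment) -> ToyEnvironment:
    """Observe, ground the next planned step, act; repeat until done."""
    for step in plan(goal):
        env.step(ground(step, env.observe()))
    return env


if __name__ == "__main__":
    env = ToyEnvironment(
        [Element("search_box", (100, 20)), Element("search_button", (200, 20))]
    )
    run_agent("weather", env)
    print(env.typed, env.clicked)
```

Keeping `plan` and `ground` as separate functions mirrors the abstract's distinction between deciding what to do and locating where on the interface to do it; in this sketch either one could be swapped for a model call without changing the loop.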

Problem

Research questions and friction points this paper is trying to address.

Surveying MLLM-based agents for OS task automation

Exploring agent components like understanding and planning

Addressing challenges in safety, privacy, and personalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-based agents automate OS tasks

Focus on GUI interaction understanding

Domain-specific models enhance agent capabilities

🔎 Similar Papers

Systematic Categorization, Construction and Evaluation of New Attacks against Multi-modal Mobile GUI Agents