🤖 AI Summary
This paper addresses three key challenges in cross-platform GUI automation: poor generalization, data scarcity, and difficulty in policy optimization. To tackle them, we propose a self-evolving GUI trajectory generation framework and a trajectory-aware relative policy optimization algorithm, integrating UI element localization, action semantic modeling, and planning-based reasoning to construct an end-to-end cloud-based virtual testing environment. Our approach supports multi-agent collaboration, asynchronous reinforcement learning, and scalable, self-evolving data generation. The resulting GUI-Owl-7B agent achieves 66.4 and 29.4 points on the AndroidWorld and OSWorld benchmarks, respectively; its enhanced variant, Mobile-Agent-v3, further attains 73.3 and 37.7, setting new open-source state-of-the-art records for GUI agents. Key contributions are: (1) the first unified, self-evolving data generation paradigm for both desktop and mobile GUIs; and (2) a trajectory-aware RL mechanism that jointly optimizes action precision and long-horizon planning.
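The self-evolving loop summarized above (automated query generation, agent rollout, correctness validation, and retraining on validated trajectories) can be sketched as follows. This is an illustrative skeleton, not the paper's implementation: every function here (`generate_queries`, `rollout`, `validate_trajectory`, `finetune`) is a hypothetical stub standing in for the corresponding pipeline stage.

```python
import random

random.seed(0)  # deterministic demo

def generate_queries(n):
    """Stub: automated query generation for the virtual environment."""
    return [f"task-{i}" for i in range(n)]

def rollout(agent, query):
    """Stub: the agent interacts with the environment and records its steps."""
    steps = [(query, f"action-{s}") for s in range(3)]
    success = agent["skill"] > random.random()  # toy success model
    return {"query": query, "steps": steps, "success": success}

def validate_trajectory(traj):
    """Stub: correctness validation (e.g. a judge model or rule checker)."""
    return traj["success"]

def finetune(agent, trajectories):
    """Stub: training on validated data slightly improves the agent."""
    agent["skill"] = min(1.0, agent["skill"] + 0.05 * len(trajectories))
    return agent

def self_evolve(agent, iterations=5, queries_per_iter=10):
    dataset = []
    for _ in range(iterations):
        trajs = [rollout(agent, q) for q in generate_queries(queries_per_iter)]
        good = [t for t in trajs if validate_trajectory(t)]
        dataset.extend(good)            # only validated trajectories are kept
        agent = finetune(agent, good)   # self-improving loop
    return agent, dataset

agent, data = self_evolve({"skill": 0.3})
```

The point of the sketch is the loop structure: validation filters the agent's own rollouts into training data, so data quality and agent capability improve together without manual annotation.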
📄 Abstract
This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.
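As a rough illustration of the trajectory-aware, group-relative optimization idea behind TRPO, the snippet below computes a relative advantage per sampled trajectory and broadcasts it to every step of that trajectory for the policy update. The paper's actual objective may differ; this minimal sketch only shows the group-normalization-and-broadcast pattern, with all names and the example data being illustrative.

```python
import statistics

def trajectory_relative_advantages(group_rewards):
    """Normalize trajectory-level rewards within one sampled group."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in group_rewards]

def per_step_advantages(trajectories, rewards):
    """Broadcast each trajectory's relative advantage to all of its steps."""
    advs = trajectory_relative_advantages(rewards)
    return [[a] * len(traj) for traj, a in zip(trajectories, advs)]

# Toy example: three sampled trajectories for one task, with terminal rewards.
trajs = [["click", "type", "submit"], ["click", "submit"], ["scroll"] * 4]
rewards = [1.0, 0.0, 0.0]
step_advs = per_step_advantages(trajs, rewards)
```

Because the advantage is computed relative to the sampled group and assigned at trajectory granularity, successful long-horizon plans are reinforced as a whole rather than step by step, which matches the long-horizon planning emphasis described above.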