OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution

📅 2026-01-28
🤖 AI Summary
This work addresses the challenge of building general-purpose GUI agents that autonomously perform real-world tasks on both mobile and desktop platforms, a goal that depends on high-quality interaction data and effective training methods. For data, the authors combine rigorously curated open-source datasets with an automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to produce high-fidelity synthetic data. For training, they adopt a decoupled two-stage paradigm: supervised fine-tuning (SFT) to establish basic interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. The model is built on a Mixture-of-Experts (MoE) backbone to balance computational efficiency with reasoning capacity. It reports a state-of-the-art 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. On the newly introduced OS-Nav suite it reaches 74.24% step success on ChiM-Nav and 55.9% average success on Ubu-Nav, indicating that a single model can perform competitively in both phone-use and computer-use settings.

📝 Abstract
Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer-use and phone-use scenarios. Building an effective GUI agent model relies on two factors: (1) high-quality data and (2) effective training methods. To address these, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we leverage rigorously curated open-source datasets and introduce a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to create high-fidelity synthetic data. For training, to better leverage these data, we adopt a two-stage strategy: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems: ChiM-Nav, targeting Chinese Android mobile environments, and Ubu-Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching 74.24% step success on ChiM-Nav and 55.9% average success on Ubu-Nav.
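The abstract names Group Relative Policy Optimization (GRPO) as the second training stage but does not spell out its mechanics. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with GRPO: several responses are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the reward values, the use of the population standard deviation, and the `eps` term are all assumptions made for this example, not details taken from the OmegaUse report.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per sampled response for the
    same prompt (hypothetical values; not from the paper).
    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled actions on one GUI step,
# e.g. 1.0 if the predicted click lands inside the target element.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate learned value model is needed to provide a baseline.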
Problem

Research questions and friction points this paper is trying to address.

GUI agent
autonomous task execution
cross-platform interaction
human-computer interaction
general-purpose agent
Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI Agent
Synthetic Data Generation
Mixture-of-Experts (MoE)
Two-stage Training
Cross-platform Benchmarking
Authors

Le Zhang, Baidu Research (Data Mining)
Yixiong Xiao, Baidu Frontier Research Department
Xinjiang Lu, Baidu Frontier Research Department
Jingjia Cao, Baidu Frontier Research Department
Yusai Zhao, Baidu Frontier Research Department
Jingbo Zhou, Baidu Frontier Research Department
Lang An, Baidu Frontier Research Department
Zikan Feng, Baidu Frontier Research Department
Wanxiang Sha, Baidu Frontier Research Department
Yu Shi, Baidu Frontier Research Department
Congxi Xiao, USTC
Jian Xiong, School of Business Administration, Southwestern University of Finance and Economics (Multi-objective evolutionary optimization, Machine learning, Data Mining, Decision support systems, Project planning and schedul)
Yankai Zhang, Baidu Frontier Research Department
Hua Wu, Baidu Frontier Research Department
Haifeng Wang, Baidu (NLP, MT, Search, Speech, Data Mining)