Mano Report

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
GUI automation remains challenging due to visual complexity, dynamic environments, and the need for multi-step reasoning. Existing vision-language model (VLM) approaches suffer from low-resolution input, domain shift, and weak sequential decision-making. This paper proposes Mano, a GUI agent built on a multimodal foundation model, featuring a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning) alongside a verification-driven error-recovery mechanism. The authors further construct a high-fidelity simulation environment to generate high-quality interactive data. The approach integrates multimodal pretraining, offline and online reinforcement learning, cross-domain transfer, and interpretable action modeling. On the Mind2Web and OSWorld benchmarks, the method achieves task success rates of 82.4% and 76.9%, respectively, surpassing prior state-of-the-art methods, and is presented as the first unified framework for robust, recoverable, multi-step GUI automation.
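The verification-driven error-recovery loop described above can be sketched as follows. This is a minimal toy sketch, not the paper's actual interface: `ToyEnv`, `run_episode`, the `"FORWARD"`/`"RECOVER"` actions, and the verifier signature are all illustrative assumptions.

```python
class ToyEnv:
    """Minimal stand-in environment: a counter that must reach 3."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        if action == "FORWARD":
            self.pos += 1
        # "RECOVER" is a no-op here; a real agent might undo or re-plan.
        return "DONE" if self.pos >= 3 else self.pos


def run_episode(policy, verifier, env, max_steps=20):
    """Execute policy-proposed actions; when the verifier rejects a
    transition, issue a recovery action instead of blindly continuing."""
    state = env.reset()
    history = []
    for _ in range(max_steps):
        if state == "DONE":
            break
        action = policy(state, history)
        next_state = env.step(action)
        if verifier(state, action, next_state):
            history.append(action)  # accepted step
            state = next_state
        else:
            state = env.step("RECOVER")  # rejected: recover, then retry
            history.append("RECOVER")
    return history
```

With an always-accepting verifier and a policy that always moves forward, the loop terminates once the toy environment signals completion.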

📝 Abstract
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
Problem

Research questions and friction points this paper is trying to address.

Automating complex GUI interactions with visual elements and dynamic environments
Addressing limitations of vision-language models in resolution and sequential reasoning
Improving GUI agent performance through multi-modal foundation models and reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal foundation model pre-trained on extensive web and computer system data
Three-stage training pipeline combining supervised fine-tuning with offline and online reinforcement learning
Simulated environment for high-fidelity data generation
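The three training stages listed above can be sketched as a pipeline over a toy tabular policy. This is a hedged illustration only: the stage functions, the dict-based "model", and the reward shapes are placeholder assumptions, not Mano's implementation.

```python
def supervised_finetune(model, demos):
    """Stage 1: imitate expert GUI trajectories (state -> action)."""
    for state, action in demos:
        model[state] = action
    return model


def offline_rl(model, logged, reward_fn):
    """Stage 2: refine on fixed logged trajectories, keeping only
    actions whose reward beats the model's current choice."""
    for state, action in logged:
        if reward_fn(state, action) > reward_fn(state, model.get(state)):
            model[state] = action
    return model


def online_rl(model, env_rollout, reward_fn, rounds=3):
    """Stage 3: roll out in a (simulated) environment and update
    the policy from freshly collected transitions."""
    for _ in range(rounds):
        for state, action in env_rollout(model):
            if reward_fn(state, action) > reward_fn(state, model.get(state)):
                model[state] = action
    return model
```

The ordering mirrors the abstract: imitation first establishes broad coverage, offline RL corrects it against logged rewards without environment risk, and online RL adapts it through live interaction with the simulated environment.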
👥 Authors
Tianyu Fu (Ph.D. at Tsinghua University; efficient AI, LLMs, sparse computation)
Anyang Su (DeepMiner-Mano Team, Mininglamp Technology)
Chenxu Zhao (DeepMiner-Mano Team, Mininglamp Technology)
Hanning Wang (DeepMiner-Mano Team, Mininglamp Technology)
Minghui Wu (Zhejiang University City College; Mobile Computing, Big Data, Machine Learning, Software Engineering)
Zhe Yu (DeepMiner-Mano Team, Mininglamp Technology)
Fei Hu (DeepMiner-Mano Team, Mininglamp Technology)
Mingjia Shi (Learning Theory, Data Science, Resource Preserving)
Wei Dong (DeepMiner-Mano Team, Mininglamp Technology)
Jiayao Wang (DeepMiner-Mano Team, Mininglamp Technology)
Yuyang Chen (DeepMiner-Mano Team, Mininglamp Technology)
Ruiyang Yu (DeepMiner-Mano Team, Mininglamp Technology)
Siran Peng (CASIA; Computer Vision, Image Fusion, Deepfake Detection)
Menglin Li (DeepMiner-Mano Team, Mininglamp Technology)
Nan Huang (DeepMiner-Mano Team, Mininglamp Technology)
Haitian Wei (DeepMiner-Mano Team, Mininglamp Technology)
Jiawei Yu (Xiamen University; Speech, Natural Language Processing)
Yi Xin (California Institute of Technology; Industrial Organization, Econometrics)
Xilin Zhao (DeepMiner-Mano Team, Mininglamp Technology)
Kai Gu (DeepMiner-Mano Team, Mininglamp Technology)
Ping Jiang (DeepMiner-Mano Team, Mininglamp Technology)
Sifan Zhou (Southeast University; Robotics, M/LLMs, Spatial AI, Quantization)
Shuo Wang (DeepMiner-Mano Team, Mininglamp Technology)