GAIR: GUI Automation via Information-Joint Reasoning and Group Reflection

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: GUI automation tasks are highly diverse, so a single model would need heterogeneous capabilities and expertise across many domains. Method: This paper proposes a multi-model collaborative agent framework that decouples these capabilities by combining a general-purpose multimodal large language model (MLLM) with multiple GUI-specialized models. It introduces a two-phase collaboration paradigm, “information-joint reasoning” followed by “group reflection”: during reasoning, the general-purpose model fuses the GUI state perception and action planning produced by the specialized models; when that information is insufficient for a decision, it triggers instruction-driven group reflection, guiding each specialized model to acquire complementary, high-value information. Contribution/Results: Experiments demonstrate significant improvements in task completion rate and action accuracy across multiple GUI benchmarks. The framework exhibits superior robustness compared to single-model baselines and conventional ensemble methods.

📝 Abstract
Building AI systems for GUI automation tasks has attracted remarkable research effort, with MLLMs leveraged to process user requirements and generate operations. However, GUI automation covers a wide range of tasks, from document processing to online shopping, and from CAD to video editing. This diversity requires MLLMs for GUI automation to have heterogeneous capabilities and master multidimensional expertise, which makes such a model difficult to construct. To address this challenge, we propose GAIR: GUI Automation via Information-Joint Reasoning and Group Reflection, a novel MLLM-based GUI automation agent framework that integrates knowledge and combines capabilities from heterogeneous models to build higher-performing GUI automation agent systems. Since different GUI-specific MLLMs are trained on different datasets and thus have different strengths, GAIR introduces a general-purpose MLLM that jointly processes the information from multiple GUI-specific models, further enhancing the performance of the agent framework. The general-purpose MLLM also serves as the decision maker, attempting to execute a reasonable operation based on the information gathered so far. When the general-purpose model judges that there is not enough information for a reasonable decision, GAIR transitions into a group reflection state, in which the general-purpose model gives the GUI-specific models different instructions and hints based on their strengths and weaknesses, driving them to gather more significant and accurate information that can support deeper reasoning and decision-making. We evaluate the effectiveness and reliability of GAIR through extensive experiments on GUI benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Integrates heterogeneous models for diverse GUI automation tasks
Enhances decision-making via joint reasoning and group reflection
Addresses insufficient information in MLLM-based GUI automation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates knowledge from heterogeneous GUI-specific MLLMs
Uses a general-purpose MLLM for joint reasoning and decision-making
Implements group reflection to gather more accurate information
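
The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `gair_step`, `Decision`, `Specialist`, and `Decider`, the text-only observations, and the `max_reflections` cap are all hypothetical, and the stub models stand in for real GUI-specific and general-purpose MLLMs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical interface: a GUI-specific model maps
# (screenshot, task, optional hint) to a textual observation.
Specialist = Callable[[str, str, Optional[str]], str]


@dataclass
class Decision:
    # Operation to execute, or None if the decider lacks information.
    action: Optional[str]
    # Per-specialist instructions for the next reflection round.
    hints: Dict[str, str] = field(default_factory=dict)


# Hypothetical interface: the general-purpose model maps
# (task, observations from all specialists) to a Decision.
Decider = Callable[[str, Dict[str, str]], Decision]


def gair_step(task: str, screenshot: str,
              specialists: Dict[str, Specialist],
              decide: Decider,
              max_reflections: int = 2) -> Optional[str]:
    """Run one decision step: joint reasoning, then reflection if needed."""
    hints: Dict[str, str] = {}
    for _ in range(max_reflections + 1):
        # Information-joint reasoning: query every specialist and
        # hand all observations to the general-purpose decider.
        observations = {
            name: model(screenshot, task, hints.get(name))
            for name, model in specialists.items()
        }
        decision = decide(task, observations)
        if decision.action is not None:
            return decision.action
        # Group reflection: re-query specialists with tailored hints.
        hints = decision.hints
    return None  # still undecided after the allowed reflection rounds


# --- Stub models, purely for demonstration ---
def grounder(screenshot: str, task: str, hint: Optional[str]) -> str:
    if hint is None:
        return "no matching element found"
    return "Save button at (120, 40)"


def planner(screenshot: str, task: str, hint: Optional[str]) -> str:
    return "next step: press Save"


def decide(task: str, observations: Dict[str, str]) -> Decision:
    if "at (120, 40)" in observations["grounder"]:
        return Decision(action="click(120, 40)")
    return Decision(action=None,
                    hints={"grounder": "look in the top toolbar"})


models = {"grounder": grounder, "planner": planner}
action = gair_step("save the document", "screenshot.png", models, decide)
print(action)
```

In this toy run the first round yields no usable location, so the decider issues a hint to the grounding stub only, and the second round produces an action. How GAIR actually phrases its hints, judges sufficiency, or bounds the number of reflection rounds is not specified in this summary.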
Authors

Zishu Wei
Zhejiang University
Qixiang Ma
Zhejiang University
Xavier Hu
Zhejiang University
Yuhang Liu
The University of Adelaide
Representation Learning, LLMs, Latent Variable Models, Responsible AI
Hui Zang
UC Davis, Sprint, Guavus Inc., Google
AI, networking
Yudong Zhao
Huawei Technologies Ltd.
Tao Wang
Huawei Technologies Ltd.
Shengyu Zhang
Zhejiang University
Fei Wu
Zhejiang University