UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

📅 2025-10-23
🤖 AI Summary
The core challenge in GUI grounding is that the diversity and quality of natural-language instructions strongly affect UI element localization accuracy, yet existing methods typically treat instructions as static intent proxies. To address this, we propose the "Instruction-as-Reasoning" paradigm, which models instructions as dynamic reasoning paths that guide the model to autonomously select and compose optimal chains of thought. Our approach employs a two-stage training strategy: supervised fine-tuning (SFT) on diverse synthetic instructions, followed by reinforcement learning (RL) to optimize path-level decision making. The resulting model, UI-Ins-32B, achieves state-of-the-art performance across five benchmarks, attaining 87.3% accuracy on UI-I2E-Bench. Its lightweight variant, UI-Ins-7B, achieves a 74.1% success rate on AndroidWorld, demonstrating substantial improvements in instruction-understanding robustness and cross-task generalization.

📝 Abstract
GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives, enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld with UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights, such as how reasoning can be formulated to enhance rather than hinder grounding performance and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released at https://github.com/alibaba/UI-Ins.
Problem

Research questions and friction points this paper is trying to address.

Improving GUI grounding accuracy for natural language instructions
Addressing instruction diversity flaws in existing grounding datasets
Enhancing reasoning through multi-perspective instruction pathways
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Instruction-as-Reasoning paradigm for GUI grounding
Uses two-stage training with SFT and reinforcement learning
Enables dynamic selection of optimal instruction pathways
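The pathway-selection idea above can be illustrated with a toy sketch (not the authors' code; the pathway strings and scores below are hypothetical): each instruction style is treated as a candidate reasoning pathway, and the grounding step follows whichever pathway the model scores highest.

```python
# Conceptual sketch of Instruction-as-Reasoning pathway selection.
# All pathway texts and scores here are invented for illustration;
# in the paper, a trained model produces and ranks such pathways.

def select_pathway(pathways, score_fn):
    """Return the candidate reasoning pathway with the highest score."""
    return max(pathways, key=score_fn)

# Three hypothetical perspectives describing the same target element
pathways = [
    "functional: the button that submits the login form",
    "visual: the blue rectangular button at the bottom of the card",
    "positional: the element directly below the password field",
]

# Dummy confidence scores standing in for model-assigned pathway scores
scores = {
    pathways[0]: 0.62,
    pathways[1]: 0.81,
    pathways[2]: 0.55,
}

chosen = select_pathway(pathways, scores.get)
print(chosen)  # the visual pathway wins under these toy scores
```

The point of the sketch is only the decision structure: grounding is conditioned on the best-scoring pathway rather than on a single fixed instruction.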
Liangyu Chen
Renmin University of China
Hanzhang Zhou
Nanyang Technological University
Large Language Models, Mechanistic Interpretability, Natural Language Processing
Chenglin Cai
Tongyi Lab, Alibaba Group
Jianan Zhang
Assistant Professor, Peking University
communication networks, optimization, networked intelligence
Panrong Tong
Tongyi Lab, Alibaba Group
Quyu Kong
Alibaba Cloud
Multimodal LLM, Information Diffusion Modeling, Machine Learning
Xu Zhang
Tongyi Lab, Alibaba Group
Chen Liu
Tongyi Lab, Alibaba Group
Yuqi Liu
CUHK
Wenxuan Wang
Renmin University of China
Yue Wang
Tongyi Lab, Alibaba Group
Qin Jin
School of Information, Renmin University of China
Artificial Intelligence
Steven Hoi
Tongyi Lab, Alibaba Group