What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing GUI reasoning methods, which rely on raw screen pixels for decision-making and often lack deep understanding and interpretability of UI elements, leading to task failures. To overcome these issues, we propose the UI-in-the-Loop (UILoop) paradigm, which models GUI reasoning as a closed-loop process involving screens, UI elements, and actions. By leveraging multimodal large language models, UILoop explicitly learns the spatial layout, semantic functions, and actionable properties of UI elements, enabling precise interaction and interpretable reasoning. We introduce the first UI element–centric UI Comprehension task, establish UI Comprehension-Bench—a benchmark comprising 26K samples—and define three evaluation metrics. Experimental results demonstrate that our approach significantly outperforms current state-of-the-art methods in GUI understanding and reasoning tasks.
📝 Abstract
Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To enhance the understanding and interaction with UIs, we propose an innovative GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen-UI elements-Action process. By enabling Multimodal Large Language Models (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements with three evaluation metrics. Correspondingly, we contribute a benchmark of 26K samples (UI Comprehension-Bench) to comprehensively evaluate existing methods' mastery of UI elements. Extensive experiments demonstrate that UILoop achieves state-of-the-art UI understanding performance while yielding superior results in GUI reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

GUI reasoning
UI understanding
multimodal reasoning
screen-to-action
UI elements
Innovation

Methods, ideas, or system contributions that make the work stand out.

UI-in-the-Loop
Multimodal Large Language Models
GUI Reasoning
UI Comprehension
Interpretable Reasoning
Songze Li
Zhejiang University, ZJU-Ant Group Joint Lab of Knowledge Graph
Xiaoke Guo
Zhejiang University, ZJU-Ant Group Joint Lab of Knowledge Graph
Tianqi Liu
Zhejiang University, ZJU-Ant Group Joint Lab of Knowledge Graph
Biao Yi
Nankai University
LLM Security, Trustworthy LLM, Steganography
Zhaoyan Gong
Zhejiang University, ZJU-Ant Group Joint Lab of Knowledge Graph
Zhiqiang Liu
Zhejiang University
Huajun Chen
Zhejiang University, ZJU-Ant Group Joint Lab of Knowledge Graph
Wen Zhang
Zhejiang University
Knowledge Graph, Representation Learning