What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing GUI reasoning methods, which rely on raw screen pixels for decision-making and often lack deep understanding and interpretability of UI elements, leading to task failures. To overcome these issues, we propose the UI-in-the-Loop (UILoop) paradigm, which models GUI reasoning as a closed-loop process involving screens, UI elements, and actions. By leveraging multimodal large language models, UILoop explicitly learns the spatial layout, semantic functions, and actionable properties of UI elements, enabling precise interaction and interpretable reasoning. We introduce the first UI element–centric UI Comprehension task, establish UI Comprehension-Bench—a benchmark comprising 26K samples—and define three evaluation metrics. Experimental results demonstrate that our approach significantly outperforms current state-of-the-art methods in GUI understanding and reasoning tasks.
📝 Abstract
Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To enhance the understanding and interaction with UIs, we propose an innovative GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen-UI elements-Action process. By enabling Multimodal Large Language Models (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements with three evaluation metrics. Correspondingly, we contribute a benchmark of 26K samples (UI Comprehension-Bench) to comprehensively evaluate existing methods' mastery of UI elements. Extensive experiments demonstrate that UILoop achieves state-of-the-art UI understanding performance while yielding superior results in GUI reasoning tasks.
Problem

Research questions and friction points this paper is trying to address.

GUI reasoning
UI understanding
multimodal reasoning
screen-to-action
UI elements
Innovation

Methods, ideas, or system contributions that make the work stand out.

UI-in-the-Loop
Multimodal Large Language Models
GUI Reasoning
UI Comprehension
Interpretable Reasoning
Songze Li
Zhejiang University, ZJU-Ant Group Joint Lab of Knowledge Graph
Xiaoke Guo
Zhejiang University, ZJU-Ant Group Joint Lab of Knowledge Graph
Tianqi Liu
Zhejiang University, ZJU-Ant Group Joint Lab of Knowledge Graph
Biao Yi
Nankai University
LLM Security, Trustworthy LLM, Steganography
Zhaoyan Gong
Zhejiang University, ZJU-Ant Group Joint Lab of Knowledge Graph
Zhiqiang Liu
Zhejiang University
Huajun Chen
Zhejiang University, ZJU-Ant Group Joint Lab of Knowledge Graph
Wen Zhang
Zhejiang University
Knowledge Graph, Representation Learning