🤖 AI Summary
Existing visual grounding methods primarily target natural images and generalize poorly to synthetic interfaces such as graphical user interfaces (GUIs), hindering the automation of AI agent interactions with desktop applications. This paper introduces Instruction Visual Grounding (IVG), the first task dedicated to localizing GUI elements in desktop screenshots via natural language instructions. Methodologically, the authors propose two approaches: a three-part pipeline that combines a Large Language Model (LLM) with an object detection model, and a multimodal foundation model. Evaluated on a GUI-specific benchmark, the approach substantially outperforms existing baselines in localization accuracy. This work provides technical support for AI agents that comprehend and autonomously operate real desktop environments, advancing automated software testing, accessibility services, and human–computer interaction.
📝 Abstract
Most instance perception and image understanding solutions focus mainly on natural images. Applications for synthetic images, and more specifically images of Graphical User Interfaces (GUIs), remain limited, which hinders the development of autonomous computer-vision-powered Artificial Intelligence (AI) agents. In this work, we present Instruction Visual Grounding (IVG), a multi-modal solution for object identification in a GUI: given a natural language instruction and a GUI screenshot, IVG locates the coordinates of the screen element on which the instruction would be executed. To this end, we develop two methods. The first is a three-part architecture that combines a Large Language Model (LLM) with an object detection model. The second uses a multi-modal foundation model.
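The IVG task interface described above can be sketched as follows. This is a minimal toy illustration of the detect-then-match idea behind the first method, not the paper's implementation: `UIElement` and the word-overlap matcher are assumptions standing in for the object detector's output and the LLM-based instruction interpretation.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    label: str                        # text/role predicted by an object detector (assumed)
    box: tuple[int, int, int, int]    # (x1, y1, x2, y2) pixel coordinates

def ground_instruction(instruction: str, elements: list[UIElement]) -> tuple[int, int]:
    """Toy IVG: pick the detected element whose label shares the most words
    with the instruction, and return the center of its box as the click point.
    (A real system would use an LLM / foundation model for this matching.)"""
    words = set(instruction.lower().split())
    best = max(elements, key=lambda e: len(words & set(e.label.lower().split())))
    x1, y1, x2, y2 = best.box
    return ((x1 + x2) // 2, (y1 + y2) // 2)

elements = [
    UIElement("File menu", (0, 0, 60, 20)),
    UIElement("Save button", (100, 0, 160, 20)),
]
print(ground_instruction("click the save button", elements))  # → (130, 10)
```

The returned point is where an autonomous agent would issue its click, which is exactly the output IVG is asked to produce.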