🤖 AI Summary
Existing visual grounding methods primarily target natural images and generalize poorly to synthetic interfaces such as graphical user interfaces (GUIs), hindering the automation of AI agent interactions with desktop applications. This paper introduces Instruction Visual Grounding (IVG), the first task dedicated to localizing GUI elements in desktop screenshots via natural language instructions. Methodologically, the authors propose two approaches: a three-part pipeline that combines a Large Language Model (LLM) with an object detection model, and a multimodal foundation model. Evaluated on a GUI-specific benchmark, the approach substantially outperforms existing baselines in localization accuracy. This work provides technical support for AI agents that comprehend and autonomously operate real desktop environments, advancing automated software testing, accessibility services, and human–computer interaction.
📝 Abstract
Most instance perception and image understanding solutions focus mainly on natural images. Applications for synthetic images, and more specifically images of Graphical User Interfaces (GUIs), remain limited, which hinders the development of autonomous computer-vision-powered Artificial Intelligence (AI) agents. In this work, we present Instruction Visual Grounding (IVG), a multi-modal solution for object identification in a GUI: given a natural language instruction and a GUI screenshot, IVG locates the coordinates of the screen element on which the instruction would be executed. To this end, we develop two methods. The first is a three-part architecture that combines a Large Language Model (LLM) with an object detection model. The second uses a multi-modal foundation model.
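The IVG task interface described above can be sketched as follows. This is a minimal toy illustration of the detect-then-match idea behind the first method, not the paper's implementation: `UIElement` and the word-overlap matcher are assumptions standing in for the object detector's output and the LLM-based instruction interpretation.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    label: str                        # text/role predicted by an object detector (assumed)
    box: tuple[int, int, int, int]    # (x1, y1, x2, y2) pixel coordinates

def ground_instruction(instruction: str, elements: list[UIElement]) -> tuple[int, int]:
    """Toy IVG: pick the detected element whose label shares the most words
    with the instruction, and return the center of its box as the click point.
    (A real system would use an LLM / foundation model for this matching.)"""
    words = set(instruction.lower().split())
    best = max(elements, key=lambda e: len(words & set(e.label.lower().split())))
    x1, y1, x2, y2 = best.box
    return ((x1 + x2) // 2, (y1 + y2) // 2)

elements = [
    UIElement("File menu", (0, 0, 60, 20)),
    UIElement("Save button", (100, 0, 160, 20)),
]
print(ground_instruction("click the save button", elements))  # → (130, 10)
```

The returned point is where an autonomous agent would issue its click, which is exactly the output IVG is asked to produce.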