See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

📅 2026-04-14

🤖 AI Summary

Existing single-step coordinate prediction methods have no error-correction mechanism, which makes pixel-precise localization difficult in high-density graphical user interfaces (GUIs). This work proposes a multi-turn, closed-loop grounding approach that pairs large language models (GPT-5.4, Claude, and Qwen) with a visual feedback loop, giving the agent a self-correcting "observe, click, refine" cycle. By adjusting the cursor position over successive turns, the method moves beyond the one-shot prediction paradigm. Evaluated on a suite of complex coding benchmarks, it improves both click precision and overall task success rate over single-shot models, making agents more robust and adaptable in dense, dynamic interfaces.

📝 Abstract

Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
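
The closed-loop idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: `Box`, `refine_cursor`, and `toy_model` are hypothetical names, and `toy_model` is a stand-in for a vision-language model that would look at a screenshot showing the current cursor and propose a pixel correction. Here it is assumed to cover only about half of the remaining offset per turn, which mimics an imperfect visual estimate.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Box:
    """Hypothetical target element in screen pixels (edges inclusive)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, p: Point) -> bool:
        return self.left <= p[0] <= self.right and self.top <= p[1] <= self.bottom

    def center(self) -> Point:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def refine_cursor(
    start: Point,
    propose_correction: Callable[[Point], Point],
    max_turns: int = 5,
) -> List[Point]:
    """Apply corrections turn by turn; return every cursor position visited.

    The loop stops early when the proposer returns (0, 0), which this sketch
    treats as the model judging that the cursor is already on target.
    """
    cursor = start
    trajectory = [cursor]
    for _ in range(max_turns):
        dx, dy = propose_correction(cursor)
        if (dx, dy) == (0, 0):
            break
        cursor = (cursor[0] + dx, cursor[1] + dy)
        trajectory.append(cursor)
    return trajectory


def _half_step(offset: int) -> int:
    """Move half of the offset, rounded away from zero so progress is made."""
    if offset == 0:
        return 0
    step = math.ceil(abs(offset) / 2)
    return step if offset > 0 else -step


def toy_model(target: Box) -> Callable[[Point], Point]:
    """Stand-in for a model call (assumption, not a real API)."""
    def propose(cursor: Point) -> Point:
        if target.contains(cursor):
            return (0, 0)  # cursor looks correct, no further correction
        cx, cy = target.center()
        return (_half_step(cx - cursor[0]), _half_step(cy - cursor[1]))
    return propose


# A small 11x7 px target, similar in scale to a dense IDE control.
target = Box(left=100, top=50, right=110, bottom=56)
trajectory = refine_cursor((140, 80), toy_model(target))
print(trajectory)
```

In this toy run the initial guess at (140, 80) misses the target, which is where a single-shot method would stop; three corrections later the cursor sits inside the box. The `max_turns` cap is a design choice of the sketch that bounds the number of model calls when the proposer never converges.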

Problem

Research questions and friction points this paper is trying to address.

GUI grounding

pixel-precise localization

dense coding interfaces

Computer Use Agents

visual feedback

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn refinement

GUI grounding

visual feedback