🤖 AI Summary
Existing single-shot coordinate prediction methods lack an error-correction mechanism, which makes pixel-precise localization difficult in high-density graphical user interfaces (GUIs). This work studies a multi-round, closed-loop localization approach that pairs large language models (such as GPT-5.4, Claude, and Qwen) with a visual feedback loop, forming a self-correcting “observe–click–refine” cycle. The agent adjusts the cursor position over successive rounds instead of committing to a one-shot prediction. Evaluated on a suite of complex coding benchmarks, the approach outperforms single-shot models in both click precision and overall task success rate, which suggests better agent robustness and adaptability in dense, dynamic interfaces.
📝 Abstract
Computer Use Agents (CUAs) rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions. Editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with closely packed IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of committing to a single prediction, our agent engages in an iterative refinement process, using visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
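The closed-loop refinement described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Box`, `make_simulated_model`, and `refine_cursor` are hypothetical names, and the simulated model is an assumed stand-in for a call that would send a fresh screenshot to a vision-language model and receive a correction. It deliberately undershoots (moving about half the remaining offset each round) so that one prediction misses a small target while several rounds reach it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Box:
    """Axis-aligned target region in screen pixels (inclusive bounds)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, p: Point) -> bool:
        return self.left <= p[0] <= self.right and self.top <= p[1] <= self.bottom

    def center(self) -> Point:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def make_simulated_model(target: Box) -> Callable[[Point], Point]:
    """Hypothetical stand-in for a screenshot-conditioned model call.

    Returns a function mapping the current cursor position to a (dx, dy)
    correction. (0, 0) means the model judges the cursor to be on target.
    """
    cx, cy = target.center()

    def half_step(delta: int) -> int:
        # Imperfect correction: cover about half the offset, at least 1 px.
        if delta == 0:
            return 0
        size = max(1, abs(delta) // 2)
        return size if delta > 0 else -size

    def model(cursor: Point) -> Point:
        if target.contains(cursor):
            return (0, 0)
        return (half_step(cx - cursor[0]), half_step(cy - cursor[1]))

    return model


def refine_cursor(
    initial: Point,
    model: Callable[[Point], Point],
    max_rounds: int = 8,
) -> Tuple[Point, List[Point]]:
    """Observe, move, and re-observe until the model stops correcting.

    Returns the final cursor position and the list of visited positions.
    """
    cursor = initial
    history = [cursor]
    for _ in range(max_rounds):
        dx, dy = model(cursor)      # observe: ask for a correction
        if (dx, dy) == (0, 0):      # model reports the cursor is on target
            break
        cursor = (cursor[0] + dx, cursor[1] + dy)  # refine the position
        history.append(cursor)
    return cursor, history
```

Calling `refine_cursor((180, 70), make_simulated_model(Box(200, 50, 203, 53)))` walks the cursor toward the 4x4 pixel target over several rounds, whereas `max_rounds=1` mimics a single corrective step and leaves the cursor outside it. In a real agent, the stopping decision and the correction would both come from the model's reading of the latest screenshot, so the round budget also bounds the cost of a model that never converges.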