Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback

📅 2025-12-01

🤖 AI Summary

GUI grounding maps natural language instructions to regions of a visual interface. Existing multimodal large language models still lose precision on small targets, visually similar regions, and layouts with ambiguous spatial relationships. To address this, we propose Chain-of-Ground (CoG), a training-free, multi-step framework that combines iterative visual reasoning with reference-based feedback: the model generates a hypothesis, reflects on it, and refines its prediction, which improves both accuracy and interpretability. We also introduce TPanel UI, a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking, for evaluating real-world generalization. CoG reaches 68.4% accuracy on ScreenSpot Pro (+4.8 percentage points) and improves on the Qwen3-VL-235B baseline by 6.9 points on TPanel UI.

📝 Abstract

GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and with ambiguity in real-world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain-of-Ground (CoG), a training-free, multi-step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of predicting directly, the model progressively reflects on and adjusts its hypotheses, leading to more accurate and interpretable localization. Our approach achieves 68.4% accuracy on the ScreenSpot Pro benchmark, an improvement of 4.8 points. To measure real-world generalization, we introduce TPanel UI, a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking. On TPanel UI, Chain-of-Ground improves over the strong baseline Qwen3-VL-235B by 6.9 points, showing the effectiveness of multi-step, training-free grounding across real-world and digital interfaces. These results highlight a direction for unlocking grounding potential through structured iterative refinement instead of additional training.
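The propose, reflect, and adjust loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `GroundFn` interface, the `chain_of_ground` function, the convergence tolerance, and the idea of passing the previous point back as the reference are all assumptions made for illustration, and `mock_ground` stands in for a real multimodal model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in screen pixels


@dataclass
class Step:
    point: Point     # predicted click location for this round
    rationale: str   # the model's stated reasoning, kept for interpretability


# Hypothetical model interface: (instruction, previous point or None) -> Step.
GroundFn = Callable[[str, Optional[Point]], Step]


def chain_of_ground(instruction: str, ground: GroundFn,
                    max_steps: int = 3, tol: float = 2.0) -> List[Step]:
    """Run up to max_steps grounding rounds, feeding each guess back as a reference.

    Stops early once two consecutive guesses agree within tol pixels on both axes.
    """
    history: List[Step] = []
    prev: Optional[Point] = None
    for _ in range(max_steps):
        step = ground(instruction, prev)
        history.append(step)
        if (prev is not None
                and abs(step.point[0] - prev[0]) <= tol
                and abs(step.point[1] - prev[1]) <= tol):
            break  # the hypothesis has stabilised
        prev = step.point
    return history


def mock_ground(instruction: str, prev: Optional[Point]) -> Step:
    """Stand-in for a model: a rough first guess, then a corrected one."""
    if prev is None:
        return Step((90.0, 40.0), "initial guess near the toolbar")
    return Step((100.0, 50.0), "reference mark is off target; moved to the icon")


steps = chain_of_ground("click the save icon", mock_ground, max_steps=5)
final_point = steps[-1].point
```

Here the loop stops after three rounds because the second and third guesses coincide. Keeping the full `history` rather than only the last point is what makes the intermediate reasoning inspectable.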

Problem

Research questions and friction points this paper is trying to address.

Grounding small or visually similar targets precisely

Resolving ambiguity in real-world interface layouts

Improving grounding without additional model training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative visual reasoning and refinement framework

Training-free multi-step grounding

Improves GUI localization accuracy on real-world interfaces
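To make the accuracy figures concrete, here is a small sketch of how grounding accuracy could be scored. The hit criterion (a prediction counts as correct when the predicted point falls inside the ground-truth bounding box) is an assumption; this page does not state the exact metric, and the function names and sample data are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]              # predicted (x, y)
Box = Tuple[float, float, float, float]  # ground truth (left, top, right, bottom)


def hit(point: Point, box: Box) -> bool:
    """True when the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predictions that land inside their target box."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(hit(p, b) for p, b in zip(points, boxes))
    return 100.0 * hits / len(points)


# Made-up sample: three of the four predictions fall inside their boxes.
sample_points = [(10.0, 10.0), (55.0, 20.0), (5.0, 90.0), (200.0, 200.0)]
sample_boxes = [(0.0, 0.0, 20.0, 20.0), (50.0, 10.0, 60.0, 30.0),
                (0.0, 80.0, 10.0, 100.0), (0.0, 0.0, 50.0, 50.0)]
sample_accuracy = accuracy(sample_points, sample_boxes)
```

Under this criterion, a gain such as the reported 4.8 or 6.9 points is a difference between two such percentages, not a relative improvement.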
