🤖 AI Summary
This work addresses the critical issue of high-risk errors in GUI grounding models when executing natural language instructions, stemming from the absence of reliable uncertainty estimation mechanisms. To mitigate this, the authors propose SafeGround, a novel framework that introduces uncertainty calibration with false discovery rate (FDR) control into GUI grounding for the first time. By integrating spatially aware uncertainty quantification and test-time threshold calibration, SafeGround enables risk-aware coordinate prediction and system-level risk control. Evaluated on the ScreenSpot-Pro benchmark, SafeGround substantially outperforms existing uncertainty estimation methods, achieving up to a 5.38 percentage point improvement in system-level accuracy.
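The "spatially aware uncertainty quantification" above can be illustrated with a minimal sketch: draw several stochastic coordinate predictions from the grounding model and score how scattered they are. The RMS-distance-to-centroid measure below is a plausible stand-in, not necessarily SafeGround's exact formula.

```python
import numpy as np

def spatial_dispersion(samples):
    """Uncertainty as the spatial dispersion of K stochastic (x, y) samples.

    Hypothetical measure: root-mean-square distance of the samples from
    their centroid. Tight clusters -> low uncertainty (the model agrees
    with itself); scattered samples -> high uncertainty.
    """
    pts = np.asarray(samples, dtype=float)        # shape (K, 2)
    centroid = pts.mean(axis=0)
    return float(np.sqrt(((pts - centroid) ** 2).sum(axis=1).mean()))

# Confident prediction: repeated sampling lands on nearly the same pixel.
tight = [(100, 200), (101, 199), (100, 201), (99, 200)]
# Uncertain prediction: samples scatter across the screen.
spread = [(100, 200), (400, 50), (250, 600), (10, 300)]
assert spatial_dispersion(tight) < spatial_dispersion(spread)
```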
📝 Abstract
Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. Nevertheless, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about model reliability. In this paper, we introduce SafeGround, an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through a calibration procedure performed before test time. SafeGround leverages a distribution-aware uncertainty quantification method to capture the spatial dispersion of stochastic samples drawn from the outputs of any given model. Then, through the calibration process, SafeGround derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control. We apply SafeGround to multiple GUI grounding models on the challenging ScreenSpot-Pro benchmark. Experimental results show that our uncertainty measure consistently outperforms existing baselines in distinguishing correct from incorrect predictions, while the calibrated threshold reliably enables rigorous risk control and the potential for substantial system-level accuracy improvements. Across multiple GUI grounding models, SafeGround improves system-level accuracy by up to 5.38 percentage points over Gemini-only inference.
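The calibration step described above can be sketched as follows: on a held-out calibration set of (uncertainty, correct) pairs, choose the largest uncertainty threshold whose empirical false discovery proportion stays at or below a target level α. This is a simplified illustration; a statistically guaranteed version (as the abstract claims) would add a finite-sample correction, e.g. in the style of Learn-then-Test or conformal risk control.

```python
def calibrate_threshold(uncertainties, correct, alpha=0.1):
    """Largest uncertainty threshold whose empirical false discovery
    proportion (share of accepted-but-wrong predictions) is <= alpha.

    At test time, a prediction is accepted iff its uncertainty is at
    or below the returned threshold; otherwise it is deferred (e.g.
    to a stronger fallback model or a human).
    """
    pairs = sorted(zip(uncertainties, correct))   # ascending uncertainty
    best, wrong = None, 0
    for i, (u, ok) in enumerate(pairs, start=1):
        wrong += 0 if ok else 1
        if wrong / i <= alpha:     # FDP among the i lowest-uncertainty items
            best = u
    return best

# Toy calibration set: the most uncertain prediction is the wrong one.
tau = calibrate_threshold([0.1, 0.2, 0.3, 0.9],
                          [True, True, True, False], alpha=0.2)
assert tau == 0.3  # accepting all four would give FDP 0.25 > alpha
```

The deferral behavior this enables (answer only when uncertainty is below the calibrated threshold, otherwise fall back) is what drives the system-level accuracy gains reported in the abstract.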