🤖 AI Summary
This work addresses the critical issue of high-risk errors in GUI grounding models when executing natural language instructions, stemming from the absence of reliable uncertainty estimation mechanisms. To mitigate this, the authors propose SafeGround, a novel framework that introduces uncertainty calibration with false discovery rate (FDR) control into GUI grounding for the first time. By integrating spatially aware uncertainty quantification and test-time threshold calibration, SafeGround enables risk-aware coordinate prediction and system-level risk control. Evaluated on the ScreenSpot-Pro benchmark, SafeGround substantially outperforms existing uncertainty estimation methods, achieving up to a 5.38 percentage point improvement in system-level accuracy.
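The "spatially aware uncertainty quantification" above can be illustrated with a minimal sketch: draw several stochastic coordinate predictions from the grounding model and score how scattered they are. The RMS-distance-to-centroid measure below is a plausible stand-in, not necessarily SafeGround's exact formula.

```python
import numpy as np

def spatial_dispersion(samples):
    """Uncertainty as the spatial dispersion of K stochastic (x, y) samples.

    Hypothetical measure: root-mean-square distance of the samples from
    their centroid. Tight clusters -> low uncertainty (the model agrees
    with itself); scattered samples -> high uncertainty.
    """
    pts = np.asarray(samples, dtype=float)        # shape (K, 2)
    centroid = pts.mean(axis=0)
    return float(np.sqrt(((pts - centroid) ** 2).sum(axis=1).mean()))

# Confident prediction: repeated sampling lands on nearly the same pixel.
tight = [(100, 200), (101, 199), (100, 201), (99, 200)]
# Uncertain prediction: samples scatter across the screen.
spread = [(100, 200), (400, 50), (250, 600), (10, 300)]
assert spatial_dispersion(tight) < spatial_dispersion(spread)
```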
📝 Abstract
Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. Nevertheless, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about model reliability. In this paper, we introduce SafeGround, an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through a calibration procedure performed before test time. SafeGround leverages a distribution-aware uncertainty quantification method to capture the spatial dispersion of stochastic samples drawn from the outputs of any given model. Then, through the calibration process, SafeGround derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control. We apply SafeGround to multiple GUI grounding models on the challenging ScreenSpot-Pro benchmark. Experimental results show that our uncertainty measure consistently outperforms existing baselines in distinguishing correct from incorrect predictions, while the calibrated threshold reliably enables rigorous risk control and the potential for substantial system-level accuracy improvements. Across multiple GUI grounding models, SafeGround improves system-level accuracy by up to 5.38 percentage points over Gemini-only inference.
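The calibration step described above can be sketched as follows: on a held-out calibration set of (uncertainty, correct) pairs, choose the largest uncertainty threshold whose empirical false discovery proportion stays at or below a target level α. This is a simplified illustration; a statistically guaranteed version (as the abstract claims) would add a finite-sample correction, e.g. in the style of Learn-then-Test or conformal risk control.

```python
def calibrate_threshold(uncertainties, correct, alpha=0.1):
    """Largest uncertainty threshold whose empirical false discovery
    proportion (share of accepted-but-wrong predictions) is <= alpha.

    At test time, a prediction is accepted iff its uncertainty is at
    or below the returned threshold; otherwise it is deferred (e.g.
    to a stronger fallback model or a human).
    """
    pairs = sorted(zip(uncertainties, correct))   # ascending uncertainty
    best, wrong = None, 0
    for i, (u, ok) in enumerate(pairs, start=1):
        wrong += 0 if ok else 1
        if wrong / i <= alpha:     # FDP among the i lowest-uncertainty items
            best = u
    return best

# Toy calibration set: the most uncertain prediction is the wrong one.
tau = calibrate_threshold([0.1, 0.2, 0.3, 0.9],
                          [True, True, True, False], alpha=0.2)
assert tau == 0.3  # accepting all four would give FDP 0.25 > alpha
```

The deferral behavior this enables (answer only when uncertainty is below the calibrated threshold, otherwise fall back) is what drives the system-level accuracy gains reported in the abstract.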