FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluation methods for GUI agents suffer from insufficient coverage, ambiguous target states, and overreliance on final task success rates, particularly in fine-grained, state-conditioned tasks, thereby hindering precise failure diagnosis. To address these limitations, this work introduces FineState-Bench, a benchmark comprising 2,209 instances across desktop, web, and mobile platforms, with explicitly defined target states for UI elements. The study further proposes a four-stage diagnostic metric and a plug-in Visual Diagnostic Assistant (VDA), which leverages large vision-language models, bounding-box prompting, and controlled contrastive experiments to enable fine-grained assessment of visual grounding and state-achievement capabilities. Experiments reveal that state-of-the-art models achieve only a 22.8% average exact-state interaction success rate across platforms, while integrating VDA boosts Gemini-2.5-Flash’s performance by 14.9 percentage points, highlighting both the critical bottleneck and substantial potential for improvement in visual grounding.

📝 Abstract

Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-state definitions, and an overreliance on final-task success, obscuring where and why agents fail. To address this gap, we introduce FineState-Bench, a benchmark that evaluates whether an agent can correctly ground an instruction to the intended UI control and reach the exact target state. FineState-Bench comprises 2,209 instances across desktop, web, and mobile platforms, spanning four interaction families and 23 UI component types, with each instance explicitly specifying an exact target state for fine-grained state setting. We further propose FineState-Metrics, a four-stage diagnostic pipeline with stage-wise success rates: Localization Success Rate (SR@Loc), Interaction Success Rate (SR@Int), Exact State Success Rate at Locate (ES-SR@Loc), and Exact State Success Rate at Interact (ES-SR@Int). We also propose a plug-and-play Visual Diagnostic Assistant (VDA) that generates a Description and a bounding-box Localization Hint, so that failures attributable to visual grounding can be diagnosed through controlled comparisons with and without the assistant. On FineState-Bench, exact goal-state success remains low: ES-SR@Int peaks at 32.8% on Web and 22.8% on average across platforms. With VDA localization hints, Gemini-2.5-Flash gains +14.9 ES-SR@Int points, suggesting substantial headroom from improved visual grounding, yet overall accuracy is still insufficient for reliable fine-grained state-conditioned interaction. Code: https://github.com/FengxianJi/FineState-Bench
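
To make the stage-wise reporting concrete, the following is a minimal sketch of how four such success rates could be tallied from per-instance outcomes. The abstract names the metrics but does not give their formal definitions, so everything here is an assumption for illustration: the `Outcome` schema, the three boolean flags, the choice to make each later stage conditional on the earlier ones, and the helper `gain_in_points` are hypothetical and are not taken from the FineState-Bench code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Outcome:
    """Per-instance result flags (hypothetical schema, not from the paper)."""
    located: bool      # predicted location falls on the intended UI control
    interacted: bool   # the action actually operated the intended control
    exact_state: bool  # final control state equals the specified target state


def stage_rates(outcomes: Iterable[Outcome]) -> Dict[str, float]:
    """Return the fraction of instances passing each assumed stage.

    Assumption: interaction success requires localization success, and the
    exact-state variants additionally require the exact target state.
    """
    rows = list(outcomes)
    if not rows:
        raise ValueError("no outcomes to score")
    n = len(rows)

    def rate(passes) -> float:
        return sum(1 for r in rows if passes(r)) / n

    return {
        "SR@Loc": rate(lambda r: r.located),
        "SR@Int": rate(lambda r: r.located and r.interacted),
        "ES-SR@Loc": rate(lambda r: r.located and r.exact_state),
        "ES-SR@Int": rate(
            lambda r: r.located and r.interacted and r.exact_state
        ),
    }


def gain_in_points(with_hint: float, without_hint: float) -> float:
    """Difference between two rates, expressed in percentage points."""
    return round((with_hint - without_hint) * 100, 1)
```

Under these assumptions, running `stage_rates` once on outcomes collected with the localization hint and once without it, then passing the two ES-SR@Int values to `gain_in_points`, yields the kind of percentage-point comparison the abstract reports. Reporting all four rates side by side is what lets a drop between stages be attributed to localization, interaction, or state setting rather than to the task as a whole.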

Problem

Research questions and friction points this paper is trying to address.

state-conditioned grounding

fine-grained GUI interaction

target-state definition

visual grounding

GUI benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

state-conditioned grounding

fine-grained GUI interaction

diagnostic evaluation metrics