FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GUI agent evaluation frameworks overemphasize task completion rates while neglecting fine-grained state control capabilities. To address this, we propose FineState-Bench, the first benchmark dedicated to evaluating fine-grained GUI state control across heterogeneous platforms (desktop, web, and mobile), comprising 2,257 diverse tasks. We introduce a four-stage assessment pipeline—Perceive → Locate → Decide → Execute—that enables the first quantitative, decoupled analysis of visual perception and localization abilities. Furthermore, we design a hybrid evaluation paradigm integrating static screenshots with dynamic interaction traces, and open-source the Visual Diagnostic Assistant (VDA), a plug-and-play visual diagnostic toolkit. Experiments reveal that state-of-the-art models achieve only 32.8% accuracy in fine-grained interaction; moreover, ideal visual localization boosts Gemini-2.5-Flash's success rate by 14.9%. All benchmark data, code, and tools are publicly released.
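The four-stage pipeline above can be sketched as a decoupled, stage-by-stage scoring loop. The code below is a minimal illustration only: every name (`Task`, `Trace`, the 0.5 IoU threshold, the field layout) is a hypothetical assumption for exposition, not the benchmark's actual API or scoring rule.

```python
from dataclasses import dataclass

# Hypothetical task/trace schemas -- placeholders, not FineState-Bench's real format.
@dataclass
class Task:
    target: str        # widget the agent must manipulate
    gold_box: tuple    # (x1, y1, x2, y2) ground-truth location
    gold_action: str   # e.g. "toggle_on"
    gold_state: str    # expected final widget state

@dataclass
class Trace:
    perceived: list    # elements the agent reports seeing
    pred_box: tuple    # predicted bounding box for the target
    action: str        # action the agent decided on
    final_state: str   # widget state observed after execution

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def score(task: Task, trace: Trace) -> dict:
    """Pass/fail per stage; each later stage counts only if earlier stages passed,
    which is what makes the analysis 'decoupled' -- a Decide failure is never
    blamed on a run that already failed to Locate."""
    perceive = task.target in trace.perceived
    locate = perceive and iou(trace.pred_box, task.gold_box) >= 0.5
    decide = locate and trace.action == task.gold_action
    execute = decide and trace.final_state == task.gold_state
    return {"perceive": perceive, "locate": locate,
            "decide": decide, "execute": execute}
```

Gating each stage on its predecessors is one plausible way to attribute an episode's failure to a single capability, mirroring how the paper isolates localization as the bottleneck.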

📝 Abstract
With the rapid advancement of generative artificial intelligence, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks through natural language instructions. However, current evaluation frameworks for GUI agents suffer from a fundamental flaw: existing benchmarks overly focus on coarse-grained task completion while neglecting the fine-grained control capabilities crucial for real-world applications. To address this, we introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI agent operations, designed to quantify fine-grained control. This multi-platform (desktop, web, mobile) framework includes 2,257 benchmark tasks in four components and uses a four-phase metric for comprehensive perception-to-control assessment. To analyze perception and localization in fine-grained operations, we developed the plug-and-play Visual Diagnostic Assistant (VDA), enabling the first quantitative decoupling analysis of these capabilities. Experimental results on our benchmark show that the most advanced models achieve only 32.8% fine-grained interaction accuracy. In controlled experiments using the VDA to quantify the impact of visual capabilities, we showed that ideal visual localization boosts Gemini-2.5-Flash's success rate by 14.9%. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI agents is basic visual grounding capability. All resources are fully open-source.
GitHub: https://github.com/AnonymousThewarehouse/FineState-Bench
Hugging Face: https://huggingface.co/datasets/Willtime2006/Static-FineBench
Problem

Research questions and friction points this paper is trying to address.

Evaluates fine-grained control in GUI agents for real-world tasks
Quantifies visual perception and positioning impact on interaction accuracy
Identifies visual positioning as the main bottleneck in GUI agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces FineState-Bench for fine-grained GUI control
Develops Visual Diagnostic Assistant for decoupled analysis
Quantifies visual capability impact on interaction accuracy
Authors
Fengxian Ji — Northeastern University, China (agents, machine learning, CV)
Jingpu Yang — Northeastern University, China
Zirui Song — PhD student, MBZUAI (NLP)
Yuanxi Wang — Northeastern University, China
Zhexuan Cui — Northeastern University, China
Yuke Li — Northeastern University, China
Qian Jiang — Northeastern University, China
Miao Fang — Northeastern University, China
Xiuying Chen — MBZUAI (Trustworthy NLP, Human-Centered NLP, Computational Social Science)