NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset

📅 2025-08-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing GUI agent benchmarks suffer from significant limitations in accuracy, reproducibility, and scalability. To address these issues, this paper introduces NatureGAIA, a novel, causality-path-driven benchmark grounded in programmable verification of atomic task sequences, enabling a decomposable and rigorously reproducible evaluation framework. The authors propose a hierarchical agent architecture and apply reinforcement fine-tuning (RFT) to Qwen2.5-VL-7B, yielding a high-quality, human-verified GUI trajectory dataset that captures explicit self-correction behavior. Furthermore, they empirically uncover a critical capability bottleneck of small-scale models in long-horizon GUI tasks. Experiments reveal that even the state-of-the-art Claude-sonnet-4 achieves only a 34.6% Weighted Pathway Success Rate (WPSR); RFT improves the smaller model's WPSR from 3.3% to 10.8%, yet performance degrades substantially in complex interactive scenarios.

πŸ“ Abstract
The rapid advancement of Large Language Model (LLM)-driven Graphical User Interface (GUI) agents is significantly hampered by the profound limitations of existing evaluation benchmarks in terms of accuracy, reproducibility, and scalability. To address this critical gap, we introduce NatureGAIA, a novel benchmark engineered on the principle of Causal Pathways. This design paradigm structures complex tasks into a series of programmatically verifiable atomic steps, ensuring a rigorous, fully automated, and reproducible standard for assessment. Concurrently, to mitigate the inherent capability deficits of agents, we developed a hierarchical agent architecture specifically optimized for long-horizon tasks. We leveraged this agent to generate a high-quality, human-verified trajectory dataset that uniquely captures diverse and even self-correcting interaction patterns of LLMs. We then utilized this dataset to perform Reinforcement Fine-Tuning (RFT) on the Qwen2.5-VL-7B model. Our experiments reveal that NatureGAIA presents a formidable challenge to current state-of-the-art LLMs; even the top-performing Claude-sonnet-4 achieved a Weighted Pathway Success Rate (WPSR) of only 34.6%. Moreover, while RFT substantially improved the smaller model's GUI execution capabilities (WPSR increased from 3.3% to 10.8%), its performance degraded sharply when handling complex scenarios. This outcome highlights the inherent capability ceiling of smaller models when faced with comprehensive tasks that integrate perception, decision-making, and execution. This research contributes a rigorous evaluation standard and a high-quality dataset to the community, aiming to guide the future development of GUI agents.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in GUI agent benchmarks for accuracy and reproducibility
Developing a hierarchical agent for complex, long-horizon GUI tasks
Enhancing smaller models' GUI capabilities via Reinforcement Fine-Tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Pathways benchmark for GUI agent evaluation
Hierarchical agent architecture for long-horizon tasks
Reinforcement Fine-Tuning with human-verified trajectory dataset
Zihan Zheng
Ph.D. Candidate, Shanghai Jiao Tong University
artificial intelligence, deep learning, computer vision
Tianle Cui
South China Normal University
Chuwen Xie
South China Normal University
Jiahui Zhang
South China Normal University
Jiahui Pan
South China Normal University
Lewei He
South China Normal University
3D Printing, Deep Learning
Qianglong Chen
Zhejiang University