AI Summary
Existing GUI agent benchmarks suffer from significant limitations in accuracy, reproducibility, and scalability. To address these issues, this paper introduces NatureGAIA, a novel, causality-path-driven benchmark grounded in programmable verification of atomic task sequences, enabling a decomposable and rigorously reproducible evaluation framework. The authors propose a hierarchical agent architecture and use it, together with Reinforcement Fine-Tuning (RFT) of Qwen2.5-VL-7B, to produce a high-quality, human-verified GUI trajectory dataset that captures explicit self-correction behavior. They further empirically uncover a critical capability bottleneck of small-scale models on long-horizon GUI tasks. Experiments reveal that even the state-of-the-art Claude-sonnet-4 achieves only a 34.6% Weighted Pathway Success Rate (WPSR); RFT improves the smaller model's WPSR from 3.3% to 10.8%, yet its performance degrades substantially in complex interactive scenarios.
Abstract
The rapid advancement of Large Language Model (LLM)-driven Graphical User Interface (GUI) agents is significantly hampered by the profound limitations of existing evaluation benchmarks in terms of accuracy, reproducibility, and scalability. To address this critical gap, we introduce NatureGAIA, a novel benchmark engineered on the principle of Causal Pathways. This design paradigm structures complex tasks into a series of programmatically verifiable atomic steps, ensuring a rigorous, fully automated, and reproducible standard for assessment. Concurrently, to mitigate the inherent capability deficits of agents, we developed a hierarchical agent architecture specifically optimized for long-horizon tasks. We leveraged this agent to generate a high-quality, human-verified trajectory dataset that uniquely captures diverse and even self-correcting interaction patterns of LLMs. We then used this dataset to perform Reinforcement Fine-Tuning (RFT) on the Qwen2.5-VL-7B model. Our experiments reveal that NatureGAIA presents a formidable challenge to current state-of-the-art LLMs; even the top-performing Claude-sonnet-4 achieved a Weighted Pathway Success Rate (WPSR) of only 34.6%. Moreover, while RFT substantially improved the smaller model's GUI execution capabilities (WPSR rose from 3.3% to 10.8%), its performance degraded sharply on complex scenarios. This outcome highlights the inherent capability ceiling of smaller models when faced with comprehensive tasks that integrate perception, decision-making, and execution. This research contributes a rigorous evaluation standard and a high-quality dataset to the community, aiming to guide the future development of GUI agents.
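The abstract does not give WPSR's exact formula, but the Causal Pathway idea of scoring programmatically verifiable atomic steps can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `AtomicStep` structure, the per-step weights, and the example verifiers are hypothetical, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AtomicStep:
    """One programmatically verifiable step on a task's causal pathway (illustrative)."""
    name: str
    weight: float                      # hypothetical importance weight
    verify: Callable[[dict], bool]     # programmatic check against observed environment state

def weighted_pathway_success_rate(steps: list[AtomicStep], state: dict) -> float:
    """Credit each step whose verifier passes; normalize by total weight (assumed definition)."""
    total = sum(s.weight for s in steps)
    earned = sum(s.weight for s in steps if s.verify(state))
    return earned / total

# Hypothetical task: rename a file through a GUI file manager.
steps = [
    AtomicStep("open_folder", 1.0, lambda st: st.get("folder_open", False)),
    AtomicStep("rename_file", 2.0, lambda st: st.get("file_name") == "report_final.txt"),
]
state = {"folder_open": True, "file_name": "report.txt"}  # agent opened the folder but never renamed
print(weighted_pathway_success_rate(steps, state))  # 1.0 / 3.0 ≈ 0.333
```

Because each atomic step is checked against environment state rather than judged from screenshots or logs, a score computed this way is deterministic and reproducible across runs, which is the property the benchmark's design emphasizes.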