AI Summary
Existing GUI agent benchmarks suffer from significant limitations in accuracy, reproducibility, and scalability. To address these issues, this paper introduces NatureGAIA, a novel, causality-path-driven benchmark grounded in programmable verification of atomic task sequences, enabling a decomposable and rigorously reproducible evaluation framework. The authors propose a hierarchical agent architecture and use it, together with Reinforcement Fine-Tuning (RFT) of Qwen2.5-VL-7B, to produce a high-quality, human-verified GUI trajectory dataset that captures explicit self-correction behavior. They further empirically uncover a critical capability bottleneck of small-scale models on long-horizon GUI tasks. Experiments reveal that even the state-of-the-art Claude-sonnet-4 achieves only a 34.6% Weighted Pathway Success Rate (WPSR); RFT improves the smaller model's WPSR from 3.3% to 10.8%, yet its performance degrades substantially in complex interactive scenarios.
Abstract
The rapid advancement of Large Language Model (LLM)-driven Graphical User Interface (GUI) agents is significantly hampered by the profound limitations of existing evaluation benchmarks in terms of accuracy, reproducibility, and scalability. To address this critical gap, we introduce NatureGAIA, a novel benchmark engineered on the principle of Causal Pathways. This design paradigm structures complex tasks into a series of programmatically verifiable atomic steps, ensuring a rigorous, fully automated, and reproducible standard for assessment. Concurrently, to mitigate the inherent capability deficits of agents, we developed a hierarchical agent architecture specifically optimized for long-horizon tasks. We leveraged this agent to generate a high-quality, human-verified trajectory dataset that uniquely captures diverse and even self-correcting interaction patterns of LLMs. We then used this dataset to perform Reinforcement Fine-Tuning (RFT) on the Qwen2.5-VL-7B model. Our experiments reveal that NatureGAIA presents a formidable challenge to current state-of-the-art LLMs; even the top-performing Claude-sonnet-4 achieved a Weighted Pathway Success Rate (WPSR) of only 34.6%. Moreover, while RFT substantially improved the smaller model's GUI execution capabilities (WPSR rose from 3.3% to 10.8%), its performance degraded sharply on complex scenarios. This outcome highlights the inherent capability ceiling of smaller models when faced with comprehensive tasks that integrate perception, decision-making, and execution. This research contributes a rigorous evaluation standard and a high-quality dataset to the community, aiming to guide the future development of GUI agents.
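The abstract does not give WPSR's exact formula, but the Causal Pathway idea of scoring programmatically verifiable atomic steps can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `AtomicStep` structure, the per-step weights, and the example verifiers are hypothetical, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AtomicStep:
    """One programmatically verifiable step on a task's causal pathway (illustrative)."""
    name: str
    weight: float                      # hypothetical importance weight
    verify: Callable[[dict], bool]     # programmatic check against observed environment state

def weighted_pathway_success_rate(steps: list[AtomicStep], state: dict) -> float:
    """Credit each step whose verifier passes; normalize by total weight (assumed definition)."""
    total = sum(s.weight for s in steps)
    earned = sum(s.weight for s in steps if s.verify(state))
    return earned / total

# Hypothetical task: rename a file through a GUI file manager.
steps = [
    AtomicStep("open_folder", 1.0, lambda st: st.get("folder_open", False)),
    AtomicStep("rename_file", 2.0, lambda st: st.get("file_name") == "report_final.txt"),
]
state = {"folder_open": True, "file_name": "report.txt"}  # agent opened the folder but never renamed
print(weighted_pathway_success_rate(steps, state))  # 1.0 / 3.0 ≈ 0.333
```

Because each atomic step is checked against environment state rather than judged from screenshots or logs, a score computed this way is deterministic and reproducible across runs, which is the property the benchmark's design emphasizes.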