🤖 AI Summary
GUI agents are hindered by the scarcity of high-quality, scalable interactive trajectory data: manual annotation is costly and inconsistent, while synthetic data often trades off diversity against task fidelity. This paper introduces GUI-ReWalk, the first framework to integrate stochastic exploration with intent-aware reasoning for dual-mode trajectory generation. By coordinating large language models (LLMs) and vision-language models (VLMs), it implements an intent inference module and a multi-stage trajectory synthesis mechanism, enabling cross-application, multi-step, long-horizon workflow modeling. Evaluated on benchmarks including Screenspot-Pro and OSWorld-G, GUI-ReWalk significantly improves trajectory entropy (+23.6%), interaction-flow coverage (+31.4%), and user-intent fidelity. When used to train Qwen2.5-VL-7B, the resulting model achieves substantial gains in GUI understanding and task execution over baseline methods.
📝 Abstract
Graphical User Interface (GUI) agents, powered by large language and vision-language models, hold promise for enabling end-to-end automation in digital environments. However, their progress is fundamentally constrained by the scarcity of scalable, high-quality trajectory data. Existing data collection strategies rely either on costly and inconsistent manual annotation or on synthetic generation methods that trade diversity against meaningful task coverage. To bridge this gap, we present GUI-ReWalk: a reasoning-enhanced, multi-stage framework for synthesizing realistic and diverse GUI trajectories. GUI-ReWalk begins with a stochastic exploration phase that emulates human trial-and-error behaviors, and progressively transitions into a reasoning-guided phase where inferred goals drive coherent and purposeful interactions. Moreover, it supports multi-stride task generation, enabling the construction of long-horizon workflows across multiple applications. By combining randomness for diversity with goal-aware reasoning for structure, GUI-ReWalk produces data that better reflects the intent-aware, adaptive nature of human-computer interaction. We further train Qwen2.5-VL-7B on the GUI-ReWalk dataset and evaluate it across multiple benchmarks, including Screenspot-Pro, OSWorld-G, UI-Vision, AndroidControl, and GUI-Odyssey. Results demonstrate that GUI-ReWalk enables superior coverage of diverse interaction flows, higher trajectory entropy, and more realistic user intent. These findings establish GUI-ReWalk as a scalable and data-efficient framework for advancing GUI agent research and enabling robust real-world automation.
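The two-phase pipeline the abstract describes — stochastic exploration followed by reasoning-guided, goal-directed steps — can be sketched in outline. This is a minimal illustrative sketch, not the paper's implementation: `infer_goal` and `plan_step` are hypothetical stand-ins for the LLM/VLM components, and the action representation is simplified to strings.

```python
import random

def synthesize_trajectory(actions, infer_goal, plan_step, n_explore=3, n_goal=5):
    """Sketch of dual-mode trajectory generation.

    `actions` lists candidate UI actions; `infer_goal` and `plan_step`
    are hypothetical callables standing in for the intent inference and
    reasoning-guided stepping described in the paper.
    """
    trajectory = []
    # Phase 1: stochastic exploration emulating human trial-and-error.
    for _ in range(n_explore):
        trajectory.append(random.choice(actions))
    # Infer a plausible user intent from the explored prefix.
    goal = infer_goal(trajectory)
    # Phase 2: reasoning-guided steps pursuing the inferred goal.
    for _ in range(n_goal):
        trajectory.append(plan_step(goal, trajectory))
    return goal, trajectory
```

The random prefix supplies diversity (higher trajectory entropy), while the goal-conditioned suffix supplies structure; chaining several such goal phases would correspond to the multi-stride, long-horizon setting.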