🤖 AI Summary
GUI agents suffer from poor generalization due to the scarcity of high-quality trajectory data. To address this, we propose a multi-task mid-training paradigm and, notably, find that purely textual mathematical reasoning data yields substantial cross-modal gains on GUI agent benchmarks (+5.6% on WebArena and +5.4% on AndroidWorld), whereas GUI-perception data, long assumed to be closely aligned with GUI agent tasks, contributes only marginally. Guided by a task-efficacy evaluation across 11 candidate mid-training tasks, we curate an optimized data mixture that integrates data-rich non-GUI tasks (e.g., mathematical and textual reasoning) to drive cross-modal knowledge transfer, requiring no additional GUI trajectory annotation and relying only on readily available instruction-tuning data. The resulting mixture delivers absolute improvements of +8.0% on WebArena and +12.2% on AndroidWorld; as a single mid-training task, multimodal mathematical reasoning alone improves AndroidWorld by an absolute +6.3%.
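To make the mixture strategy concrete, here is a minimal Python sketch of weighted sampling across mid-training sources. The source names and weights below are illustrative assumptions, not the paper's released configuration; the actual task list and proportions come from the paper's efficacy study.

```python
import random

# Hypothetical mid-training sources and mixture weights (illustrative only).
MIXTURE = {
    "math_text": 0.4,        # text-only mathematical reasoning
    "math_multimodal": 0.3,  # multimodal mathematical reasoning
    "text_reasoning": 0.2,   # general textual reasoning
    "gui_perception": 0.1,   # kept small: limited measured impact
}

def sample_source(rng: random.Random) -> str:
    """Draw one mid-training source according to the mixture weights."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

def build_batch(datasets: dict, batch_size: int, seed: int = 0) -> list:
    """Assemble a mixed mid-training batch, sampling a source per example."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        source = sample_source(rng)
        batch.append(rng.choice(datasets[source]))
    return batch
```

In practice the weights would be set from the per-task efficacy results rather than fixed by hand, but the sampling mechanics stay the same.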
📝 Abstract
Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) task generalization proves highly effective, yielding substantial improvements across most settings; for instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and a 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) contrary to prior assumptions, GUI perception data, previously considered closely aligned with GUI agent tasks and widely used for training, has a comparatively limited impact on final performance; (3) building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data, and models will be available at https://github.com/hkust-nlp/GUIMid.
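The staged recipe the abstract describes can be summarized in a short sketch. The following is a minimal, self-contained Python illustration of the two-stage flow (mid-training on data-rich tasks, then fine-tuning on scarce GUI trajectories); the class and function names are hypothetical stand-ins, not the released training code.

```python
class ToyModel:
    """Stand-in for a VLM; records every update it receives."""
    def __init__(self):
        self.updates = []

    def step(self, example):
        # Placeholder for a forward/backward/optimizer update.
        self.updates.append(example)

def train(model, dataset, epochs=1):
    """Placeholder SFT loop: one update per example per epoch."""
    for _ in range(epochs):
        for example in dataset:
            model.step(example)
    return model

def two_stage_recipe(model, mid_training_mix, gui_trajectories):
    # Stage 1: mid-training on data-rich, reasoning-intensive tasks
    # (e.g., the mathematical/textual reasoning mixture).
    model = train(model, mid_training_mix)
    # Stage 2: fine-tuning on the limited GUI trajectory data, which the
    # mid-trained model now generalizes from more effectively.
    model = train(model, gui_trajectories)
    return model
```

The point of the sketch is the ordering: the reasoning-heavy mixture is consumed before any GUI trajectory data, so the scarce trajectories are used only for the final adaptation step.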