Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

📅 2026-05-26
🤖 AI Summary
Mobile GUI navigation research has been held back by the lack of large-scale real-world datasets and standardized evaluation benchmarks. This work introduces HyperTrack, a dataset of over 16,000 real-world tasks spanning more than 650 Chinese mobile applications, and releases GUIEvalKit, an open-source toolkit for standardized offline evaluation. A systematic comparison of supervised fine-tuning and reinforcement fine-tuning shows that the latter consistently outperforms the former, particularly in out-of-domain settings. Further experiments analyze how interaction history and reasoning capabilities influence the task completion of vision-language models. Together, the dataset and toolkit provide infrastructure and empirical evidence for developing and evaluating VLM-based agents in mobile GUI navigation.
📝 Abstract
Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16,000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based fine-tuning. Our results show that reinforcement-based fine-tuning consistently outperforms supervised fine-tuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Mobile GUI Navigation
Data Scaling
Benchmarking
Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Mobile GUI Navigation
Data Scaling
Reinforcement Learning
Benchmarking