🤖 AI Summary
This work addresses the challenge that existing GUI agents struggle to replicate human-like swiping behaviors, which has become a critical bottleneck in task execution. To overcome this limitation, we propose SwipeGen, a novel framework that, for the first time, decomposes human swiping gestures into multiple quantifiable dimensions and constructs a GUI exploration-driven synthetic data pipeline. By fine-tuning vision-language models with this synthetic data, our approach significantly enhances agents' swiping capabilities. Our contributions include the first benchmark specifically designed for evaluating swiping performance and a new paradigm for augmenting interactive skills through synthetic data. The resulting agent, GUISwiper, achieves a swiping accuracy of 69.07%, a 214% improvement over current vision-language model baselines.
📄 Abstract
With the widespread adoption of Graphical User Interface (GUI) agents for automating GUI interaction tasks, substantial research has focused on improving GUI perception to ground task instructions into concrete action steps. However, the step execution capability of these agents has gradually emerged as a new bottleneck for task completion. In particular, existing GUI agents often adopt overly simplified strategies for handling swipe interactions, preventing them from accurately replicating human-like behavior. To address this limitation, we decompose human swipe gestures into multiple quantifiable dimensions and propose SwipeGen, an automated pipeline that synthesizes human-like swipe interactions through GUI exploration. Based on this pipeline, we construct and release the first benchmark for evaluating the swipe execution capability of GUI agents. Furthermore, leveraging the synthesized data, we propose GUISwiper, a GUI agent with enhanced interaction execution capabilities. Experimental results demonstrate that GUISwiper achieves a swipe execution accuracy of 69.07%, a 214% improvement over existing VLM baselines.