🤖 AI Summary
High-quality UI trajectory data for training digital agents is scarce, and manual annotation or real-world collection is prohibitively expensive.
Method: This paper proposes UI-Simulator—a large language model (LLM)-driven digital world simulator that combines guided rollout exploration with trajectory wrapping to autonomously generate large-scale, diverse, structured UI state-transition trajectories. On top of this, UI-Simulator-Grow adds a targeted scaling strategy that prioritizes high-impact task trajectories and synthesizes informative variants, enabling data-efficient training even with small base models (e.g., Llama-3-8B-Instruct).
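The simulate-rollout-wrap loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and function names are hypothetical, the LLM simulator and policy are replaced by trivial stubs, and the "structured UI state" is reduced to a description plus a list of afforded actions.

```python
import random
from dataclasses import dataclass, field

@dataclass
class UIState:
    """A structured UI state: a description plus the actions it affords."""
    description: str
    actions: list

@dataclass
class Trajectory:
    task: str
    steps: list = field(default_factory=list)  # (state description, action) pairs

def simulate_transition(state, action):
    """Stand-in for the LLM world simulator: given a state and an action,
    propose the next structured UI state (trivial stub here)."""
    return UIState(
        description=f"{state.description} -> {action}",
        actions=["click", "type", "scroll", "stop"],
    )

def guided_rollout(task, start, max_steps=5, rng=random):
    """Guided rollout: at each step an agent policy (an LLM in the real
    system; random choice here) picks an action and the simulator
    produces the resulting state, yielding a coherent trajectory."""
    traj, state = Trajectory(task), start
    for _ in range(max_steps):
        action = rng.choice(state.actions)
        traj.steps.append((state.description, action))
        if action == "stop":
            break
        state = simulate_transition(state, action)
    return traj

def wrap_trajectory(traj):
    """Trajectory wrapper: package the rollout as a training example
    (task instruction plus the action sequence)."""
    return {"task": traj.task, "actions": [a for _, a in traj.steps]}
```

In the actual paradigm, each of the three stubs is an LLM call; the point of the sketch is only the division of labor between simulator, rollout policy, and wrapper.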
Contribution/Results: On the WebArena and AndroidWorld benchmarks, agents trained with UI-Simulator match or surpass open-source agents trained on real UI data, with markedly better robustness; UI-Simulator-Grow with a Llama-3-8B-Instruct base matches the performance of Llama-3-70B-Instruct. The approach improves generalization and training efficiency while reducing reliance on costly human annotation and real-environment data collection.
📝 Abstract
Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive in terms of human annotation, infrastructure, and engineering. To this end, we introduce **UI-Simulator**, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training. We further propose **UI-Simulator-Grow**, a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizing informative trajectory variants. Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of the targeted synthesis-scaling paradigm to continuously and efficiently enhance digital agents.
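The targeted scaling idea in UI-Simulator-Grow—spend the synthesis budget on high-impact tasks and generate variants of them—can be sketched as below. The scoring criterion and function names are assumptions for illustration (here, tasks are ranked by the agent's current failure rate); the paper's actual impact measure and variant synthesis are LLM-driven.

```python
def prioritize_tasks(task_scores, budget):
    """Targeted scaling sketch (assumed criterion, not the paper's exact
    one): rank tasks by how much the agent still fails on them and keep
    the top `budget` tasks for further synthesis."""
    ranked = sorted(task_scores, key=lambda pair: pair[1], reverse=True)
    return [task for task, _ in ranked[:budget]]

def synthesize_variants(task, n=3):
    """Produce informative variants of a high-impact task; in the real
    system an LLM rewrites the task, here we merely tag copies."""
    return [f"{task} (variant {i})" for i in range(1, n + 1)]
```

For example, given failure rates `[("search", 0.9), ("login", 0.2), ("checkout", 0.7)]` and a budget of 2, the sketch keeps "search" and "checkout" and expands each into variants for the next training round.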