Scaling Synthetic Task Generation for Agents via Exploration

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current research on multimodal large language models (MLLMs) as interactive agents is hindered by the scarcity of high-quality, diverse, and verifiable downstream agent benchmark datasets. Method: We propose AutoPlay—a novel “exploration–generation” co-design framework—where an MLLM acts as a unified agent to autonomously explore interface states and functionalities in real-world mobile and Ubuntu desktop environments, synthesizing environment-anchored, executable, and verifiable tasks without human annotation. By integrating exploration trajectories with structured prompting, we construct a large-scale benchmark comprising 20K mobile and 10K Ubuntu tasks. Contribution/Results: Fine-tuning UI agents on this benchmark improves task success rates by +20.0% (mobile) and +10.9% (desktop); further integrating reward modeling yields an additional +5.7% gain.

📝 Abstract
Post-training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer use, web navigation, and robotics. A key challenge in scaling such post-training is the lack of high-quality downstream agentic task datasets with tasks that are diverse, feasible, and verifiable. Existing approaches to task generation rely heavily on human annotation or on prompting an MLLM with limited downstream environment information, which is either costly or scales poorly because it yields tasks with limited coverage. To remedy this, we present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to discover possible interactions and current state information, then synthesizes environment-grounded tasks. AutoPlay operates in two stages: (i) an exploration phase, where an MLLM explorer agent systematically uncovers novel environment states and functionalities, and (ii) a task generation phase, where a task generator leverages exploration trajectories and a set of task guideline prompts as context to synthesize diverse, executable, and verifiable tasks. We show AutoPlay generates 20k tasks across 20 Android applications and 10k tasks across 13 Ubuntu applications to train mobile-use and computer-use agents. AutoPlay-generated tasks enable large-scale task-demonstration synthesis without human annotation by employing an MLLM task executor and verifier. This data enables training MLLM-based UI agents that improve success rates by up to 20.0% on mobile-use and 10.9% on computer-use scenarios. In addition, AutoPlay-generated tasks combined with MLLM-verifier-based rewards enable scaling reinforcement learning training of UI agents, leading to an additional 5.7% gain. These results establish AutoPlay as a scalable approach for post-training capable MLLM agents while reducing reliance on human annotation.
Problem

Research questions and friction points this paper is trying to address.

Generating diverse and verifiable agent tasks without costly human annotation
Scaling multimodal agent training through automated environment exploration
Creating executable tasks for mobile and computer interaction agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

AutoPlay pipeline explores environments to synthesize tasks
Two-stage process: exploration and task generation phases
Generates executable tasks without human annotation for training
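The two-stage pipeline above can be sketched as a simple loop: an explorer agent acts in the environment to collect state–action trajectories, and a generator conditions on those trajectories plus guideline prompts to emit tasks. This is a minimal illustrative sketch inferred from the summary; all names (`Env`, the explorer/generator callables, `max_steps`) are hypothetical stand-ins, not the paper's actual API.

```python
# Hypothetical sketch of AutoPlay's "exploration -> generation" co-design.
# Stage (i) collects environment-grounded trajectories; stage (ii) turns
# them into candidate tasks. The Env/MLLM interfaces are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Step:
    state: str   # e.g. a serialized UI screenshot or accessibility tree
    action: str  # e.g. "tap('Settings')"


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)


def explore(env, explorer_mllm, max_steps=10):
    """Stage (i): the explorer agent uncovers environment states."""
    traj = Trajectory()
    state = env.reset()
    for _ in range(max_steps):
        action = explorer_mllm(state)          # agent picks an action to try
        traj.steps.append(Step(state, action))
        state = env.step(action)               # environment transitions
    return traj


def generate_tasks(traj, generator_mllm, guidelines):
    """Stage (ii): synthesize tasks grounded in the explored trajectory."""
    context = "\n".join(f"{s.state} -> {s.action}" for s in traj.steps)
    return [generator_mllm(f"{g}\nTrajectory:\n{context}") for g in guidelines]
```

In practice the resulting tasks would then be handed to an MLLM executor and verifier to produce training demonstrations, as the summary describes; that filtering step is omitted here.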