🤖 AI Summary
This work addresses a key limitation of current mobile-agent research: the lack of scalable, controllable, and reproducible phone-use environments. The authors propose PhoneWorld, an automated pipeline that converts real-user GUI interaction traces (screenshots and interaction logs) into executable simulated Android environments, task specifications, rule-based verifiers, and training trajectories. By reconstructing screen state graphs and interaction logic, and by combining read-only app content with mutable state, the method supports the construction of large-scale, diverse environments. The current instantiation covers 34 applications across 16 domains. Under a fixed training budget, the approach improves performance on all four evaluation benchmarks, with the score on the PhoneWorld benchmark rising by 52.5 points. Additional scaling experiments indicate that both the amount of PhoneWorld supervision and the breadth of app coverage improve agent performance.
📝 Abstract
A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.
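To make the described architecture concrete, the following is a minimal sketch of how a mock app with a screen graph, read-only content, mutable state, and a rule-based verifier might fit together. It is an illustration only, not the authors' implementation: `MockApp`, `step`, `cart_contains`, the action names, and the sample catalog are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class MockApp:
    """Hypothetical mock app; every name here is an illustrative assumption."""

    # Screen graph: (current screen, action) -> next screen.
    transitions: Dict[Tuple[str, str], str]
    # Read-only app content, e.g. a product catalog keyed by item id.
    content: Dict[str, dict]
    # Mutable environment state that interactions are allowed to change.
    state: Dict[str, List[str]] = field(default_factory=lambda: {"cart": []})
    screen: str = "home"

    def step(self, action: str, arg: Optional[str] = None) -> str:
        """Apply one action and return the resulting screen."""
        key = (self.screen, action)
        if key not in self.transitions:
            return self.screen  # action not available here: stay put
        # Only this action mutates state in the sketch.
        if action == "add_to_cart" and arg in self.content:
            self.state["cart"].append(arg)
        self.screen = self.transitions[key]
        return self.screen


def cart_contains(item_id: str) -> Callable[[MockApp], bool]:
    """Build a rule-based verifier that checks final state, not the path taken."""
    def verify(app: MockApp) -> bool:
        return item_id in app.state["cart"]
    return verify


# Assumed toy environment: three transitions and a one-item catalog.
app = MockApp(
    transitions={
        ("home", "open_search"): "search",
        ("search", "open_item"): "item",
        ("item", "add_to_cart"): "cart",
    },
    content={"sku-1": {"name": "USB-C cable"}},
)
verifier = cart_contains("sku-1")

# A rollout is a sequence of (action, argument) pairs.
rollout = [("open_search", None), ("open_item", "sku-1"), ("add_to_cart", "sku-1")]
for action, arg in rollout:
    app.step(action, arg)

print(verifier(app))  # True
```

In this sketch the verifier inspects only the final mutable state, so any rollout that reaches the goal passes regardless of the route taken. That is one plausible reading of "rule-based verifiers" derived from the same environment; the paper may define them differently.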