🤖 AI Summary
Existing approaches to mobile UI testing and AI agent training face challenges in modeling dynamic environments, as they rely either on physical devices or static screenshots, limiting scalability and realism. To address this, we propose an image-based interactive UI simulator that employs a two-stage paradigm: first predicting the structured layout of the next UI state, then synthesizing a visually consistent screen image conditioned on that layout. This enables high-fidelity, temporally coherent UI transition simulation. The system integrates UI layout prediction with state-of-the-art diffusion-based image generation, supporting end-to-end modeling and rendering of UI state sequences. Experiments demonstrate significant improvements over end-to-end baselines in visual authenticity, state continuity, and interaction plausibility. Our approach provides a scalable, lightweight simulation infrastructure for UI automation testing, rapid prototyping, and embodied AI agent training.
📝 Abstract
Developing and testing user interfaces (UIs) and training AI agents to interact with them are challenging due to the dynamic and diverse nature of real-world mobile environments. Existing methods often rely on cumbersome physical devices or limited static analysis of screenshots, which hinders scalable testing and the development of intelligent UI agents. We introduce UISim, a novel image-based UI simulator that offers a dynamic and interactive platform for exploring mobile phone environments purely from screen images. Our system employs a two-stage method: given an initial phone screen image and a user action, it first predicts the abstract layout of the next UI state, then synthesizes a new, visually consistent image based on this predicted layout. This approach enables the realistic simulation of UI transitions. UISim provides immediate practical benefits for UI testing, rapid prototyping, and synthetic data generation. Furthermore, its interactive capabilities pave the way for advanced applications, such as UI navigation task planning for AI agents. Our experimental results show that UISim outperforms end-to-end UI generation baselines in generating realistic and coherent subsequent UI states, highlighting its fidelity and potential to streamline UI development and enhance AI agent training.
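The two-stage transition described in the abstract can be sketched as a simple simulation loop: stage 1 predicts the next abstract layout from the current screen and action, and stage 2 renders an image conditioned on that layout. The sketch below is illustrative only; the class and function names are hypothetical, and the two model calls are stubbed placeholders rather than UISim's actual layout predictor or diffusion renderer.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical abstract layout: a list of (element_type, bounding_box) pairs.
@dataclass
class UILayout:
    elements: List[Tuple[str, Tuple[int, int, int, int]]]

@dataclass
class UIState:
    image: bytes      # screen image (placeholder: raw bytes)
    layout: UILayout  # abstract layout of this state

def predict_next_layout(image: bytes, action: str) -> UILayout:
    """Stage 1 (stub): a layout-prediction model would infer the
    structure of the next UI state from the screen and the action."""
    # Placeholder rule: a tap opens a full-screen detail view.
    if action.startswith("tap"):
        return UILayout(elements=[("detail_view", (0, 0, 1080, 1920))])
    return UILayout(elements=[("home_screen", (0, 0, 1080, 1920))])

def synthesize_image(layout: UILayout, prev_image: bytes) -> bytes:
    """Stage 2 (stub): a diffusion-based generator would render a
    visually consistent screen conditioned on the predicted layout
    and the previous frame."""
    return b"rendered:" + layout.elements[0][0].encode()

def simulate_step(state: UIState, action: str) -> UIState:
    """One interactive transition: layout first, then pixels."""
    layout = predict_next_layout(state.image, action)  # stage 1
    image = synthesize_image(layout, state.image)      # stage 2
    return UIState(image=image, layout=layout)
```

Iterating `simulate_step` over a sequence of actions yields a temporally coherent chain of simulated UI states, which is what enables the rollout-style applications (task planning, synthetic data generation) discussed above.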