🤖 AI Summary
To address core challenges in GUI agent training (sparse supervision signals, poor scalability to large datasets, and coarse-grained modeling of user intent), the paper proposes the Stateful Screen Schema, presented as the first approach to explicitly encode user intent over time as a lightweight, structured, dynamic representation. Building on this schema, the authors introduce ScreenLLM, a set of multimodal large language models tailored for UI understanding and action prediction that integrates visual layout parsing, DOM-based semantic enhancement, and state-aware action trajectory modeling. Experiments on both open-source and proprietary models show significant improvements in action prediction accuracy, support cross-application and long-horizon interaction modeling, and establish a more robust and scalable paradigm for GUI intelligence.
📝 Abstract
Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparse supervision signals, scalability to large datasets, and the need for nuanced user understanding. We propose the stateful screen schema, an efficient representation of GUI interactions that captures key user actions and intentions over time. Building on this foundation, we introduce ScreenLLM, a set of multimodal large language models (MLLMs) tailored for advanced UI understanding and action prediction. Extensive experiments on both open-source and proprietary models show that ScreenLLM accurately models user behavior and predicts actions. Our work lays the foundation for scalable, robust, and intelligent GUI agents that enhance user interaction in diverse software environments.