ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address core challenges in GUI agent training (sparse supervision signals, poor scalability to large datasets, and coarse-grained modeling of user intent), this paper proposes the Stateful Screen Schema, the first approach to explicitly encode temporal user intent as a lightweight, structured, dynamic representation. Building on this schema, the authors introduce ScreenLLM, a dedicated multimodal large language model that integrates visual layout parsing, DOM-based semantic enhancement, and state-aware action trajectory modeling. In experiments spanning both open-source and proprietary models, the method achieves significant improvements in action prediction accuracy, supports cross-application and long-horizon interaction modeling, and establishes a more robust and scalable paradigm for GUI intelligence.

📝 Abstract
Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in supervision signals, scalability to large datasets, and the need for nuanced user understanding. We propose the stateful screen schema, an efficient representation of GUI interactions that captures key user actions and intentions over time. Building on this foundation, we introduce ScreenLLM, a set of multimodal large language models (MLLMs) tailored for advanced UI understanding and action prediction. Extensive experiments on both open-source and proprietary models show that ScreenLLM accurately models user behavior and predicts actions. Our work lays the foundation for scalable, robust, and intelligent GUI agents that enhance user interaction in diverse software environments.
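The abstract describes the stateful screen schema only at a high level: a compact, structured record of key user actions and intentions over time. As a rough illustration of what such a representation might look like, here is a minimal Python sketch. All class and field names (`StatefulScreenSchema`, `ScreenState`, `UIAction`, `recent_intent`, etc.) are hypothetical inventions for illustration; the paper does not publish this API, and its actual schema may differ substantially.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a "stateful screen schema": a lightweight,
# structured log of GUI state and user actions over time.
# All names here are illustrative, not taken from the paper.

@dataclass
class UIAction:
    kind: str          # e.g. "click", "type", "scroll"
    target: str        # identifier of the UI element acted on
    value: str = ""    # typed text or other payload, if any

@dataclass
class ScreenState:
    screen_id: str               # which screen / app view is shown
    salient_elements: list[str]  # key widgets visible on the screen

@dataclass
class StatefulScreenSchema:
    """Accumulates (state, action) pairs so that later prediction steps
    can condition on earlier user intent rather than raw pixels alone."""
    trajectory: list[tuple[ScreenState, UIAction]] = field(default_factory=list)

    def record(self, state: ScreenState, action: UIAction) -> None:
        self.trajectory.append((state, action))

    def recent_intent(self, k: int = 3) -> list[str]:
        # Summarize the last k actions as compact strings that a
        # multimodal LLM could consume as part of its prompt.
        return [f"{a.kind}({a.target})" for _, a in self.trajectory[-k:]]

schema = StatefulScreenSchema()
login = ScreenState("login", ["username", "password", "submit"])
schema.record(login, UIAction("type", "username", "alice"))
schema.record(login, UIAction("click", "submit"))
print(schema.recent_intent())  # ['type(username)', 'click(submit)']
```

The design choice sketched here, carrying forward a short structured history instead of full screenshots, is one plausible way to address the sparsity and scalability issues the abstract names, since the per-step representation stays small regardless of screen resolution.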
Problem

Research questions and friction points this paper is trying to address.

Efficient representation of GUI interactions over time
Scalable training for GUI agents with sparse signals
Accurate user behavior modeling and action prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stateful screen schema for GUI interactions
Multimodal LLMs for UI understanding
Accurate user behavior modeling and prediction