🤖 AI Summary
To address core challenges in GUI agent training (sparse supervision signals, poor scalability to large datasets, and coarse-grained modeling of user intent), the paper proposes the Stateful Screen Schema, presented as the first approach to explicitly encode user intent over time as a lightweight, structured, dynamic representation. Building on this schema, the authors introduce ScreenLLM, a set of multimodal large language models tailored for UI understanding and action prediction that integrates visual layout parsing, DOM-based semantic enhancement, and state-aware action trajectory modeling. Experiments on both open-source and proprietary models show significant improvements in action prediction accuracy, support cross-application and long-horizon interaction modeling, and establish a more robust and scalable paradigm for GUI intelligence.
📝 Abstract
Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparse supervision signals, scalability to large datasets, and the need for nuanced user understanding. We propose the stateful screen schema, an efficient representation of GUI interactions that captures key user actions and intentions over time. Building on this foundation, we introduce ScreenLLM, a set of multimodal large language models (MLLMs) tailored for advanced UI understanding and action prediction. Extensive experiments on both open-source and proprietary models show that ScreenLLM accurately models user behavior and predicts actions. Our work lays the foundation for scalable, robust, and intelligent GUI agents that enhance user interaction in diverse software environments.