GUI-PRA: Process Reward Agent for GUI Tasks

📅 2025-09-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) performing long-horizon GUI tasks suffer from the "lost-in-the-middle" problem and poor awareness of dynamic UI changes, both of which lead to inaccurate process reward modeling. To address this, the paper proposes GUI-PRA, a process reward agent that replaces a standard process reward model (PRM). First, it introduces a dynamic memory mechanism that mitigates context interference via relevance-based retrieval and progressive summarization of lengthy interaction histories. Second, it designs an adaptive UI perception mechanism that explicitly models interface state transitions, enabling temporally aligned action–feedback pairing for dynamic reward assessment. Experiments show that GUI-PRA significantly improves task success rates of GUI agents on complex, long-horizon benchmarks, and that it delivers more accurate, context-sensitive, and temporally consistent process rewards than prior PRMs.
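The dynamic memory mechanism can be pictured with a minimal sketch: retrieve the top-k history steps most relevant to the current step, and fold older steps into a rolling summary so the judge never sees the full raw history. Everything here is illustrative, not the paper's implementation: the class name `DynamicMemory` and the word-overlap relevance score are assumptions standing in for learned retrieval and LLM-based summarization.

```python
from dataclasses import dataclass, field

@dataclass
class DynamicMemory:
    """Hypothetical sketch of a dynamic memory for a GUI judge agent:
    relevance-based retrieval plus progressive summarization."""
    top_k: int = 3
    history: list = field(default_factory=list)  # raw recent step descriptions
    summary: str = ""                            # condensed distant past

    def add_step(self, step: str) -> None:
        self.history.append(step)

    def _relevance(self, step: str, query: str) -> float:
        # Toy relevance score: word overlap stands in for an embedding retriever.
        s, q = set(step.lower().split()), set(query.lower().split())
        return len(s & q) / (len(q) or 1)

    def retrieve(self, query: str) -> list:
        # Relevance-based retrieval: top-k steps most related to the current step.
        ranked = sorted(self.history, key=lambda s: self._relevance(s, query),
                        reverse=True)
        return ranked[: self.top_k]

    def summarize(self, max_raw: int = 5) -> None:
        # Progressive summarization: once raw history outgrows max_raw, fold the
        # oldest steps into one condensed line (a real system would use an LLM).
        if len(self.history) > max_raw:
            old, self.history = self.history[:-max_raw], self.history[-max_raw:]
            prefix = self.summary + " | " if self.summary else ""
            self.summary = prefix + f"{len(old)} earlier steps: " + "; ".join(old)

    def context(self, query: str) -> str:
        # What the judge actually sees: summary of the distant past plus the
        # retrieved relevant recent steps, instead of the entire raw history.
        return "\n".join(filter(None, [self.summary] + self.retrieve(query)))
```

In use, the agent would call `add_step`/`summarize` after each GUI action and build its evaluation prompt from `context(current_step)`, keeping the window focused on relevant evidence.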

📝 Abstract
Graphical User Interface (GUI) Agents powered by Multimodal Large Language Models (MLLMs) show significant potential for automating tasks. However, they often struggle with long-horizon tasks, leading to frequent failures. Process Reward Models (PRMs) are a promising solution, as they can guide these agents with crucial process signals during inference. Nevertheless, their application to the GUI domain presents unique challenges. When processing dense artificial inputs with long history data, PRMs suffer from a "lost in the middle" phenomenon, where the overwhelming historical context compromises the evaluation of the current step. Furthermore, standard PRMs lack awareness of GUI changes, providing static evaluations that are disconnected from the dynamic consequences of actions, a critical mismatch with the inherently dynamic nature of GUI tasks. In response to these challenges, we introduce GUI-PRA (Process Reward Agent for GUI Tasks), a judge agent designed to provide better process rewards than a standard PRM by intelligently processing historical context and actively perceiving UI state changes. Specifically, to directly combat the "lost in the middle" phenomenon, we introduce a dynamic memory mechanism consisting of two core components: a Relevance-based Retrieval Module to actively fetch pertinent information from long histories and a Progressive Summarization Module to dynamically condense growing interaction data, ensuring the model focuses on relevant context. Moreover, to address the lack of awareness of GUI changes, we introduce an Adaptive UI Perception mechanism. This mechanism enables the agent to reason about UI state changes and dynamically select the most appropriate tool to gather grounded visual evidence, ensuring its evaluation is always informed by the current UI context.
Problem

Research questions and friction points this paper is trying to address.

Addressing the "lost in the middle" phenomenon in GUI task histories
Overcoming the lack of GUI-change awareness in reward models
Providing dynamic process rewards for long-horizon GUI tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic memory mechanism to handle long histories
Relevance-based retrieval for pertinent information extraction
Adaptive UI perception for dynamic state awareness
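The adaptive UI perception idea above can be sketched as tool selection over before/after UI states: structural actions are checked against an element tree, visual actions against pixel-level evidence, and the result is a temporally aligned action–feedback pair for the judge to score. All names (`UIState`, `perceive_change`, the hash-based screenshot diff) are hypothetical stand-ins, not the paper's actual tools.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UIState:
    screenshot_hash: str   # stands in for raw pixels
    a11y_nodes: frozenset  # labels of visible UI elements

def perceive_change(before: UIState, after: UIState, action: str) -> dict:
    """Hypothetical adaptive UI perception: pick a grounding tool based on
    the action, then pair the action with its observed on-screen effect."""
    if action.startswith(("click", "type")):
        # Structural actions: diff the accessibility tree for grounded evidence.
        tool = "a11y_diff"
        appeared = after.a11y_nodes - before.a11y_nodes
        disappeared = before.a11y_nodes - after.a11y_nodes
        changed = bool(appeared or disappeared)
        evidence = {"appeared": sorted(appeared), "disappeared": sorted(disappeared)}
    else:
        # Visual actions (scroll, drag, ...): compare screenshots instead.
        tool = "screenshot_diff"
        changed = before.screenshot_hash != after.screenshot_hash
        evidence = {}
    # Temporally aligned action–feedback pair the judge can score.
    return {"action": action, "tool": tool, "ui_changed": changed,
            "evidence": evidence}
```

The design point is that the reward judgment is always conditioned on what actually changed on screen, rather than on a static snapshot of the trajectory.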