GUI-PRA: Process Reward Agent for GUI Tasks

📅 2025-09-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) performing long-horizon GUI tasks suffer from the "lost-in-the-middle" problem and poor awareness of dynamic UI changes, both of which lead to inaccurate process reward modeling. To address this, the paper proposes GUI-PRA, a process reward agent that replaces a standard process reward model (PRM). First, it introduces a dynamic memory mechanism that mitigates context interference via relevance-based retrieval and progressive summarization of lengthy interaction histories. Second, it designs an adaptive UI perception mechanism that explicitly models interface state transitions, enabling temporally aligned action–feedback pairing for dynamic reward assessment. Experiments show that GUI-PRA significantly improves task success rates of GUI agents on complex, long-horizon benchmarks, and that it delivers more accurate, context-sensitive, and temporally consistent process rewards than prior PRMs.
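The dynamic memory mechanism can be pictured with a minimal sketch: retrieve the top-k history steps most relevant to the current step, and fold older steps into a rolling summary so the judge never sees the full raw history. Everything here is illustrative, not the paper's implementation: the class name `DynamicMemory` and the word-overlap relevance score are assumptions standing in for learned retrieval and LLM-based summarization.

```python
from dataclasses import dataclass, field

@dataclass
class DynamicMemory:
    """Hypothetical sketch of a dynamic memory for a GUI judge agent:
    relevance-based retrieval plus progressive summarization."""
    top_k: int = 3
    history: list = field(default_factory=list)  # raw recent step descriptions
    summary: str = ""                            # condensed distant past

    def add_step(self, step: str) -> None:
        self.history.append(step)

    def _relevance(self, step: str, query: str) -> float:
        # Toy relevance score: word overlap stands in for an embedding retriever.
        s, q = set(step.lower().split()), set(query.lower().split())
        return len(s & q) / (len(q) or 1)

    def retrieve(self, query: str) -> list:
        # Relevance-based retrieval: top-k steps most related to the current step.
        ranked = sorted(self.history, key=lambda s: self._relevance(s, query),
                        reverse=True)
        return ranked[: self.top_k]

    def summarize(self, max_raw: int = 5) -> None:
        # Progressive summarization: once raw history outgrows max_raw, fold the
        # oldest steps into one condensed line (a real system would use an LLM).
        if len(self.history) > max_raw:
            old, self.history = self.history[:-max_raw], self.history[-max_raw:]
            prefix = self.summary + " | " if self.summary else ""
            self.summary = prefix + f"{len(old)} earlier steps: " + "; ".join(old)

    def context(self, query: str) -> str:
        # What the judge actually sees: summary of the distant past plus the
        # retrieved relevant recent steps, instead of the entire raw history.
        return "\n".join(filter(None, [self.summary] + self.retrieve(query)))
```

In use, the agent would call `add_step`/`summarize` after each GUI action and build its evaluation prompt from `context(current_step)`, keeping the window focused on relevant evidence.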

📝 Abstract
Graphical User Interface (GUI) Agents powered by Multimodal Large Language Models (MLLMs) show significant potential for automating tasks. However, they often struggle with long-horizon tasks, leading to frequent failures. Process Reward Models (PRMs) are a promising solution, as they can guide these agents with crucial process signals during inference. Nevertheless, their application to the GUI domain presents unique challenges. When processing dense artificial inputs with long history data, PRMs suffer from a "lost in the middle" phenomenon, where the overwhelming historical context compromises the evaluation of the current step. Furthermore, standard PRMs lack awareness of GUI changes, providing static evaluations that are disconnected from the dynamic consequences of actions, a critical mismatch with the inherently dynamic nature of GUI tasks. In response to these challenges, we introduce GUI-PRA (Process Reward Agent for GUI Tasks), a judge agent designed to provide better process rewards than a standard PRM by intelligently processing historical context and actively perceiving UI state changes. Specifically, to directly combat the "lost in the middle" phenomenon, we introduce a dynamic memory mechanism consisting of two core components: a Relevance-based Retrieval Module to actively fetch pertinent information from long histories and a Progressive Summarization Module to dynamically condense growing interaction data, ensuring the model focuses on relevant context. Moreover, to address the lack of awareness of GUI changes, we introduce an Adaptive UI Perception mechanism. This mechanism enables the agent to reason about UI state changes and dynamically select the most appropriate tool to gather grounded visual evidence, ensuring its evaluation is always informed by the current UI context.
Problem

Research questions and friction points this paper is trying to address.

Addressing the "lost in the middle" phenomenon in GUI task histories
Overcoming the lack of GUI-change awareness in reward models
Providing dynamic process rewards for long-horizon GUI tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic memory mechanism to handle long histories
Relevance-based retrieval for pertinent information extraction
Adaptive UI perception for dynamic state awareness
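The adaptive UI perception idea above can be sketched as tool selection over before/after UI states: structural actions are checked against an element tree, visual actions against pixel-level evidence, and the result is a temporally aligned action–feedback pair for the judge to score. All names (`UIState`, `perceive_change`, the hash-based screenshot diff) are hypothetical stand-ins, not the paper's actual tools.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UIState:
    screenshot_hash: str   # stands in for raw pixels
    a11y_nodes: frozenset  # labels of visible UI elements

def perceive_change(before: UIState, after: UIState, action: str) -> dict:
    """Hypothetical adaptive UI perception: pick a grounding tool based on
    the action, then pair the action with its observed on-screen effect."""
    if action.startswith(("click", "type")):
        # Structural actions: diff the accessibility tree for grounded evidence.
        tool = "a11y_diff"
        appeared = after.a11y_nodes - before.a11y_nodes
        disappeared = before.a11y_nodes - after.a11y_nodes
        changed = bool(appeared or disappeared)
        evidence = {"appeared": sorted(appeared), "disappeared": sorted(disappeared)}
    else:
        # Visual actions (scroll, drag, ...): compare screenshots instead.
        tool = "screenshot_diff"
        changed = before.screenshot_hash != after.screenshot_hash
        evidence = {}
    # Temporally aligned action–feedback pair the judge can score.
    return {"action": action, "tool": tool, "ui_changed": changed,
            "evidence": evidence}
```

The design point is that the reward judgment is always conditioned on what actually changed on screen, rather than on a static snapshot of the trajectory.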