🤖 AI Summary
This work proposes an efficient reinforcement learning framework built on multimodal large language models (MLLMs) to address the high costs of data collection and policy optimization for GUI agents in non-stationary environments. The approach introduces an agentic-Q estimation mechanism that leverages self-generated trajectories to decouple policy updates from environment interaction, enabling step-wise policy optimization that substantially reduces both data and computational overhead. Experiments show that the proposed method endows the Ovis2.5-9B model with strong GUI interaction capabilities, outperforming larger existing models on navigation and grounding benchmarks, thereby validating its effectiveness and scalability.
📝 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interfaces (GUIs). Nevertheless, in real-world applications, GUI agents often face non-stationary environments, leading to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents that consists of two components: agentic-Q estimation and step-wise policy optimization. The former optimizes a Q-model that generates step-wise values evaluating the contribution of a given action to task completion. The latter takes step-wise samples from the state-action trajectory as input and optimizes the policy via reinforcement learning with our agentic-Q model. Note that (i) all state-action trajectories are produced by the policy itself, so data collection costs remain manageable; and (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving remarkable performance on GUI navigation and grounding benchmarks and even surpassing larger-scale contenders.
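To make the two components concrete, the loop the abstract describes can be sketched in miniature. Everything below is an illustrative assumption rather than the paper's implementation: the real policy and Q-model are MLLMs, whereas here a softmax over three toy GUI actions stands in for the policy, a hypothetical `agentic_q` function stands in for the step-wise value model, and a REINFORCE-style update stands in for the RL algorithm. The sketch shows the two properties the abstract highlights: trajectories are self-generated by the policy, and the update step consumes only stored (state, action) samples scored by the Q-model, with no further environment interaction.

```python
import math
import random

random.seed(0)

# Toy stand-in for the MLLM policy: a softmax over three GUI actions.
ACTIONS = ["click", "scroll", "type"]
logits = [0.0, 0.0, 0.0]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def agentic_q(state, action):
    # Hypothetical step-wise value: how much this action contributes to
    # task completion in this state. Here "click" is always best.
    return 1.0 if ACTIONS[action] == "click" else -0.5

# (i) Self-generated trajectories: the policy itself collects the
# state-action samples, so no external demonstrations are needed.
trajectory = []
state = "home_screen"
for _ in range(32):
    a = sample_action(softmax(logits))
    trajectory.append((state, a))

# (ii) Step-wise policy optimization, decoupled from the environment:
# each stored sample is scored by the Q-model and used for a
# REINFORCE-style gradient step; the environment is never queried again.
lr = 0.5
for state, a in trajectory:
    probs = softmax(logits)
    q = agentic_q(state, a)
    for i in range(len(logits)):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * q * grad

probs = softmax(logits)
print(ACTIONS[max(range(3), key=lambda i: probs[i])])
```

After training on its own scored trajectory, the toy policy concentrates probability on the action the Q-model rates highest, mirroring how the agentic-Q values steer the policy toward task-completing actions without fresh environment rollouts.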