Enhance Mobile Agents Thinking Process Via Iterative Preference Learning

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak reasoning generalization of existing Vision-Language Model (VLM)-based mobile GUI agents, which stems from the scarcity of Chain of Action-Planning Thoughts (CoaT) trajectories, this paper proposes Iterative Preference Learning (IPL) with Thinking-level Direct Preference Optimization (T-DPO). Methodologically: (1) a CoaT tree is constructed via iterative trajectory sampling; (2) an annotation-free, rule-based process reward scores the leaf nodes, and its feedback is backpropagated to derive T-DPO preference pairs; (3) a three-stage instruction evolution leverages GPT-4o to generate diverse Q&A pairs from real mobile UI screenshots, improving generality and layout understanding while preventing overfitting during warm-up supervised fine-tuning; and (4) the agent is optimized end to end with supervised fine-tuning followed by T-DPO. Evaluated on three standard mobile GUI-agent benchmarks, the resulting agent, MobileIPL, achieves state-of-the-art performance, significantly outperforming strong baselines such as OS-ATLAS and UI-TARS, while demonstrating superior cross-domain generalization.

📝 Abstract
The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRMs). To address these problems, we propose Iterative Preference Learning (IPL), which constructs a CoaT tree through iterative sampling, scores leaf nodes using a rule-based reward, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution, which leverages GPT-4o to generate diverse Q&A pairs based on real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard mobile GUI-agent benchmarks demonstrate that our agent, MobileIPL, outperforms strong baselines, including continual-pretraining models such as OS-ATLAS and UI-TARS, achieving state-of-the-art performance and showing strong generalization to out-of-domain scenarios.
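The sampling, scoring, and backpropagation loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the node representation, the toy sampling policy, and the gold-matching reward rule are all assumptions standing in for the VLM decoder and the paper's rule-based process reward.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    step: str                               # one CoaT thinking step (toy string stand-in)
    children: list = field(default_factory=list)
    value: float = 0.0                      # backpropagated reward estimate

def sample_steps(prefix, k, rng):
    """Stand-in for the agent policy: sample k candidate next thinking steps.
    A real agent would decode these from the VLM given the screenshot and history."""
    return [f"{prefix}-s{rng.randint(0, 9)}" for _ in range(k)]

def rule_reward(leaf_step, gold_suffix="-s3"):
    """Annotation-free rule-based reward: 1.0 if the leaf matches a gold
    pattern (hypothetical rule; the paper's rules score final actions)."""
    return 1.0 if leaf_step.endswith(gold_suffix) else 0.0

def expand(node, depth, k, rng):
    """Build the CoaT tree by iterative sampling; score leaves and
    backpropagate the mean child value to each internal node."""
    if depth == 0:
        node.value = rule_reward(node.step)
        return node.value
    node.children = [Node(s) for s in sample_steps(node.step, k, rng)]
    node.value = sum(expand(c, depth - 1, k, rng) for c in node.children) / k
    return node.value

def tdpo_pairs(node, pairs):
    """At every branch point, pair the highest- and lowest-valued sibling
    steps as a thinking-level (chosen, rejected) preference pair."""
    if not node.children:
        return pairs
    best = max(node.children, key=lambda c: c.value)
    worst = min(node.children, key=lambda c: c.value)
    if best.value > worst.value:                # skip ties: no preference signal
        pairs.append((best.step, worst.step))
    for c in node.children:
        tdpo_pairs(c, pairs)
    return pairs

rng = random.Random(0)
root = Node("root")
expand(root, depth=2, k=4, rng=rng)
pairs = tdpo_pairs(root, [])
```

The key design point is that only leaves are scored, so no process-level annotation is needed; intermediate thinking steps inherit credit through backpropagated values, and preference pairs are mined wherever sibling values diverge.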
Problem

Research questions and friction points this paper is trying to address.

Improving reasoning performance of VLM-based mobile agents in GUI tasks
Addressing scarcity of diverse CoaT trajectories for better generalization
Reducing reliance on expensive process-level annotations for training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative Preference Learning for CoaT-tree construction
Thinking-level Direct Preference Optimization pairs
Three-stage instruction evolution with GPT-4o
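The mined (chosen, rejected) pairs are then used in a DPO-style objective. Below is a minimal numeric sketch of the standard per-pair DPO loss; the thinking-level variant applies it at the granularity of reasoning steps rather than whole responses, and the variable names and beta value here are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on one preference pair: increase the policy's
    margin for the chosen step over the rejected one, measured relative to a
    frozen reference model. All inputs are sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss shrinks as the policy assigns relatively more probability to the chosen thinking step.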