🤖 AI Summary
Existing GUI task automation approaches suffer from two key limitations: insufficient cross-model coordination and inefficient utilization of synthetic data. To address these, we propose Co-EPG—a novel framework that establishes, for the first time, a closed-loop co-evolutionary mechanism between planning and grounding capabilities. Co-EPG employs self-play-driven iterative training to enable mutual refinement of both components without requiring external annotations. Its core techniques include Group Relative Policy Optimization (GRPO), reward-guided policy exploration, joint optimization of the grounding model, and synthetic data distillation. Evaluated on the Multimodal-Mind2Web and AndroidControl benchmarks, Co-EPG surpasses prior state-of-the-art methods after only three training iterations, demonstrating robust and efficient continual improvement. This work introduces a scalable, annotation-free co-evolution paradigm for autonomous GUI agents.
📝 Abstract
Graphical User Interface (GUI) task automation constitutes a critical frontier in artificial intelligence research. While effective GUI agents synergistically integrate planning and grounding capabilities, current methodologies exhibit two fundamental limitations: (1) insufficient exploitation of cross-model synergies, and (2) generation of synthetic data that is not fully exploited during training. To address these challenges, we propose Co-EPG, a self-iterative training framework for Co-Evolution of Planning and Grounding. Co-EPG establishes an iterative positive feedback loop in which the planning model explores superior strategies under grounding-based reward guidance via Group Relative Policy Optimization (GRPO), generating diverse data to optimize the grounding model. The optimized grounding model, in turn, provides more effective rewards for subsequent GRPO training of the planning model, fostering continuous improvement. Co-EPG thus enables iterative enhancement of agent capabilities through self-play optimization and training-data distillation. On the Multimodal-Mind2Web and AndroidControl benchmarks, our framework outperforms existing state-of-the-art methods after just three iterations without requiring external data. The agent consistently improves with each iteration, demonstrating robust self-enhancement capabilities. This work establishes a novel training paradigm for GUI agents, shifting from isolated optimization to an integrated, self-driven co-evolution approach.
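The feedback loop described above can be sketched in miniature. The toy Python sketch below is not the authors' implementation: `Agent`, `grpo_step`, and `distill_step` are hypothetical placeholders, and a single scalar `skill` stands in for model capability. It only illustrates the loop's structure: the planner improves under reward signals whose quality depends on the grounder, the planner's rollouts become synthetic training data for the grounder, and each iteration strengthens the reward signal for the next.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    skill: float  # abstract scalar proxy for model capability


def grpo_step(planner: Agent, grounder: Agent, lr: float = 0.1) -> list[float]:
    """One GRPO-style planner update (toy stand-in, not real GRPO).

    A stronger grounder yields a more informative reward signal,
    so the planner's gain scales with the grounder's skill.
    """
    planner.skill += lr * grounder.skill
    # The planner's rollouts double as synthetic training data.
    return [planner.skill] * 4  # stand-in for generated trajectories


def distill_step(grounder: Agent, trajectories: list[float], lr: float = 0.05) -> None:
    """Fine-tune the grounder on planner-generated data (toy stand-in)."""
    grounder.skill += lr * (sum(trajectories) / len(trajectories))


def co_evolve(iterations: int = 3) -> list[tuple[float, float]]:
    """Run the closed co-evolution loop and record (planner, grounder) skill."""
    planner, grounder = Agent(skill=1.0), Agent(skill=1.0)
    history = []
    for _ in range(iterations):
        data = grpo_step(planner, grounder)   # planner explores under grounder reward
        distill_step(grounder, data)          # grounder trains on planner rollouts
        history.append((planner.skill, grounder.skill))
    return history
```

Running `co_evolve(3)` shows both skill values rising monotonically across iterations, mirroring the paper's claim of continual mutual improvement over three training iterations.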