AI Summary
Large language models (LLMs) suffer from insufficient decision-making capability in complex mobile GUI tasks due to a lack of domain-specific application knowledge. To address this, we propose a knowledge graph-driven retrieval-augmented generation (RAG) framework that transforms sparse, low-quality UI transition graphs (UTGs) into structured vector knowledge graphs, incorporates intent-guided multi-hop retrieval, and enables real-time navigation path planning. Our contributions are fourfold: (1) the first UTG-enhanced RAG architecture tailored for the Chinese mobile ecosystem; (2) two cross-application benchmark suites; (3) state-of-the-art performance on mainstream mobile apps, with a 75.8% task success rate, 84.6% decision accuracy, and an average of 4.1 steps per task; and (4) empirical validation of strong generalizability to web and desktop GUI environments.
Abstract
Despite recent progress, Graphical User Interface (GUI) agents powered by Large Language Models (LLMs) struggle with complex mobile tasks due to limited app-specific knowledge. While UI Transition Graphs (UTGs) offer structured navigation representations, they are underutilized because of poor extraction and inefficient integration. We introduce KG-RAG, a Knowledge Graph-driven Retrieval-Augmented Generation framework that transforms fragmented UTGs into structured vector databases for efficient real-time retrieval. By leveraging an intent-guided LLM search method, KG-RAG generates actionable navigation paths, enhancing agent decision-making. Experiments across diverse mobile apps show that KG-RAG outperforms existing methods, achieving a 75.8% success rate (8.9% improvement over AutoDroid) and 84.6% decision accuracy (8.1% improvement) while reducing average task steps from 4.5 to 4.1. Additionally, we present KG-Android-Bench and KG-Harmony-Bench, two benchmarks tailored to the Chinese mobile ecosystem for future research. Finally, KG-RAG transfers to web and desktop environments (+40% success rate on Weibo-web; +20% on QQ Music-desktop), and a UTG cost ablation shows that accuracy saturates at about 4 hours per complex app, enabling practical deployment trade-offs.
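To make the retrieve-then-plan idea concrete, the following is a minimal illustrative sketch, not the KG-RAG implementation. It assumes a toy UTG stored as an adjacency map, replaces vector similarity with a simple word-overlap score, and replaces the intent-guided LLM search with a breadth-first search. All screen names, descriptions, and function names (`UTG`, `DESCRIPTIONS`, `score`, `retrieve_target`, `plan_path`) are hypothetical.

```python
from collections import deque

# Hypothetical toy UTG: screen -> list of (action, next_screen) edges.
UTG = {
    "home": [("tap 'Search'", "search"), ("tap 'Me'", "profile")],
    "search": [("tap first result", "detail")],
    "profile": [("tap 'Settings'", "settings")],
    "settings": [("tap 'Privacy'", "privacy")],
    "detail": [],
    "privacy": [],
}

# Hypothetical screen descriptions; a real system would embed these as vectors.
DESCRIPTIONS = {
    "home": "main feed home page",
    "search": "search box for songs and users",
    "profile": "my account profile page",
    "settings": "app settings list",
    "detail": "item detail page",
    "privacy": "privacy and permission settings",
}


def score(intent, text):
    """Jaccard word overlap, used here as a stand-in for vector similarity."""
    a, b = set(intent.lower().split()), set(text.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_target(intent):
    """Return the screen whose description best matches the user intent."""
    return max(DESCRIPTIONS, key=lambda s: score(intent, DESCRIPTIONS[s]))


def plan_path(start, target):
    """Breadth-first search over the UTG; returns the action list or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        screen, actions = queue.popleft()
        if screen == target:
            return actions
        for action, nxt in UTG.get(screen, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None  # target is unreachable from start


if __name__ == "__main__":
    target = retrieve_target("change privacy permission settings")
    print(target, plan_path("home", target))
```

In this sketch the returned action list plays the role of the navigation path that would be handed to the agent as context; a production system would presumably retrieve over learned embeddings and let the LLM rank candidate paths rather than take the first shortest one.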