SecAgent: Efficient Mobile GUI Agent with Semantic Context

📅 2026-03-09
🤖 AI Summary
This work addresses two limitations of existing mobile GUI agents: the scarcity of high-quality non-English data and the inefficient representation of historical interaction context. The authors construct a human-verified mobile GUI dataset for the Chinese ecosystem, comprising 18k grounding samples and 121k navigation steps across 44 applications, along with a Chinese navigation benchmark with multi-choice action annotations. They further propose a semantic context mechanism that compresses historical screenshots and actions into concise natural language summaries, preserving task-relevant information while reducing computational cost. SecAgent, a 3B-parameter agent trained via supervised and reinforcement fine-tuning, outperforms similar-scale baselines on both the new benchmark and public navigation benchmarks and reaches performance comparable to 7B-8B models.
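
As an illustration of what multi-choice action annotations make possible, the sketch below counts a predicted step as correct when it matches any one of several acceptable actions. The `Action` type, its fields, and the exact-match rule are assumptions made for this example; the summary does not describe the benchmark's action schema or matching criteria.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Action:
    """Hypothetical action record; the real benchmark schema is not given here."""
    kind: str          # e.g. "click", "type", "scroll"
    target: str = ""   # e.g. an element label or the typed text


def step_is_correct(predicted: Action, accepted: Sequence[Action]) -> bool:
    """A step counts as correct if it equals any annotated alternative."""
    return any(predicted == candidate for candidate in accepted)


def step_accuracy(predictions: Sequence[Action],
                  annotations: Sequence[Sequence[Action]]) -> float:
    """Fraction of steps whose prediction matches one accepted action."""
    if len(predictions) != len(annotations):
        raise ValueError("predictions and annotations must align step by step")
    if not predictions:
        return 0.0
    hits = sum(step_is_correct(p, a) for p, a in zip(predictions, annotations))
    return hits / len(predictions)
```

With a single reference action per step, a valid alternative route through an app would be scored as an error; accepting a set of actions per step avoids that, which is presumably the motivation for the multi-choice design.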

📝 Abstract
Mobile Graphical User Interface (GUI) agents powered by multimodal large language models have demonstrated promising capabilities in automating complex smartphone tasks. However, existing approaches face two critical limitations: the scarcity of high-quality multilingual datasets, particularly for non-English ecosystems, and inefficient history representation methods. To address these challenges, we present SecAgent, an efficient mobile GUI agent at 3B scale. We first construct a human-verified Chinese mobile GUI dataset with 18k grounding samples and 121k navigation steps across 44 applications, along with a Chinese navigation benchmark featuring multi-choice action annotations. Building upon this dataset, we propose a semantic context mechanism that distills history screenshots and actions into concise, natural language summaries, significantly reducing computational costs while preserving task-relevant information. Through supervised and reinforcement fine-tuning, SecAgent outperforms similar-scale baselines and achieves performance comparable to 7B-8B models on our and public navigation benchmarks. We will open-source the training dataset, benchmark, model, and code to advance research in multilingual mobile GUI automation.
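
The semantic context idea described in the abstract can be sketched roughly as follows: rather than feeding earlier screenshots back to the model, the agent carries a running list of short text summaries and sees only the current screenshot. This is a minimal sketch under stated assumptions, not the authors' implementation; `SemanticContext`, `run_step`, the prompt layout, and the `policy` callable (which stands in for the model and is assumed to return both an action and a one-sentence summary) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SemanticContext:
    """Keeps interaction history as text summaries instead of past screenshots."""
    task: str
    summaries: List[str] = field(default_factory=list)

    def add_step(self, summary: str) -> None:
        # Store one short natural language description per completed step.
        self.summaries.append(summary.strip())

    def to_prompt(self) -> str:
        # Assumed prompt layout: task first, then numbered step summaries.
        lines = [f"Task: {self.task}", "History:"]
        if not self.summaries:
            lines.append("(none)")
        for index, summary in enumerate(self.summaries, start=1):
            lines.append(f"{index}. {summary}")
        return "\n".join(lines)


# Hypothetical model interface: (text prompt, current screenshot) -> (action, summary).
Policy = Callable[[str, bytes], Tuple[str, str]]


def run_step(context: SemanticContext, screenshot: bytes, policy: Policy) -> str:
    """Run one agent step using only the current screenshot plus text history."""
    action, summary = policy(context.to_prompt(), screenshot)
    context.add_step(summary)
    return action
```

Under this design the history grows by one short sentence per step, so the per-step input contains a single image regardless of episode length. That is the kind of saving the abstract attributes to the mechanism, although the actual summary format and how summaries are produced are not specified here.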
Problem

Research questions and friction points this paper is trying to address.

mobile GUI agent
multilingual dataset scarcity
inefficient history representation
non-English ecosystems
smartphone task automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

mobile GUI agent
semantic context
multilingual dataset
efficient history representation
reinforcement fine-tuning