🤖 AI Summary
This work addresses three failure modes that commonly impair multimodal large language models in long-horizon GUI automation: memory degradation, progress confusion, and math hallucination. To mitigate them, the authors propose UI-Copilot, a collaborative framework in which the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance. A memory decoupling mechanism separates persistent observations from transient execution context, and the agent selectively invokes the copilot as a Retriever for memory retrieval or as a Calculator for numerical computation. The framework is trained with Tool-Integrated Policy Optimization (TIPO), which optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on MemGUI-Bench and delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model.
📝 Abstract
MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities and suffer from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on the challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot's strong generalization to real-world GUI tasks.
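The division of labor described above can be illustrated with a minimal sketch. It is not the authors' implementation: every name here (`DecoupledMemory`, `retrieve`, `calculate`, `copilot`) is hypothetical, the Retriever is replaced by a simple word-overlap lookup, and the Calculator by a small arithmetic evaluator. The sketch only shows the idea of keeping persistent observations apart from a short transient context and routing a tool call to one of two helpers.

```python
import ast
import operator
from dataclasses import dataclass, field


@dataclass
class DecoupledMemory:
    """Hypothetical store: persistent observations vs. transient context."""
    persistent: list[str] = field(default_factory=list)  # kept for the whole task
    transient: list[str] = field(default_factory=list)   # only the latest steps

    def observe(self, note: str) -> None:
        self.persistent.append(note)

    def step(self, action: str, window: int = 2) -> None:
        # Keep only the last `window` actions (assumes window >= 1).
        self.transient.append(action)
        self.transient = self.transient[-window:]


def retrieve(memory: DecoupledMemory, query: str) -> str:
    """Stand-in Retriever: return the observation sharing most words with the query."""
    words = set(query.lower().split())
    return max(
        memory.persistent,
        key=lambda note: len(words & set(note.lower().split())),
        default="",
    )


_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expr: str) -> float:
    """Stand-in Calculator: evaluate +, -, *, / over numeric literals only."""
    def ev(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")

    return ev(ast.parse(expr, mode="eval").body)


def copilot(tool: str, arg: str, memory: DecoupledMemory):
    """Route a tool call chosen by the policy agent (tool names are assumptions)."""
    if tool == "retriever":
        return retrieve(memory, arg)
    if tool == "calculator":
        return calculate(arg)
    raise ValueError(f"unknown tool: {tool}")
```

In this sketch the policy agent would emit either a GUI action or a `(tool, arg)` pair; how UI-Copilot actually encodes that choice, and how TIPO trains it, is not specified in the abstract.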