AI Summary
To address the challenge of efficient UI control on resource-constrained mobile devices, this paper proposes LiMAC, a lightweight multimodal mobile control architecture. LiMAC introduces a compact on-device Action Transformer (AcT) and integrates it with fine-tuned vision-language models (VLMs), specifically Florence-2 and Qwen2-VL, to jointly model past screenshots, the corresponding UI hierarchy trees, and textual task instructions. By encoding UI trees alongside visual features and applying multimodal sequence modeling, LiMAC achieves low-latency, high-accuracy action generation. Evaluated on open-source benchmarks, LiMAC attains an overall action accuracy of 86.3%, outperforming fine-tuned VLMs by up to 19% and GPT-4o prompt-engineering baselines by up to 42%. It surpasses existing on-device approaches, offering a lightweight, real-time, and accurate UI automation approach for resource-limited edge devices.
Abstract
This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interaction and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence-2 and Qwen2-VL. It also significantly outperforms prompt-engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.
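The abstract describes a hybrid decision loop: a small on-device model (AcT) handles each step, with a fine-tuned VLM brought in where needed. The following is a minimal sketch of what such a routing step could look like; the function names, the `Observation` fields, and the rule that text-producing actions (e.g. typing text or naming an app to open) are delegated to the VLM while gesture-style actions are resolved by AcT alone are illustrative assumptions, not the paper's exact interface.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes  # raw screen capture for this step
    ui_tree: str       # serialised UI hierarchy for this step

def act_predict_type(goal: str, history: list) -> str:
    # Placeholder for the small Action Transformer (AcT): in the real
    # system this is a learned model over the goal and past observations.
    return "click" if history else "open-app"

def vlm_generate_text(goal: str, obs: Observation) -> str:
    # Placeholder for the fine-tuned VLM that produces free-form text
    # arguments (hypothetical stand-in, not the paper's API).
    return "settings"

def limac_step(goal: str, history: list, obs: Observation) -> dict:
    """One decision step: AcT picks the action type; actions that need
    generated text are delegated to the VLM for their arguments."""
    action_type = act_predict_type(goal, history)
    if action_type in ("input-text", "open-app"):
        return {"type": action_type, "text": vlm_generate_text(goal, obs)}
    # Gesture-style actions (click, scroll, ...) are specified without
    # invoking the larger VLM, keeping the common path lightweight.
    return {"type": action_type, "target": "ui-element-id"}
```

The design rationale implied by the abstract is latency: most steps never touch the larger VLM, so the frequent path runs entirely through the compact transformer.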