Lightweight Neural App Control

📅 2024-10-23
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of efficient UI control on resource-constrained mobile devices, this paper proposes LiMAC, a lightweight multimodal mobile control architecture. LiMAC introduces a compact on-device Action Transformer (AcT) and integrates it with fine-tuned vision-language models (VLMs), specifically Florence-2 and Qwen2-VL, to jointly model past screenshots, the current UI hierarchy tree, and the textual task instruction. By encoding UI trees together with visual features and applying multimodal sequence modeling, LiMAC achieves low-latency, high-accuracy action generation. Evaluated on public mobile control benchmarks, LiMAC attains an overall action accuracy of 86.3%, outperforming fine-tuned VLMs by up to 19% and GPT-4o prompt-engineering baselines by up to 42%, and pointing toward lightweight, real-time, accurate UI automation on resource-limited edge devices.

๐Ÿ“ Abstract
This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interactions and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.
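The hybrid design described above can be pictured as a routing step: a small on-device Action Transformer (AcT) handles every decision, and only actions that require free-form text are delegated to the larger fine-tuned VLM. The sketch below is a minimal illustration of that idea; the function names, action-type vocabulary, and `act_predict`/`vlm_generate` callables are assumptions for illustration, not the paper's actual API.

```python
# Hypothetical sketch of LiMAC-style routing: the lightweight AcT picks the
# action type and target element; the expensive VLM is invoked only when the
# action needs generated text. All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

# Action types commonly found in Android control datasets (assumed vocabulary).
TEXT_ACTIONS = {"input-text", "open-app"}  # need generated text -> call VLM

@dataclass
class Observation:
    screenshot: bytes   # raw screen capture
    ui_tree: str        # serialized UI hierarchy

@dataclass
class Action:
    kind: str
    target_element: Optional[int] = None  # index into the UI tree
    text: Optional[str] = None            # filled only for text actions

def decide(goal: str, history: list, act_predict, vlm_generate) -> Action:
    """Route one decision step through the compact AcT first.

    act_predict(goal, history) -> (action_kind, element_index)
    vlm_generate(goal, history, action_kind) -> str
    Both callables stand in for the fine-tuned models.
    """
    kind, element = act_predict(goal, history)
    if kind in TEXT_ACTIONS:
        # Only here does the costly VLM call happen, keeping most steps cheap.
        return Action(kind=kind, target_element=element,
                      text=vlm_generate(goal, history, kind))
    return Action(kind=kind, target_element=element)
```

Routing most steps through the small model is what keeps per-action latency low on a phone, since clicks and scrolls vastly outnumber text-entry actions in typical app-control episodes.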
Problem

Research questions and friction points this paper is trying to address.

Efficient mobile app control
Real-time decision-making on smartphones
Improving action accuracy in app interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight Multi-modal App Control
Action Transformer integration
Fine-tuned vision-language model