AI Summary
To address the challenge of efficient UI control on resource-constrained mobile devices, this paper proposes LiMAC, a lightweight multimodal mobile control architecture. LiMAC introduces a compact on-device Action Transformer (AcT) and integrates it with fine-tuned vision-language models (VLMs), specifically Florence-2 and Qwen2-VL, to jointly model past screenshots, the corresponding UI hierarchy trees, and textual task instructions. By encoding UI trees alongside visual features and applying multimodal sequence modeling, LiMAC achieves low-latency, high-accuracy action generation. Evaluated on open-source benchmarks, LiMAC attains an overall action accuracy of 86.3%, outperforming fine-tuned VLMs by up to 19% and GPT-4o prompt-engineering baselines by up to 42%. It surpasses existing on-device approaches, offering a lightweight, real-time, and accurate UI automation approach for resource-limited edge devices.
Abstract
This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interaction and control across various Android apps. LiMAC takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence-2 and Qwen2-VL. It also significantly outperforms prompt-engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.
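The abstract describes a hybrid decision loop: a small on-device model (AcT) handles each step, with a fine-tuned VLM brought in where needed. The following is a minimal sketch of what such a routing step could look like; the function names, the `Observation` fields, and the rule that text-producing actions (e.g. typing text or naming an app to open) are delegated to the VLM while gesture-style actions are resolved by AcT alone are illustrative assumptions, not the paper's exact interface.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes  # raw screen capture for this step
    ui_tree: str       # serialised UI hierarchy for this step

def act_predict_type(goal: str, history: list) -> str:
    # Placeholder for the small Action Transformer (AcT): in the real
    # system this is a learned model over the goal and past observations.
    return "click" if history else "open-app"

def vlm_generate_text(goal: str, obs: Observation) -> str:
    # Placeholder for the fine-tuned VLM that produces free-form text
    # arguments (hypothetical stand-in, not the paper's API).
    return "settings"

def limac_step(goal: str, history: list, obs: Observation) -> dict:
    """One decision step: AcT picks the action type; actions that need
    generated text are delegated to the VLM for their arguments."""
    action_type = act_predict_type(goal, history)
    if action_type in ("input-text", "open-app"):
        return {"type": action_type, "text": vlm_generate_text(goal, obs)}
    # Gesture-style actions (click, scroll, ...) are specified without
    # invoking the larger VLM, keeping the common path lightweight.
    return {"type": action_type, "target": "ui-element-id"}
```

The design rationale implied by the abstract is latency: most steps never touch the larger VLM, so the frequent path runs entirely through the compact transformer.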