AppVLM: A Lightweight Vision Language Model for Online App Control

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Smartphone assistants (app agents) face high computational overhead and poor adaptability to out-of-distribution tasks. Method: This paper proposes AppVLM, a lightweight vision-language model trained in two stages: offline supervised fine-tuning on the AndroidControl dataset, followed by iterative online refinement on data collected in the AndroidWorld environment. AppVLM pairs a ViT-CLIP visual encoder with a lightweight LLaMA-based language decoder. Contribution/Results: AppVLM achieves the highest action prediction accuracy among evaluated baselines on the AndroidControl offline benchmark, and on online AndroidWorld tasks it matches GPT-4o's task success rate while being up to ten times faster at inference, making it a practical, low-latency VLM for instruction-driven GUI control on mobile devices.
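The two-stage paradigm described above can be sketched as a simple training loop: a supervised pass over offline demonstrations, then repeated rounds of collecting trajectories in the environment and fine-tuning on the successful ones. This is a minimal illustrative sketch, not the paper's actual implementation; all names (`offline_sft`, `collect_episode`, `online_refinement`) and the dictionary-based "policy" and "environment" are hypothetical stand-ins for the real model, optimizer, and AndroidWorld interface.

```python
def offline_sft(policy, dataset):
    """Stage 1: supervised fine-tuning on (observation, action) pairs.

    A dict update stands in for a gradient step on the VLM.
    """
    for obs, action in dataset:
        policy[obs] = action
    return policy

def collect_episode(policy, env, goal, max_steps=3):
    """Roll out the current policy; return the trajectory and a success flag.

    `env` is a toy transition table: (obs, action) -> next obs.
    """
    trajectory = []
    obs = env["start"]
    for _ in range(max_steps):
        action = policy.get(obs, "noop")
        trajectory.append((obs, action))
        obs = env.get((obs, action), obs)
    return trajectory, obs == goal

def online_refinement(policy, env, goals, iterations=2):
    """Stage 2: iteratively collect successful trajectories and retrain on them."""
    for _ in range(iterations):
        successes = []
        for goal in goals:
            trajectory, succeeded = collect_episode(policy, env, goal)
            if succeeded:
                successes.extend(trajectory)
        policy = offline_sft(policy, successes)  # reuse the SFT step on new data
    return policy
```

Filtering to successful episodes before retraining is what lets the online stage improve the policy without an explicit reward model, under the assumption that task completion can be checked automatically, as AndroidWorld provides.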

📝 Abstract
The utilisation of foundation models as smartphone assistants, termed app agents, is a critical research challenge. These agents aim to execute human instructions on smartphones by interpreting textual instructions and performing actions via the device's interface. While promising, current approaches face significant limitations. Methods that use large proprietary models, such as GPT-4o, are computationally expensive, while those that use smaller fine-tuned models often lack adaptability to out-of-distribution tasks. In this work, we introduce AppVLM, a lightweight Vision-Language Model (VLM). First, we fine-tune it offline on the AndroidControl dataset. Then, we refine its policy by collecting data from the AndroidWorld environment and performing further training iterations. Our results indicate that AppVLM achieves the highest action prediction accuracy in offline evaluation on the AndroidControl dataset, compared to all evaluated baselines, and matches GPT-4o in online task completion success rate in the AndroidWorld environment, while being up to ten times faster. This makes AppVLM a practical and efficient solution for real-world deployment.
Problem

Research questions and friction points this paper is trying to address.

Lightweight Vision-Language Model
Smartphone App Control
Efficient Task Execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight Vision-Language Model
Offline fine-tuning on AndroidControl
Online refinement in AndroidWorld