🤖 AI Summary
To address the privacy leakage, high computational overhead, and deployment challenges of cloud-dependent Computer-Using Agents (CUAs), this paper introduces the first fully local, lightweight vision-language model framework. Methodologically: (1) an LLM-as-Judge paradigm automatically evaluates and filters synthetic GUI interaction trajectories, yielding high-quality DPO training data without human annotation; (2) a compact vision encoder, instruction-tuned language model, and localized GUI action modeling module are jointly optimized. On the OS-World benchmark, the approach significantly outperforms existing baselines, advancing three critical dimensions at once: strict privacy preservation (fully on-device execution), edge inference speed (3.2× faster), and cross-application generalization. This work establishes a new paradigm for trustworthy, resource-efficient CUAs on constrained devices.
📝 Abstract
Computer-use agents (CUAs) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks. CUAs have made significant progress with the advent of large vision-language models (VLMs). However, these agents typically rely on cloud-based inference with substantial compute demands, raising critical privacy and scalability concerns, especially when operating on personal devices. In this work, we take a step toward privacy-preserving and resource-efficient agents by developing a lightweight vision-language model that runs entirely on local machines. To train this compact agent, we introduce an LLM-as-Judge framework that automatically evaluates and filters synthetic interaction trajectories, producing high-quality data for reinforcement learning without human annotation. Experiments on the OS-World benchmark demonstrate that our fine-tuned local model outperforms existing baselines, highlighting a promising path toward private, efficient, and generalizable GUI agents.
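The judge-and-filter pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Trajectory` class, the score field, the 0.7 threshold, and the pairing strategy are all assumptions, and the judge here is a simple score threshold standing in for an actual LLM-as-Judge call.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    task: str             # natural-language task instruction
    actions: list         # GUI actions the agent took
    judge_score: float    # score in [0, 1]; in the real pipeline, produced by an LLM judge

def is_accepted(traj: Trajectory, threshold: float = 0.7) -> bool:
    """Stand-in for the LLM judge: accept trajectories scoring above a threshold.
    The threshold value is illustrative, not from the paper."""
    return traj.judge_score >= threshold

def build_dpo_pairs(trajectories: list) -> list:
    """Group rollouts by task and pair each accepted trajectory (chosen)
    with each rejected one (rejected), forming DPO preference pairs."""
    by_task = {}
    for t in trajectories:
        by_task.setdefault(t.task, []).append(t)
    pairs = []
    for task, trajs in by_task.items():
        chosen = [t for t in trajs if is_accepted(t)]
        rejected = [t for t in trajs if not is_accepted(t)]
        for c in chosen:
            for r in rejected:
                pairs.append({"prompt": task,
                              "chosen": c.actions,
                              "rejected": r.actions})
    return pairs

# Example: two rollouts of the same task, one judged successful, one not.
rollouts = [
    Trajectory("open settings", ["click(menu)", "click(settings)"], 0.9),
    Trajectory("open settings", ["click(menu)", "click(help)"], 0.2),
]
pairs = build_dpo_pairs(rollouts)  # one chosen/rejected pair for this task
```

Because the judge replaces human annotation, the same loop can filter arbitrarily many synthetic rollouts; only tasks with both accepted and rejected trajectories contribute preference pairs.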