🤖 AI Summary
To address the privacy leakage, high computational overhead, and deployment challenges of cloud-dependent Computer-Using Agents (CUAs), this paper introduces the first fully local, lightweight vision-language model framework. Methodologically: (1) an LLM-as-Judge paradigm automatically evaluates and filters synthetic GUI interaction trajectories, yielding high-quality DPO training data without human annotation; (2) a compact vision encoder, instruction-tuned language model, and localized GUI action modeling module are jointly optimized. On the OS-World benchmark, the approach significantly outperforms existing baselines, advancing three critical dimensions at once: strict privacy preservation (fully on-device execution), edge inference speed (3.2× faster), and cross-application generalization. This work establishes a new paradigm for trustworthy, resource-efficient CUAs on constrained devices.
📝 Abstract
Computer-use agents (CUAs) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks. CUAs have made significant progress with the advent of large vision-language models (VLMs). However, these agents typically rely on cloud-based inference with substantial compute demands, raising critical privacy and scalability concerns, especially when operating on personal devices. In this work, we take a step toward privacy-preserving and resource-efficient agents by developing a lightweight vision-language model that runs entirely on local machines. To train this compact agent, we introduce an LLM-as-Judge framework that automatically evaluates and filters synthetic interaction trajectories, producing high-quality data for reinforcement learning without human annotation. Experiments on the OS-World benchmark demonstrate that our fine-tuned local model outperforms existing baselines, highlighting a promising path toward private, efficient, and generalizable GUI agents.
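The judge-and-filter pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Trajectory` class, the score field, the 0.7 threshold, and the pairing strategy are all assumptions, and the judge here is a simple score threshold standing in for an actual LLM-as-Judge call.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    task: str             # natural-language task instruction
    actions: list         # GUI actions the agent took
    judge_score: float    # score in [0, 1]; in the real pipeline, produced by an LLM judge

def is_accepted(traj: Trajectory, threshold: float = 0.7) -> bool:
    """Stand-in for the LLM judge: accept trajectories scoring above a threshold.
    The threshold value is illustrative, not from the paper."""
    return traj.judge_score >= threshold

def build_dpo_pairs(trajectories: list) -> list:
    """Group rollouts by task and pair each accepted trajectory (chosen)
    with each rejected one (rejected), forming DPO preference pairs."""
    by_task = {}
    for t in trajectories:
        by_task.setdefault(t.task, []).append(t)
    pairs = []
    for task, trajs in by_task.items():
        chosen = [t for t in trajs if is_accepted(t)]
        rejected = [t for t in trajs if not is_accepted(t)]
        for c in chosen:
            for r in rejected:
                pairs.append({"prompt": task,
                              "chosen": c.actions,
                              "rejected": r.actions})
    return pairs

# Example: two rollouts of the same task, one judged successful, one not.
rollouts = [
    Trajectory("open settings", ["click(menu)", "click(settings)"], 0.9),
    Trajectory("open settings", ["click(menu)", "click(help)"], 0.2),
]
pairs = build_dpo_pairs(rollouts)  # one chosen/rejected pair for this task
```

Because the judge replaces human annotation, the same loop can filter arbitrarily many synthetic rollouts; only tasks with both accepted and rejected trajectories contribute preference pairs.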