OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models

📅 2025-12-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Vision-language models (VLMs) powering GUI agents suffer from error accumulation and irreversible actions in long-horizon tasks, which severely limits reliability. Method: We propose a trustworthy step-level critique modeling framework. We introduce the first cross-platform (Mobile/Web/Desktop) GUI critique data synthesis pipeline, yielding 310K high-quality samples; design a two-stage training paradigm comprising supervised fine-tuning (SFT) and consistency-preserving group relative policy optimization (CP-GRPO); and jointly model multi-platform GUI states to generate actionable feedback. Contribution/Results: We release OS-Critic Bench, the first comprehensive multi-platform GUI critique benchmark. Our OS-Oracle-7B model achieves state-of-the-art performance on this benchmark, outperforming all open-source VLMs and surpassing leading closed-source models on mobile tasks. Integrated as a pre-critic, it significantly boosts the task success rates of agents such as UI-TARS on OSWorld and AndroidWorld.

๐Ÿ“ Abstract
With VLM-powered computer-using agents (CUAs) becoming increasingly capable at graphical user interface (GUI) navigation and manipulation, reliable step-level decision-making has emerged as a key bottleneck for real-world deployment. In long-horizon workflows, errors accumulate quickly and irreversible actions can cause unintended consequences, motivating critic models that assess each action before execution. While critic models offer a promising solution, their effectiveness is hindered by the lack of diverse, high-quality GUI feedback data and public critic benchmarks for step-level evaluation in computer use. To bridge these gaps, we introduce OS-Oracle that makes three core contributions: (1) a scalable data pipeline for synthesizing cross-platform GUI critic data; (2) a two-stage training paradigm combining supervised fine-tuning (SFT) and consistency-preserving group relative policy optimization (CP-GRPO); (3) OS-Critic Bench, a holistic benchmark for evaluating critic model performance across Mobile, Web, and Desktop platforms. Leveraging this framework, we curate a high-quality dataset containing 310k critic samples. The resulting critic model, OS-Oracle-7B, achieves state-of-the-art performance among open-source VLMs on OS-Critic Bench, and surpasses proprietary models on the mobile domain. Furthermore, when serving as a pre-critic, OS-Oracle-7B improves the performance of native GUI agents such as UI-TARS-1.5-7B in OSWorld and AndroidWorld environments. The code is open-sourced at https://github.com/numbmelon/OS-Oracle.
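The "pre-critic" deployment described in the abstract (a critic that assesses each action before execution) can be sketched as a gating loop. This is a minimal, self-contained illustration under assumed interfaces: the class names, the toy rejection rule, and the retry logic are hypothetical and do not reflect the paper's actual API.

```python
# Sketch of a pre-critic gating loop: a step-level critic judges each
# proposed GUI action BEFORE execution, so irreversible mistakes are
# caught early and the agent can revise using the critic's feedback.
# ToyAgent / ToyCritic are stand-ins, not OS-Oracle's real interfaces.
from dataclasses import dataclass


@dataclass
class Verdict:
    approved: bool
    feedback: str = ""


class ToyCritic:
    """Rejects any action containing an irreversible operation."""
    IRREVERSIBLE = ("delete", "format", "submit_payment")

    def judge(self, action: str, task: str) -> Verdict:
        if any(op in action for op in self.IRREVERSIBLE):
            return Verdict(False, "action is irreversible; pick a safer step")
        return Verdict(True)


class ToyAgent:
    """Proposes a risky action first, then a safe one after feedback."""
    def propose(self, task: str, feedback: str = "") -> str:
        return "click_cancel" if feedback else "delete_file"


def run_step(agent: ToyAgent, critic: ToyCritic, task: str,
             max_retries: int = 2) -> str:
    """Return the first action the critic approves (or the last attempt)."""
    action = agent.propose(task)
    for _ in range(max_retries):
        verdict = critic.judge(action, task)
        if verdict.approved:
            return action
        # Feed the critic's actionable feedback back to the agent.
        action = agent.propose(task, feedback=verdict.feedback)
    return action


print(run_step(ToyAgent(), ToyCritic(), "clean up downloads"))  # click_cancel
```

The key design point is that the critic sits between proposal and execution: a rejected step is never sent to the environment, which is what limits error accumulation in long-horizon workflows.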
Problem

Research questions and friction points this paper is trying to address.

Lack of diverse, high-quality GUI feedback data for training critic models
No public benchmarks for step-level action evaluation in computer use
Need for reliable step-level decision-making in cross-platform GUI navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Builds a scalable pipeline for synthesizing cross-platform GUI critic data
Trains with a two-stage paradigm combining SFT and CP-GRPO
Introduces OS-Critic Bench for multi-platform critic evaluation
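The CP-GRPO stage builds on group relative policy optimization, where rewards for a group of sampled critiques of the same step are normalized within the group to form advantages. The sketch below shows that group-relative normalization; the "consistency" bonus (rewarding agreement between a critique's verdict and its rationale) is a hypothetical reading of the "consistency-preserving" idea, not the paper's exact objective.

```python
# Group-relative advantages as used in GRPO-style training: each sampled
# response's reward is standardized against the mean/std of its group.
# The consistency bonus in reward() is an ASSUMED stand-in for CP-GRPO's
# consistency-preserving term, not the paper's actual formulation.
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reward(correct: bool, verdict_matches_rationale: bool) -> float:
    # Base reward for a correct step judgment, plus a bonus when the
    # stated verdict agrees with the accompanying rationale.
    return float(correct) + 0.5 * float(verdict_matches_rationale)


# Four sampled critiques of the same GUI step: (correct, consistent).
group = [reward(True, True), reward(True, False),
         reward(False, False), reward(False, True)]
advs = group_relative_advantages(group)
print([round(a, 3) for a in advs])
```

Because advantages are computed relative to the group rather than a learned value function, no separate value model is needed; the consistency bonus simply shifts which samples end up above the group mean.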