CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reward models for computer-using agents (CUAs) lack systematic evaluation benchmarks, and scripted verifiers are ill-suited for scalable, fine-grained process assessment. Method: We introduce the first dedicated benchmark for CUA reward modeling, enabling dual-dimensional evaluation of both task outcomes and execution processes across diverse settings: 7 vision-language models (VLMs), 3 prompt-template categories, and heterogeneous software environments, built on expert-annotated trajectories with rigorous quality control. We propose Unanimous Prompt Ensemble (UPE), a novel ensemble method based on strict unanimous voting across prompt-template configurations. Contribution/Results: UPE achieves 89.8% precision (93.3% NPV) for outcome evaluation and 81.7% precision (85.1% NPV) for process evaluation, surpassing single-model and conventional ensemble baselines. The benchmark also uncovers critical limitations in visual reasoning and knowledge coverage, establishing a standardized evaluation paradigm for CUA reward modeling.

📝 Abstract
Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance levels (25.9%-50.8% success rates). All trajectories are expertly annotated through carefully designed protocols, with rigorous quality control to ensure reliability and practical applicability. (3) Comprehensive Analysis and Insights: Through extensive experiments across 7 vision-language models and 3 prompt templates, we reveal critical limitations of current CUA RMs, including insufficient visual reasoning capabilities, knowledge deficiencies, and the superiority of general VLMs over specialized CUA models for reward evaluation. (4) Unanimous Prompt Ensemble (UPE): Based on the insights from our comprehensive analysis, we propose UPE, a novel ensemble method that significantly enhances reward model reliability through strict unanimous voting and strategic prompt-template configurations. UPE achieves 89.8% precision and 93.3% NPV for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially outperforming single VLMs and traditional ensemble approaches.
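The abstract reports precision and NPV (negative predictive value) rather than plain accuracy. As a reminder of what those numbers mean, here is a minimal sketch computing both from confusion-matrix counts (the counts below are illustrative, not from the paper):

```python
def precision_npv(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    # Precision: of the trajectories the reward model judged successful,
    # what fraction truly succeeded?
    precision = tp / (tp + fp)
    # NPV: of the trajectories judged failed, what fraction truly failed?
    npv = tn / (tn + fn)
    return precision, npv

# Illustrative counts only:
p, n = precision_npv(tp=45, fp=5, tn=40, fn=10)
print(p, n)  # 0.9 0.8
```

High precision and high NPV together mean the model's positive and negative verdicts are both trustworthy, which is the property UPE optimizes for.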
Problem

Research questions and friction points this paper is trying to address.

Evaluating reward models for computer-using agents systematically
Overcoming limitations of script-based verifiers in agent assessment
Benchmarking both outcome and process reward models for CUAs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced the first comprehensive benchmark for computer-using agent reward models
Proposed the novel Unanimous Prompt Ensemble (UPE) method for reliable reward evaluation
Achieved high precision and NPV through strict unanimous voting and strategic prompt-template configurations