Video-Based Reward Modeling for Computer-Use Agents

📅 2026-03-10

🤖 AI Summary

This work addresses the challenge of scalable, model-agnostic evaluation of computer-using agents (CUAs), in particular the lack of general-purpose evaluators that do not rely on an agent's internal reasoning traces. The authors propose ExeVRM, a reward model that judges task success from only the user instruction and a video of the execution (a sequence of keyframes). They introduce ExeVR-53k, a dataset of 53k video-task-reward triplets, and use adversarial instruction translation to synthesize negative samples with step-level annotations. To handle long, high-resolution videos, they design a spatiotemporal token pruning strategy that removes homogeneous regions and persistent tokens while preserving decisive UI changes. The resulting 8B-parameter ExeVRM achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution.
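For readers less familiar with the two reported metrics, the following is a minimal sketch of how accuracy and recall are conventionally computed for a binary success/failure judge. It is not the authors' evaluation code. The function name `accuracy_and_recall` is hypothetical, and treating "success" (label 1) as the positive class for recall is an assumption; the summary does not say which class the 87.7% recall refers to.

```python
from typing import Sequence, Tuple


def accuracy_and_recall(labels: Sequence[int], preds: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, recall) for binary labels, with 1 as the positive class.

    Assumption: 1 means "task succeeded"; swap the encoding if recall should
    be measured on the failure class instead.
    """
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and equally long")

    # Count the confusion-matrix cells needed by the two metrics.
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    accuracy = (tp + tn) / len(labels)
    # Recall is undefined without positive examples; report 0.0 in that case.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall
```

A high recall alongside a lower accuracy, as in the reported numbers, would mean the judge rarely misses positive-class trajectories but still makes some errors on the other class.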

📝 Abstract

Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.
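The abstract describes spatiotemporal token pruning only at a high level, so the following is a rough illustration of one possible reading, not the authors' method. It assumes each keyframe is already split into a grid of patches (one patch per visual token), drops a patch when its pixel variance is near zero ("homogeneous region"), and drops a patch when it is nearly identical to the same grid cell in the previous frame ("persistent token"). The function names and both thresholds are hypothetical.

```python
from typing import List, Tuple

Patch = List[float]            # flattened grayscale pixel values of one patch
Frame = List[List[Patch]]      # grid of patches, indexed as frame[row][col]


def patch_variance(patch: Patch) -> float:
    """Population variance of pixel values; near zero for flat backgrounds."""
    mean = sum(patch) / len(patch)
    return sum((p - mean) ** 2 for p in patch) / len(patch)


def patch_distance(a: Patch, b: Patch) -> float:
    """Mean absolute pixel difference between two equally sized patches."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def prune_tokens(
    frames: List[Frame],
    var_threshold: float = 1e-3,     # hypothetical: below this a patch counts as homogeneous
    change_threshold: float = 1e-2,  # hypothetical: below this a patch counts as unchanged
) -> List[Tuple[int, int, int]]:
    """Return the (frame, row, col) indices of the tokens that survive pruning."""
    kept: List[Tuple[int, int, int]] = []
    for t, frame in enumerate(frames):
        for r, row in enumerate(frame):
            for c, patch in enumerate(row):
                # Spatial rule: skip flat, texture-free regions.
                if patch_variance(patch) < var_threshold:
                    continue
                # Temporal rule: skip patches unchanged since the previous frame.
                if t > 0 and patch_distance(patch, frames[t - 1][r][c]) < change_threshold:
                    continue
                kept.append((t, r, c))
    return kept
```

In this simplified version a region that changes into a flat color is still discarded by the spatial rule, so a real implementation would need extra care to keep such "decisive UI changes"; the sketch only conveys the general idea of removing redundancy along both space and time.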

Problem

Research questions and friction points this paper is trying to address.

computer-using agents

reward modeling

video-based evaluation

task success assessment

execution video

Innovation

Methods, ideas, or system contributions that make the work stand out.

video-based reward modeling

spatiotemporal token pruning

adversarial instruction translation