OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

📅 2026-03-19
🤖 AI Summary
This work addresses a core challenge in reinforcement learning for GUI-based agents: existing reward functions struggle to balance scalability with evaluation accuracy, which limits training performance. To overcome this, the authors propose OS-Themis, a framework that combines a milestone-decomposition-based multi-agent critic with a chain-of-evidence auditing mechanism. These components decompose task trajectories into verifiable subgoals and improve reward signal quality through a self-training loop with trajectory filtering. The approach improves both the fidelity of reward assessment and system scalability, achieving a 10.3% gain in online RL training performance on AndroidWorld and a 6.9% improvement from trajectory filtering in the self-training loop. OS-Themis also consistently outperforms prior methods on the newly introduced OGRBench benchmark.
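The milestone-decomposition critic described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: the `Milestone`, `judge_milestone`, and `audit_evidence_chain` names are assumptions, and the string-matching judge stands in for what would really be a VLM-based evaluator per subgoal.

```python
from dataclasses import dataclass

@dataclass
class Milestone:
    description: str   # verifiable subgoal, e.g. "settings page opened"
    evidence: str      # screenshot/DOM snippet claimed to support it

def judge_milestone(m: Milestone) -> bool:
    # Placeholder per-milestone judge; a real critic would query a
    # vision-language model with the subgoal and its evidence.
    return m.description.lower() in m.evidence.lower()

def audit_evidence_chain(milestones: list[Milestone], verdicts: list[bool]) -> bool:
    # Chain-of-evidence audit: one unverified subgoal or missing piece
    # of evidence voids the whole trajectory before the final verdict.
    return all(verdicts) and all(m.evidence for m in milestones)

def critic_reward(milestones: list[Milestone]) -> float:
    # Final outcome reward: 1.0 only if every milestone is judged
    # successful and the evidence chain survives the audit.
    verdicts = [judge_milestone(m) for m in milestones]
    return 1.0 if audit_evidence_chain(milestones, verdicts) else 0.0

trajectory = [
    Milestone("open settings", "screenshot: Open Settings page visible"),
    Milestone("enable wifi", "screenshot: Enable WiFi toggle is on"),
]
print(critic_reward(trajectory))  # 1.0 when every subgoal is evidenced
```

The design choice being illustrated is that the reward is a conjunction over decomposed subgoals rather than a single holistic judgment, so evidence for each step is isolated and auditable.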

📝 Abstract
Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.
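The abstract's second use of the critic, validating and filtering trajectories inside a self-training loop, can be sketched as below. The function name, the 0.5 threshold, and the toy critic are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable

def filter_trajectories(trajectories: list[dict],
                        critic: Callable[[dict], float],
                        threshold: float = 0.5) -> list[dict]:
    # Keep only trajectories the critic scores above the threshold;
    # the survivors form the fine-tuning set for the next round of
    # self-training, so reward fidelity directly gates data quality.
    return [t for t in trajectories if critic(t) > threshold]

# Usage with a toy binary critic keyed on a success flag:
toy = [{"steps": 5, "success": True}, {"steps": 9, "success": False}]
kept = filter_trajectories(toy, critic=lambda t: 1.0 if t["success"] else 0.0)
print(len(kept))  # 1
```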
Problem

Research questions and friction points this paper is trying to address.

GUI agents
reward function
scalability
performance
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

scalable critic framework
multi-agent reward modeling
trajectory decomposition
evidence auditing
GUI reinforcement learning