Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current general-purpose virtual agents (GVAs) rely heavily on costly human annotations and coarse-grained outcome supervision, resulting in poor generalization and scalability. To address this, we propose Similar, the first fine-grained, multi-dimensional, step-wise reward model tailored for GVAs. Our approach introduces: (1) a five-dimensional action evaluation framework; (2) MCTS-P, an automated trajectory sampling algorithm that combines Monte Carlo Tree Search with pruning; (3) the Triple-M training strategy, which integrates task-aware, step-level, and multi-dimensional supervision; and (4) SRM, the first step-wise, multi-dimensional reward benchmark for GVAs, comprising the SRMTrain and SRMEval splits. Experiments demonstrate that Similar significantly improves action selection quality, achieving state-of-the-art performance in fine-grained signal modeling, cross-task generalization, and system scalability over all baselines.

📝 Abstract
The development of Generalist Virtual Agents (GVAs) powered by Multimodal Large Language Models (MLLMs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a Step-wise Multi-dimensional Generalist Reward Model, which offers fine-grained signals for agent training and can select better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with the Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRMTrain, which serves as the training set for Similar, and SRMEval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gains, provides GVAs with effective intermediate signals during both training and inference-time scaling. The code is available at https://github.com/Galery23/Similar-v1.
Problem

Research questions and friction points this paper is trying to address.

Enhance GVA training with fine-grained reward signals
Reduce reliance on human annotations in agent training
Establish benchmark for step-wise multi-dimensional evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-wise Multi-dimensional Generalist Reward Model
MCTS-P algorithm for automatic data annotation
Triple-M strategy for training reward model
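The MCTS-P idea of sampling trajectories while discarding low-quality branches can be sketched as a greedy rollout with step-wise pruning. This is a simplified assumption-laden toy, not the paper's MCTS-P: the reward function, threshold, and greedy commitment are stand-ins for the full search with pruning.

```python
# Hypothetical sketch of trajectory sampling with step-wise pruning:
# candidate actions whose reward falls below a threshold are discarded
# before the rollout continues. Simplified to a greedy rollout; the
# real MCTS-P performs a full tree search.

def sample_trajectory(init_state, candidates_fn, reward_fn,
                      depth=3, threshold=0.4):
    """Return a list of (action, reward) steps. At each step, score every
    candidate action, prune those below `threshold`, and commit to the
    best survivor."""
    state, trajectory = init_state, []
    for _ in range(depth):
        scored = [(a, reward_fn(state, a)) for a in candidates_fn(state)]
        survivors = [s for s in scored if s[1] >= threshold]  # prune
        if not survivors:
            break  # every branch pruned: stop the rollout early
        action, r = max(survivors, key=lambda s: s[1])
        trajectory.append((action, r))
        state = state + (action,)
    return trajectory
```

Pruning keeps annotation cost bounded: branches the reward model already scores poorly are never expanded, so automatic data collection concentrates on informative trajectories.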
Bingchen Miao
Zhejiang University, Hangzhou, China; Ant Group, Hangzhou, China
Yang Wu
Ant Group, Hangzhou, China
Minghe Gao
Zhejiang University
Machine Learning
Qifan Yu
Zhejiang University
MLLM · multimodal learning · image generation & editing
Wendong Bu
Zhejiang University, Hangzhou, China; Ant Group, Hangzhou, China
Wenqiao Zhang
Zhejiang University, Hangzhou, China
Yunfei Li
ByteDance Seed
Reinforcement Learning · Robotics
Siliang Tang
Professor of Computer Science, Zhejiang University
Natural Language Processing · Cross-media Analysis · Graph Neural Network
Tat-Seng Chua
National University of Singapore, Kent Ridge, Singapore
Juncheng Li
East China Normal University
Super Resolution · Image Restoration · Computer Vision · Medical Image Analysis