Multimodal Reinforcement Learning with Agentic Verifier for AI Agents

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal reinforcement learning (MMRL) agents rely predominantly on sparse outcome rewards, which limits their ability to learn fine-grained capabilities such as stepwise reasoning and spatiotemporal localization and leaves them vulnerable to noisy teacher signals and reward hacking. Method: We propose Argos, a framework featuring a hybrid pool of scoring functions (rule-based metrics, teacher-model scoring, and supervised fine-tuning-guided filtering) and a dynamically switchable proxy reward mechanism that jointly evaluates reasoning quality, spatiotemporal localization accuracy, and final-answer correctness. We provide theoretical guarantees of Pareto optimality for the proposed reward design, which mitigates ungrounded reasoning and reward hacking. Contribution/Results: Leveraging online verification-driven MMRL training, Argos achieves state-of-the-art performance on benchmarks spanning spatial reasoning, visual hallucination detection, and embodied AI, while significantly improving training stability and generalization across diverse tasks.

📝 Abstract
Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized with sparse, outcome-based rewards computed from final answers alone. Richer rewards computed from the reasoning tokens can significantly improve learning by providing finer-grained guidance. However, computing more informative rewards in MMRL beyond outcome-based ones is challenging, since different samples may require different scoring functions and teacher models may themselves provide noisy reward signals. In this paper, we introduce Argos (Agentic Reward for Grounded & Objective Scoring), a principled reward agent for training multimodal reasoning models on agentic tasks. For each sample, Argos selects from a pool of teacher-model-derived and rule-based scoring functions to simultaneously evaluate: (i) final-response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. We find that by leveraging our agentic verifier in both SFT data curation and RL training, our model achieves state-of-the-art results across multiple agentic tasks, including spatial reasoning, visual hallucination, and robotics and embodied AI benchmarks. Critically, we demonstrate that relying solely on SFT post-training with highly curated reasoning data is insufficient: without our online verification, agents invariably collapse to ungrounded solutions during RL. We also show that our agentic verifier helps reduce reward hacking in MMRL. Finally, we provide a theoretical justification for the effectiveness of Argos through the concept of Pareto optimality.
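The paper does not include code here; the sketch below only illustrates the general pattern the abstract describes, a reward agent that selects per-sample scoring functions from a pool (rule-based answer checking, localization scoring, and a teacher-model judgment) and combines them into one dense reward. All names, data fields, and the averaging rule are hypothetical assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Hypothetical rollout record: reasoning tokens, predicted/gold
# localizations, and the final answer.
@dataclass
class Sample:
    task: str
    final_answer: str
    gold_answer: str
    reasoning: str
    pred_boxes: List[Box] = field(default_factory=list)
    gold_boxes: List[Box] = field(default_factory=list)

def exact_match(s: Sample) -> float:
    """Rule-based final-answer correctness."""
    return 1.0 if s.final_answer.strip() == s.gold_answer.strip() else 0.0

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def localization_score(s: Sample) -> float:
    """Mean best-IoU of predictions against each gold box."""
    if not s.gold_boxes:
        return 1.0
    if not s.pred_boxes:
        return 0.0
    return sum(max(iou(p, g) for p in s.pred_boxes)
               for g in s.gold_boxes) / len(s.gold_boxes)

def teacher_reasoning_score(s: Sample) -> float:
    """Stand-in for a teacher-model judge of reasoning quality;
    a real system would query an LLM here (noisy by nature)."""
    return 1.0 if s.gold_answer in s.reasoning else 0.5

# Per-task pool: the agent picks which scorers apply to a sample.
POOL: Dict[str, List[Callable[[Sample], float]]] = {
    "spatial_reasoning": [exact_match, localization_score,
                          teacher_reasoning_score],
    "qa": [exact_match, teacher_reasoning_score],
}

def argos_reward(s: Sample) -> float:
    """Average the selected scorers into a single dense reward."""
    scorers = POOL.get(s.task, [exact_match])
    return sum(f(s) for f in scorers) / len(scorers)
```

Averaging is just one way to aggregate; the paper's dynamically switchable proxy reward and Pareto-optimality argument go beyond this toy combination.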
Problem

Research questions and friction points this paper is trying to address.

Develops a reward agent for multimodal reinforcement learning tasks
Addresses sparse and noisy reward signals in agentic reasoning models
Improves evaluation of response accuracy, localization, and reasoning quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic verifier selects scoring functions for multimodal tasks
Evaluates response accuracy, entity localization, reasoning quality
Reduces reward-hacking and improves state-of-the-art performance
Authors
Reuben Tan (Microsoft Research)
Baolin Peng (Microsoft Research, Redmond)
Zhengyuan Yang (Principal Researcher, Microsoft)
Hao Cheng (Microsoft Research)
Oier Mees (Microsoft)
Theodore Zhao (Microsoft Research)
Andrea Tupini (Microsoft Research)
Isar Meijier (Microsoft Research)
Qianhui Wu (Microsoft Research)
Yuncong Yang (University of Massachusetts Amherst)
Lars Lidén (Microsoft Research)
Yu Gu (Microsoft Research)
Sheng Zhang (Microsoft Research)
Xiaodong Liu (Microsoft Research)
Lijuan Wang (Microsoft Research)
Marc Pollefeys (Professor of Computer Science, ETH Zurich, and Director, Spatial AI Lab, Microsoft)
Yong Jae Lee (UW-Madison)
Jianfeng Gao (Microsoft Research)