ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

πŸ“… 2025-12-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing vision-language reward models suffer from hallucination, weak visual grounding, and a lack of verifiability in complex multimodal reasoning. To address these issues, we propose ARM-Thinker, the first multimodal reward model capable of autonomous tool invocation. It leverages tools such as image cropping and cross-page document retrieval to generate verifiable visual and semantic evidence, enabling fine-grained verification and traceable reasoning. Our approach innovatively integrates agent-style tool use into reward modeling, departing from static scoring paradigms; employs multi-stage reinforcement learning to jointly optimize tool selection and judgment accuracy; and introduces ARMBench-VL, a new benchmark covering image-level, text-level, and multi-page document understanding tasks. Experiments show that ARM-Thinker achieves an average 16.2% improvement on mainstream reward-modeling benchmarks, a 9.6% gain on tool-use tasks, and substantial improvements over baselines on multimodal mathematical and logical reasoning.

πŸ“ Abstract
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
Problem

Research questions and friction points this paper is trying to address.

Existing reward models cannot invoke tools to verify their judgments
VLM-based reward models hallucinate and exhibit weak visual grounding
Complex multimodal reasoning lacks verifiable, evidence-grounded evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic reward model autonomously invokes external tools for verification
Multi-stage reinforcement learning optimizes tool-calling and judgment accuracy
Introduces benchmarks for fine-grained visual and multi-page document understanding
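The agentic judging loop the abstract describes (the reward model alternates between tool calls that gather evidence and a final verdict, rather than emitting one static score) can be sketched roughly as below. This is a hypothetical illustration under assumed names (`agentic_judge`, `toy_policy`, the stub `crop` and `retrieve_page` tools), not the paper's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class Judgment:
    verdict: str                                   # e.g. "A" or "B": which candidate wins
    evidence: list = field(default_factory=list)   # tool observations collected en route

def crop(image, box):
    """Stub image-cropping tool: returns a labeled sub-region reference."""
    return f"crop({image}, {box})"

def retrieve_page(doc, page):
    """Stub document-retrieval tool: returns a reference to the requested page."""
    return f"{doc}#page{page}"

TOOLS = {"crop": crop, "retrieve_page": retrieve_page}

def agentic_judge(policy, state, max_steps=4):
    """Run the tool-use loop: at each step the policy either calls a tool
    (accumulating evidence) or commits to a final verdict."""
    evidence = []
    for _ in range(max_steps):
        action = policy(state, evidence)           # model decides the next step
        if action["type"] == "verdict":
            return Judgment(action["choice"], evidence)
        tool = TOOLS[action["tool"]]
        evidence.append(tool(*action["args"]))     # ground the judgment in evidence
    return Judgment("abstain", evidence)           # tool budget exhausted

# Toy hand-written policy: crop once, then decide based on gathered evidence.
# In the paper this decision-making is learned with multi-stage RL.
def toy_policy(state, evidence):
    if not evidence:
        return {"type": "tool", "tool": "crop", "args": ("img0", (0, 0, 64, 64))}
    return {"type": "verdict", "choice": "A"}

result = agentic_judge(toy_policy, state={"question": "which answer is correct?"})
```

The key structural point is that the verdict is conditioned on the evidence list, so the judgment is traceable: each score comes with the tool observations that support it.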
πŸ”Ž Similar Papers
No similar papers found.
Authors
Shengyuan Ding (Fudan University): Multimodal Learning
Xinyu Fang (Shanghai Artificial Intelligence Laboratory, Zhejiang University)
Ziyu Liu (Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University)
Yuhang Zang (Shanghai AI Laboratory): Natural Language Processing, Vision Language Model
Yuhang Cao (MMLab, The Chinese University of Hong Kong): Multi-Modal Large Language Model, Object Detection, Few-Shot Object Detection
Xiangyu Zhao (Shanghai Artificial Intelligence Laboratory)
Haodong Duan (Shanghai AI Lab | CUHK | PKU): Computer Vision, Video Understanding, Multimodal Learning, Generative AI
Xiaoyi Dong (Microsoft GenAI): Computer Vision
Jianze Liang (Shanghai Artificial Intelligence Laboratory)
Bin Wang (Shanghai Artificial Intelligence Laboratory)
Conghui He (Shanghai AI Laboratory): Data-centric AI, LLM, Document Intelligence
Dahua Lin (The Chinese University of Hong Kong): Computer Vision, Machine Learning, Probabilistic Inference, Bayesian Nonparametrics
Jiaqi Wang (Shanghai Artificial Intelligence Laboratory, Shanghai Innovation Institute)