AgentV-RL: Scaling Reward Modeling with Agentic Verifier

📅 2026-04-17
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Traditional verifiers are prone to unreliable evaluation due to error propagation in intermediate reasoning steps and insufficient access to external knowledge. To address this, this work proposes the Agentic Verifier framework, which introduces a novel forward-backward agent collaboration mechanism. It reformulates reward modeling as a multi-turn, tool-augmented bidirectional reasoning process, enabling active exploration and dynamic tool invocation. This approach yields interpretable and highly reliable verification outcomes. Notably, with only a 4B-parameter model under test-time scaling settings, the method outperforms the current state-of-the-art outcome reward model (ORM) by 25.2%, achieving substantially improved verification accuracy on complex reasoning tasks.


📝 Abstract
Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while a lack of external grounding makes verifiers unreliable on computation- or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.
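
The forward and backward checks and their use in parallel TTS can be illustrated with a toy best-of-N selector. This is only a minimal sketch under strong assumptions, not the paper's implementation: there is no LLM, no multi-turn agent, and no reinforcement learning here. Plain Python arithmetic stands in for an external tool, and every name (`Candidate`, `forward_score`, `backward_score`, `verify`, `best_of_n`) is hypothetical. Combining the two scores with `min` is likewise an assumption made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Stand-in for an external tool: exact arithmetic on each reasoning step.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

# One reasoning step: (left operand, operator, right operand, claimed result).
Step = Tuple[float, str, float, float]


@dataclass
class Candidate:
    steps: List[Step]
    answer: float


def forward_score(c: Candidate) -> float:
    """Forward check (toy): recompute each step from its premises
    and return the fraction of steps whose claimed result is correct."""
    if not c.steps:
        return 0.0
    ok = sum(1 for a, op, b, claimed in c.steps
             if abs(OPS[op](a, b) - claimed) < 1e-9)
    return ok / len(c.steps)


def backward_score(c: Candidate, check: Callable[[float], bool]) -> float:
    """Backward check (toy): test the final answer against the original problem."""
    return 1.0 if check(c.answer) else 0.0


def verify(c: Candidate, check: Callable[[float], bool]) -> float:
    """Assumed combination rule: a solution must pass both directions."""
    return min(forward_score(c), backward_score(c, check))


def best_of_n(cands: List[Candidate], check: Callable[[float], bool]) -> Candidate:
    """Parallel TTS as best-of-N: keep the candidate with the highest score."""
    return max(cands, key=lambda c: verify(c, check))


# Toy problem: solve 2*x + 3 = 11.
def satisfies_equation(x: float) -> bool:
    return abs(2 * x + 3 - 11) < 1e-9


plausible_but_wrong = Candidate([(11, "-", 3, 8), (8, "/", 2, 4)], answer=5)
bad_step = Candidate([(11, "-", 3, 9), (9, "/", 2, 4.5)], answer=4.5)
correct = Candidate([(11, "-", 3, 8), (8, "/", 2, 4)], answer=4)

best = best_of_n([plausible_but_wrong, bad_step, correct], satisfies_equation)
print(best.answer)
```

In this toy setup the first candidate has flawless steps but a wrong conclusion, so a forward-only check would accept it; the backward check rejects it. The second candidate shows the reverse weakness of a backward-only view being blind to where the reasoning broke.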
Problem

Research questions and friction points this paper is trying to address.

verifier
error propagation
external grounding
reward modeling
complex reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Verifier
Reward Modeling
Tool-Augmented Reasoning
Bidirectional Verification
Reinforcement Learning
Authors

Jiazheng Zhang
Fudan University
Large Language Model, Natural Language Processing, Data Mining

Ziche Fu
College of Computer Science and Artificial Intelligence, Fudan University

Zhiheng Xi
Fudan University
LLM Reasoning, LLM-based Agents

Wenqing Jing
College of Computer Science and Artificial Intelligence, Fudan University

Mingxu Chai
Fudan University

Wei He
Fudan University
LLM Reasoning, LLM-based Agent

Guoqiang Zhang
College of Computer Science and Artificial Intelligence, Fudan University

Chenghao Fan
School of Comp. Sci., Huazhong University of Science and Technology, Wuhan, China
Natural Language Processing, LLM

Chenxin An
The University of Hong Kong
Long-context LLMs

Wenxiang Chen
Fudan University
LLM reasoning, LLM-based agent

Zhicheng Liu
ByteDance LLM Team (Seed)
LLM

Haojie Pan
Hong Kong University of Science and Technology
World Knowledge, Natural Language Processing, Text Mining

Dingwei Zhu
College of Computer Science and Artificial Intelligence, Fudan University

Tao Gui
Institute of Trustworthy Embodied AI, Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI

Qi Zhang
Fudan University
SAGIN, satellite routing

Xuanjing Huang
Institute of Trustworthy Embodied AI, Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI