ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding

📅 2025-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RAG systems suffer from opaque multi-step reasoning, biased reward-model training data, early-step bias in process scores, and under-optimized reasoning ability. Method: The paper proposes ReARTeR (Retrieval-Augmented Reasoning through Trustworthy Process Rewarding), which jointly leverages a Process Reward Model (PRM) for accurate scalar step scoring and a Process Explanation Model (PEM) for natural-language critiques, enabling reasoning steps to be refined at inference time. It resolves PRM-PEM misalignment through off-policy preference learning, mitigates annotation bias in PRM training data with balanced annotation methods, and corrects early-step bias with a temporal-difference look-ahead search. During post-training, Monte Carlo Tree Search guided by the trustworthy rewards synthesizes high-quality step-level preference data, which is optimized via Iterative Preference Optimization. Contribution/Results: On multi-step reasoning benchmarks, ReARTeR significantly outperforms RAG+CoT and state-of-the-art PRM-based methods, improving both reasoning accuracy and explanation consistency. This validates that trustworthy process-level rewards substantially enhance RAG's reasoning capability.
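As a rough illustration of the test-time loop the summary describes, here is a minimal Python sketch. All interfaces (`policy`, `prm.score`, `pem.explain`) are hypothetical stand-ins, not the paper's actual API: the PRM flags which steps need fixing, and the PEM's critique tells the policy how to rewrite them.

```python
def refine_reasoning(question, docs, policy, prm, pem,
                     threshold=0.5, max_steps=8):
    """Greedy test-time refinement: score each candidate step with the
    PRM and, when the score falls below a threshold, ask the PEM for a
    natural-language critique that the policy uses to rewrite the step.

    All objects (policy, prm, pem) are hypothetical placeholders,
    not the paper's actual interfaces.
    """
    steps = []
    for _ in range(max_steps):
        step = policy.next_step(question, docs, steps)   # propose a step
        score = prm.score(question, steps, step)         # scalar reward
        if score < threshold:
            critique = pem.explain(question, steps, step)
            step = policy.refine(question, docs, steps, step, critique)
        steps.append(step)
        if policy.is_final(step):
            break
    return steps
```

The design point this sketch captures is the division of labor: the scalar PRM decides *when* a step needs revision, while the PEM supplies *what* to change.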

📝 Abstract
Retrieval-Augmented Generation (RAG) systems for Large Language Models (LLMs) hold promise in knowledge-intensive tasks but face limitations in complex multi-step reasoning. While recent methods have integrated RAG with chain-of-thought reasoning or test-time search using Process Reward Models (PRMs), these approaches encounter challenges such as a lack of explanations, bias in PRM training data, early-step bias in PRM scores, and insufficient post-training optimization of reasoning potential. To address these issues, we propose Retrieval-Augmented Reasoning through Trustworthy Process Rewarding (ReARTeR), a framework that enhances RAG systems' reasoning capabilities through post-training and test-time scaling. At test time, ReARTeR introduces Trustworthy Process Rewarding via a Process Reward Model for accurate scalar scoring and a Process Explanation Model (PEM) for generating natural language explanations, enabling step refinement. During post-training, it utilizes Monte Carlo Tree Search guided by Trustworthy Process Rewarding to collect high-quality step-level preference data, optimized through Iterative Preference Optimization. ReARTeR addresses three core challenges: (1) misalignment between PRM and PEM, tackled through off-policy preference learning; (2) bias in PRM training data, mitigated by balanced annotation methods and stronger annotations for challenging examples; and (3) early-step bias in PRM, resolved through a temporal-difference-based look-ahead search strategy. Experimental results on multi-step reasoning benchmarks demonstrate significant improvements, underscoring ReARTeR's potential to advance the reasoning capabilities of RAG systems.
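The temporal-difference look-ahead mentioned in the abstract can be pictured as blending a step's immediate PRM score with discounted scores of a few simulated continuation steps, so early steps are judged by where they lead rather than in isolation. The sketch below is an assumption-laden illustration (hypothetical `policy`/`prm` interfaces, simple geometric discounting), not the paper's exact formulation.

```python
def td_lookahead_score(question, docs, steps, candidate,
                       policy, prm, depth=2, gamma=0.9):
    """Score `candidate` by mixing its immediate PRM score with the
    discounted PRM scores of a short look-ahead rollout.

    `policy` and `prm` are hypothetical stand-ins; `depth` and `gamma`
    are illustrative hyperparameters, not values from the paper.
    """
    total = prm.score(question, steps, candidate)
    trajectory = steps + [candidate]
    for d in range(1, depth + 1):
        nxt = policy.next_step(question, docs, trajectory)  # simulate ahead
        total += (gamma ** d) * prm.score(question, trajectory, nxt)
        trajectory = trajectory + [nxt]
    # Normalize by the discounted weight mass so scores stay in [0, 1].
    return total / sum(gamma ** d for d in range(depth + 1))
```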
Problem

Research questions and friction points this paper is trying to address.

Limited multi-step reasoning in Retrieval-Augmented Generation (RAG) systems
Complex reasoning limitations
Training data bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-Augmented Generation
Process Explanation Model
Off-policy Learning
👥 Authors
Zhongxiang Sun
Renmin University of China
Search · Recommendation · LLM · Legal
Qipeng Wang
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Weijie Yu
School of Information Technology and Management, University of International Business and Economics, Beijing, China
Xiaoxue Zang
Kuaishou Technology
Recommender System · NLP · Dialogue · Multimodal Modeling
Kai Zheng
Kuaishou Technology Co., Ltd., Beijing, China
Jun Xu
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Xiao Zhang
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Song Yang
Kuaishou Technology Co., Ltd., Beijing, China
Han Li
Kuaishou Technology Co., Ltd., Beijing, China