🤖 AI Summary
Existing RAG systems suffer from opaque multi-step reasoning, bias in Process Reward Model (PRM) training data, early-step bias in PRM scores, and under-exploited reasoning potential. Method: The paper proposes ReARTeR (Retrieval-Augmented Reasoning through Trustworthy Process Rewarding), which pairs a Process Reward Model (PRM) for accurate scalar step scoring with a Process Explanation Model (PEM) for natural-language explanations, addressing PRM–PEM misalignment, annotation bias, and early-step bias. At test time, ReARTeR refines reasoning steps by jointly leveraging the PRM and PEM, using off-policy preference learning and a temporal-difference look-ahead search; during post-training, it synthesizes high-quality step-level preference data via Monte Carlo Tree Search and Iterative Preference Optimization. Contribution/Results: On multi-step reasoning benchmarks, ReARTeR significantly outperforms RAG+CoT and state-of-the-art PRM-based methods, improving both reasoning accuracy and explanation consistency. This validates that trustworthy process-level rewards substantially enhance RAG systems' reasoning capability.
📝 Abstract
Retrieval-Augmented Generation (RAG) systems for Large Language Models (LLMs) hold promise in knowledge-intensive tasks but face limitations in complex multi-step reasoning. While recent methods have integrated RAG with chain-of-thought reasoning or test-time search using Process Reward Models (PRMs), these approaches encounter challenges such as a lack of explanations, bias in PRM training data, early-step bias in PRM scores, and insufficient post-training optimization of reasoning potential. To address these issues, we propose Retrieval-Augmented Reasoning through Trustworthy Process Rewarding (ReARTeR), a framework that enhances RAG systems' reasoning capabilities through post-training and test-time scaling. At test time, ReARTeR introduces Trustworthy Process Rewarding via a Process Reward Model for accurate scalar scoring and a Process Explanation Model (PEM) for generating natural language explanations, enabling step refinement. During post-training, it utilizes Monte Carlo Tree Search guided by Trustworthy Process Rewarding to collect high-quality step-level preference data, optimized through Iterative Preference Optimization. ReARTeR addresses three core challenges: (1) misalignment between PRM and PEM, tackled through off-policy preference learning; (2) bias in PRM training data, mitigated by balanced annotation methods and stronger annotations for challenging examples; and (3) early-step bias in PRM, resolved through a temporal-difference-based look-ahead search strategy. Experimental results on multi-step reasoning benchmarks demonstrate significant improvements, underscoring ReARTeR's potential to advance the reasoning capabilities of RAG systems.
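The temporal-difference look-ahead described above can be sketched minimally as follows. All names here (`td_lookahead_score`, `toy_prm`, `toy_propose`, the blend weights `gamma` and `alpha`) are illustrative assumptions, not the paper's actual interfaces: the idea is simply that an early step's PRM score is blended with the best score reachable one step ahead, so a promising-but-early step is not unfairly penalized.

```python
# Hypothetical sketch of TD-based look-ahead scoring for mitigating
# early-step bias in a PRM; names and weighting scheme are assumptions.
from typing import Callable, List


def td_lookahead_score(
    step: str,
    prm: Callable[[str], float],
    propose_next: Callable[[str], List[str]],
    gamma: float = 0.9,   # discount on the look-ahead value
    alpha: float = 0.5,   # blend between immediate and look-ahead score
) -> float:
    """Blend the PRM's immediate score with a one-step look-ahead:
    r = (1 - alpha) * PRM(s) + alpha * gamma * max_{s'} PRM(s')."""
    immediate = prm(step)
    successors = propose_next(step)
    if not successors:
        return immediate
    best_next = max(prm(s) for s in successors)
    return (1 - alpha) * immediate + alpha * gamma * best_next


# Toy stand-ins for demonstration only (a real PRM is a trained model,
# and successors would come from the generator's sampled next steps).
def toy_prm(step: str) -> float:
    return min(1.0, len(step) / 20.0)


def toy_propose(step: str) -> List[str]:
    return [step + " -> retrieve", step + " -> verify evidence"]


score = td_lookahead_score("decompose question", toy_prm, toy_propose)
print(round(score, 3))  # blended score in [0, 1]
```

In a full search, such a blended score would rank candidate steps during test-time refinement, with low-scoring steps handed to the PEM for an explanation before being rewritten.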