🤖 AI Summary
In Retrieval-Augmented Generation (RAG), the reasoning process of large language models (LLMs) goes largely unsupervised: existing reinforcement learning (RL) approaches rely solely on final-answer feedback, neglecting the correctness of intermediate reasoning steps and retrieval outcomes.
Method: This paper proposes a novel process–outcome dual-level reward RL framework for RAG. It introduces the first human-annotation-free process-level reward modeling mechanism, jointly optimizing retrieval and reasoning steps within the RAG pipeline. The framework synergistically models reward signals for both the plausibility of the intermediate reasoning–retrieval chain and the accuracy of the final answer.
Contribution/Results: The method significantly enhances reasoning interpretability and answer accuracy. Extensive evaluation across multiple RAG benchmarks demonstrates consistent superiority over state-of-the-art baselines in reasoning accuracy, generalization, and inference efficiency—validating that process-aware supervision provides a broadly effective boost to LLM reasoning capabilities in RAG.
📝 Abstract
Large language models (LLMs) have demonstrated impressive reasoning capabilities with the emergence of reasoning models such as OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating reasoning capabilities into retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches, while the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module that compensates for outcome-level supervision's blindness to intermediate reasoning steps, without requiring additional annotation. Building on this module, we propose Learning to Think-and-Search (LeTS), a novel framework that hybridizes stepwise process rewards and outcome-based rewards into current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of LeTS across various RAG benchmarks. These results also reveal the potential of hybridizing process- and outcome-level rewards to boost LLMs' reasoning ability via RL in other scenarios. The code will be released soon.
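The core dual-level reward idea can be sketched as a simple blend of per-step process rewards with a final-answer outcome reward. This is a minimal illustration only: the function name, the averaging of step rewards, and the linear weighting `alpha` are assumptions for exposition, not LeTS's actual formulation.

```python
from typing import List


def hybrid_reward(step_rewards: List[float], outcome_reward: float,
                  alpha: float = 0.5) -> float:
    """Blend the mean process-level reward with the outcome-level reward.

    step_rewards: plausibility scores for each intermediate
        think-and-search step (hypothetical scoring scale in [0, 1]).
    outcome_reward: correctness signal for the final answer.
    alpha: weight on the process term (illustrative assumption,
        not the paper's actual scheme).
    """
    if not step_rewards:
        # No intermediate steps scored: fall back to outcome-only reward,
        # which recovers standard outcome-supervised RL.
        return outcome_reward
    process_term = sum(step_rewards) / len(step_rewards)
    return alpha * process_term + (1 - alpha) * outcome_reward


# Example: two think-and-search steps judged 1.0 and 0.5 plausible,
# final answer correct (1.0), equal weighting.
r = hybrid_reward([1.0, 0.5], outcome_reward=1.0, alpha=0.5)  # → 0.875
```

With `alpha = 0`, the scheme degenerates to the purely outcome-supervised setting the paper critiques; a nonzero `alpha` lets partially correct reasoning chains earn credit even when the final answer is wrong.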