🤖 AI Summary
In Retrieval-Augmented Generation (RAG), the reasoning process of large language models (LLMs) goes largely unsupervised: existing reinforcement learning (RL) approaches rely solely on final-answer feedback, neglecting the correctness of intermediate reasoning steps and retrieval outcomes.
Method: This paper proposes a novel process–outcome dual-level reward RL framework for RAG. It introduces the first human-annotation-free process-level reward modeling mechanism, jointly optimizing retrieval and reasoning steps within the RAG pipeline. The framework synergistically models reward signals for both the plausibility of the intermediate reasoning–retrieval chain and the accuracy of the final answer.
Contribution/Results: The method significantly enhances reasoning interpretability and answer accuracy. Extensive evaluation across multiple RAG benchmarks demonstrates consistent superiority over state-of-the-art baselines in reasoning accuracy, generalization, and inference efficiency—validating that process-aware supervision provides a broadly effective boost to LLM reasoning capabilities in RAG.
📝 Abstract
Large language models (LLMs) have demonstrated impressive reasoning capabilities with the emergence of reasoning models such as OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating reasoning capabilities into retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches, while the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module that compensates for outcome-level supervision's blindness to intermediate reasoning steps, without requiring additional annotation. Building on this module, we propose Learning to Think-and-Search (LeTS), a novel framework that hybridizes stepwise process rewards and outcome-based rewards into current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of LeTS across various RAG benchmarks. These results also reveal the potential of hybridizing process- and outcome-level rewards to boost LLMs' reasoning ability via RL in other scenarios. The code will be released soon.
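The core dual-level reward idea can be sketched as a simple blend of per-step process rewards with a final-answer outcome reward. This is a minimal illustration only: the function name, the averaging of step rewards, and the linear weighting `alpha` are assumptions for exposition, not LeTS's actual formulation.

```python
from typing import List


def hybrid_reward(step_rewards: List[float], outcome_reward: float,
                  alpha: float = 0.5) -> float:
    """Blend the mean process-level reward with the outcome-level reward.

    step_rewards: plausibility scores for each intermediate
        think-and-search step (hypothetical scoring scale in [0, 1]).
    outcome_reward: correctness signal for the final answer.
    alpha: weight on the process term (illustrative assumption,
        not the paper's actual scheme).
    """
    if not step_rewards:
        # No intermediate steps scored: fall back to outcome-only reward,
        # which recovers standard outcome-supervised RL.
        return outcome_reward
    process_term = sum(step_rewards) / len(step_rewards)
    return alpha * process_term + (1 - alpha) * outcome_reward


# Example: two think-and-search steps judged 1.0 and 0.5 plausible,
# final answer correct (1.0), equal weighting.
r = hybrid_reward([1.0, 0.5], outcome_reward=1.0, alpha=0.5)  # → 0.875
```

With `alpha = 0`, the scheme degenerates to the purely outcome-supervised setting the paper critiques; a nonzero `alpha` lets partially correct reasoning chains earn credit even when the final answer is wrong.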