LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
In Retrieval-Augmented Generation (RAG), large language model (LLM) reasoning remains uncontrolled, and existing reinforcement learning (RL) approaches rely solely on final-answer feedback, neglecting the correctness of intermediate reasoning steps and retrieval outcomes. Method: This paper proposes a novel process–outcome dual-level reward RL framework for RAG. It introduces the first human-annotation-free process-level reward modeling mechanism, jointly optimizing retrieval and reasoning steps within the RAG pipeline. The framework synergistically models reward signals for both the plausibility of the intermediate reasoning–retrieval chain and the accuracy of the final answer. Contribution/Results: The method significantly enhances reasoning interpretability and answer accuracy. Extensive evaluation across multiple RAG benchmarks demonstrates consistent superiority over state-of-the-art baselines in reasoning accuracy, generalization, and inference efficiency—validating that process-aware supervision provides a broadly effective boost to LLM reasoning capabilities in RAG.
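The hybridization described above can be sketched in a few lines. This is an illustrative simplification, not the paper's exact formulation: the mixing weight `alpha`, the mean aggregation over step rewards, and the function name are all assumptions for the sketch.

```python
def hybrid_reward(step_rewards, outcome_reward, alpha=0.5):
    """Combine stepwise process rewards with a final outcome reward.

    step_rewards: per-step scores for intermediate think-and-search steps.
    outcome_reward: score for the correctness of the final answer.
    alpha: assumed mixing weight between process and outcome signals.
    """
    if not step_rewards:
        return outcome_reward
    # Aggregate the process signal as a mean over reasoning/retrieval steps.
    process_reward = sum(step_rewards) / len(step_rewards)
    return alpha * process_reward + (1 - alpha) * outcome_reward

# Example: three intermediate steps judged 1.0, 0.5, 1.0, with a correct
# final answer (outcome reward 1.0).
score = hybrid_reward([1.0, 0.5, 1.0], outcome_reward=1.0)
```

With outcome-only supervision the trajectory above would score 1.0 regardless of the weak middle step; the hybrid score penalizes it, which is the intuition behind process-and-outcome reward hybridization.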

📝 Abstract
Large language models (LLMs) have demonstrated impressive capabilities in reasoning with the emergence of reasoning models like OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating reasoning capabilities into the realm of retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL) approaches, while the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module to mitigate the unawareness of intermediate reasoning steps in outcome-level supervision without additional annotation. Grounded on this, we propose Learning to Think-and-Search (LeTS), a novel framework that hybridizes stepwise process reward and outcome-based reward to current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of LeTS across various RAG benchmarks. In addition, these results reveal the potential of process- and outcome-level reward hybridization in boosting LLMs' reasoning ability via RL under other scenarios. The code will be released soon.
Problem

Research questions and friction points this paper is trying to address.

Address neglect of intermediate reasoning steps in RAG
Hybridize process and outcome rewards for RL in RAG
Enhance LLMs' reasoning via reward hybridization in RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybridizes process and outcome rewards
Mitigates neglect of intermediate reasoning steps
Enhances RAG with stepwise RL supervision
Qi Zhang
Zhejiang University, MYBank, Ant Group
Shouqing Yang
Zhejiang University, MYBank, Ant Group
Lirong Gao
Zhejiang University
Hao Chen
Zhejiang University, MYBank, Ant Group
Xiaomeng Hu
Zhejiang University, MYBank, Ant Group
Jinglei Chen
MYBank, Ant Group
Jiexiang Wang
MYBank, Ant Group
Sheng Guo
Ant Group
Bo Zheng
MYBank, Ant Group
Haobo Wang
Zhejiang University
Junbo Zhao
Zhejiang University