🤖 AI Summary
Retrieval-augmented generation (RAG) suffers from frequent hallucinations and lacks efficient automated evaluation. Method: this paper presents the first systematic application of reinforcement learning from human feedback (RLHF) to RAG optimization. We propose four core RAG quality dimensions (hallucination-free output, comprehensiveness, reliability, and efficiency); construct the first RLHF-oriented reward-modeling dataset for RAG; design a multi-LLM collaborative automated annotation pipeline that leverages GPT-4o to generate high-quality preference data; train a dedicated RAG reward model; and perform end-to-end policy optimization via proximal policy optimization (PPO). Contribution/Results: the reward model achieves state-of-the-art performance on held-out validation sets, and the optimized policy model significantly reduces hallucination rates while improving answer completeness and factual consistency, establishing a new paradigm for enhancing RAG trustworthiness.
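The summary's exact training objective is not stated; a common choice for preference-based reward modeling in RLHF pipelines like this one is the Bradley-Terry pairwise loss, sketched below in pure Python (the function name and scalar inputs are illustrative, standing in for the reward model's scores on a chosen/rejected answer pair):

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style reward-modeling loss:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss is small when the reward model scores the preferred
    answer well above the rejected one, and large when the ordering
    is reversed."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the margin between chosen and rejected grows.
well_separated = pairwise_reward_loss(2.0, -1.0)  # correct ordering, wide margin
confused = pairwise_reward_loss(-1.0, 2.0)        # wrong ordering
print(well_separated < confused)
```

Minimizing this loss over the preference pairs teaches the reward model to rank hallucination-free, comprehensive answers above flawed ones; the trained model then supplies the scalar reward for PPO.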
📝 Abstract
Retrieval-augmented generation (RAG) enhances Large Language Models (LLMs) with relevant and up-to-date knowledge, improving their ability to answer knowledge-intensive questions. It has been shown to enhance both generation quality and trustworthiness. While numerous works have focused on improving retrieval, generation, and evaluation, the role of reward models in reinforcement learning for optimizing RAG and establishing automated benchmarking pipelines remains underexplored. In this paper, we introduce **RAG-Reward**, a dataset designed to enable *hallucination-free, comprehensive, reliable, and efficient RAG*. We define four key metrics for assessing generation quality and develop an automated annotation pipeline that leverages multiple LLMs to generate outputs across diverse RAG scenarios. GPT-4o is used to evaluate and construct preference data. Using **RAG-Reward**, we train reward models and apply reinforcement learning from human feedback (RLHF) to improve LLMs' effectiveness in RAG. Experimental results show that our reward model achieves state-of-the-art performance on a held-out test set, demonstrating both the effectiveness of our approach and the quality of our dataset. Furthermore, the improved generation quality of the trained policy model highlights the feasibility of using RLHF to enhance RAG pipelines.
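The abstract's annotation pipeline (multiple LLMs generate candidates, GPT-4o judges them, preferred/rejected pairs are kept) can be sketched as follows. All names, scores, and the aggregation rule are illustrative assumptions, not the paper's actual implementation:

```python
from itertools import combinations

# The four quality dimensions named in the abstract.
DIMENSIONS = ("hallucination_free", "comprehensive", "reliable", "efficient")

def overall(scores: dict) -> float:
    """Aggregate the four per-dimension judge scores (each assumed in
    [0, 1]) into one preference score. An unweighted mean is used here;
    the paper's actual aggregation may differ."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

def build_preference_pairs(candidates: dict, margin: float = 0.1) -> list:
    """Turn per-candidate judge scores into (chosen, rejected) pairs,
    keeping only pairs whose overall score gap exceeds a margin, so
    near-ties do not produce noisy training labels."""
    pairs = []
    for a, b in combinations(candidates, 2):
        gap = overall(candidates[a]) - overall(candidates[b])
        if abs(gap) >= margin:
            chosen, rejected = (a, b) if gap > 0 else (b, a)
            pairs.append((chosen, rejected))
    return pairs

# Hypothetical judge scores for two generator LLMs on one RAG query.
candidates = {
    "llm_a": {"hallucination_free": 0.9, "comprehensive": 0.8,
              "reliable": 0.9, "efficient": 0.7},
    "llm_b": {"hallucination_free": 0.4, "comprehensive": 0.6,
              "reliable": 0.5, "efficient": 0.8},
}
print(build_preference_pairs(candidates))  # [('llm_a', 'llm_b')]
```

Running this over many queries and generator models yields the kind of preference dataset the reward model is then trained on.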