Reinforcement Learning Enhanced LLMs: A Survey

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 1 (Influential: 0)
🤖 AI Summary
The RL-enhanced large language model (RL-LLM) field suffers from a fragmented understanding due to algorithmic complexity, heterogeneous reward-modeling approaches, and the absence of comprehensive surveys. Method: This work systematically reviews state-of-the-art advances from 2022–2024, unifying the three alignment paradigms of Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and Direct Preference Optimization (DPO) within a cross-method comparative framework. It identifies shared bottlenecks: reward-model bias, low sample efficiency, and poor generalization. Contribution/Results: The work proposes a standardized taxonomy and a structured knowledge graph for RL-LLMs, and open-sources an extensible survey platform. This synthesis delivers a unified conceptual framework, a clear roadmap of the technology's evolution, and actionable future research directions, thereby advancing RL-LLM research from empirical trial-and-error toward systematic, principled development.
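For context on the RLHF paradigm that the summary contrasts with DPO: the standard formulation from the literature (e.g., InstructGPT), not specific to this survey, fine-tunes the policy against a learned reward model under a KL penalty. A sketch, where $r_\phi$ is the learned reward model, $\pi_{\mathrm{ref}}$ the frozen reference policy, and $\beta$ the penalty strength:

```latex
% KL-regularized RLHF policy objective (standard formulation, for reference)
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \left[ r_\phi(x, y) \right]
  \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\left[
    \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)
  \right]
```

The learned $r_\phi$ is where the reward-model bias named among the shared bottlenecks enters, while the KL term limits how far the tuned policy may drift from the reference model.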

📝 Abstract
Reinforcement learning (RL) enhanced large language models (LLMs), exemplified most recently by DeepSeek-R1, have exhibited outstanding performance. Despite its effectiveness in improving LLM capabilities, RL remains highly complex to implement, requiring intricate algorithms, reward-modeling strategies, and optimization techniques. This complexity poses challenges for researchers and practitioners seeking a systematic understanding of RL-enhanced LLMs. Moreover, the absence of a comprehensive survey summarizing existing research on RL-enhanced LLMs has hindered progress in this domain. In this work, we present a systematic review of the most up-to-date research on RL-enhanced LLMs, consolidating and analyzing this rapidly growing field to help researchers understand the current challenges and advancements. Specifically, we (1) detail the basics of RL; (2) introduce popular RL-enhanced LLMs; (3) review research on two widely used reward-model-based RL techniques: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF); and (4) explore Direct Preference Optimization (DPO), a family of methods that bypass the reward model and use human preference data directly to align LLM outputs with human expectations. We also point out current challenges and deficiencies of existing methods and suggest avenues for further improvement. The project page of this work can be found at https://github.com/ShuheWang1998/Reinforcement-Learning-Enhanced-LLMs-A-Survey.
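Point (4) of the abstract describes DPO's defining move: folding the reward model into a closed-form pairwise loss over preference data. A minimal PyTorch sketch of that loss in its standard form (Rafailov et al., 2023), assuming per-sequence log-probabilities have been summed in advance; the function and argument names are illustrative, not taken from the survey:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each tensor holds summed token log-probabilities of the preferred
    (chosen) or dispreferred (rejected) completion for a batch of
    prompts, under the trainable policy or the frozen reference model.
    """
    # Implicit rewards are the scaled log-ratios of policy to reference.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushes the chosen completion above the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Because the reference log-probabilities come from a frozen model, no separate reward model or on-policy sampling loop is needed, which is the source of DPO's simplicity relative to RLHF.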
Problem

Research questions and friction points this paper is trying to address.

Systematically review RL-enhanced LLMs
Analyze challenges in RL implementation
Summarize advancements in RL techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning Enhanced LLMs
Reward Modeling Strategies (see the pairwise loss sketch after this list)
Direct Preference Optimization Methods
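As background for the reward-modeling item above: reward models in RLHF and RLAIF pipelines are conventionally trained with a Bradley-Terry pairwise objective over preference pairs. A minimal sketch of that standard loss (the names are illustrative; the scores would come from a model with a scalar head):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for reward-model training.

    chosen_scores / rejected_scores are scalar rewards r(x, y) emitted
    by the reward model for the preferred and dispreferred completions.
    """
    # Maximize the log-probability that the chosen completion wins.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```

Structurally this is the same logistic margin loss as the DPO sketch above; DPO substitutes the policy's implicit log-ratio reward for these explicit scores, which is exactly how it bypasses reward-model training.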
👥 Authors
Shuhe Wang
Peking University, University of Melbourne
Natural Language Processing · Machine Learning
Shengyu Zhang
Zhejiang University
Jie Zhang
CFAR and IHPC, A*STAR, Singapore
Runyi Hu
Nanyang Technological University
Large Language Model · AI Alignment · Watermarking
Xiaoya Li
University of Washington
Tianwei Zhang
Nanyang Technological University
Jiwei Li
Zhejiang University
Fei Wu
Zhejiang University
Guoyin Wang
Eduard H. Hovy
The University of Melbourne