Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing surveys lack systematic coverage of reinforcement learning (RL) across the full lifecycle of large language models (LLMs). This work presents the first comprehensive review of RL's theoretical advances and practical applications across three critical LLM stages (pre-training, alignment fine-tuning, and reinforcement-based reasoning), highlighting mechanistic insights from methods such as RLVR for enhancing both reasoning capability and alignment performance. It unifies three reward signal sources (human annotations, AI-assisted preferences, and programmatic validation) and conducts empirical analysis using mainstream open-source RL frameworks, while also collating and benchmarking key datasets and evaluation protocols. Its contributions include a structured knowledge graph for RL-LLM research, reproducible technical pipelines, and a forward-looking roadmap emphasizing the co-evolution of intelligence, generalization, and safety in next-generation RL-LLM systems.

📝 Abstract
In recent years, training methods centered on Reinforcement Learning (RL) have markedly enhanced the reasoning and alignment performance of Large Language Models (LLMs), particularly in understanding human intents, following user instructions, and strengthening inferential capability. Although existing surveys offer overviews of RL-augmented LLMs, their scope is often limited, failing to provide a comprehensive summary of how RL operates across the full lifecycle of LLMs. We systematically review the theoretical and practical advancements whereby RL empowers LLMs, especially Reinforcement Learning with Verifiable Rewards (RLVR). First, we briefly introduce the basic theory of RL. Second, we thoroughly detail application strategies for RL across the various phases of the LLM lifecycle, including pre-training, alignment fine-tuning, and reinforced reasoning. In particular, we emphasize that RL methods in the reinforced reasoning phase serve as a pivotal driving force for advancing model reasoning to its limits. Next, we collate existing datasets and evaluation benchmarks currently used for RL fine-tuning, spanning human-annotated datasets, AI-assisted preference data, and program-verification-style corpora. Subsequently, we review the mainstream open-source tools and training frameworks available, providing clear practical references for subsequent research. Finally, we analyse future challenges and trends in the field of RL-enhanced LLMs. This survey aims to present researchers and practitioners with the latest developments and frontier trends at the intersection of RL and LLMs, with the goal of fostering the evolution of LLMs that are more intelligent, generalizable, and secure.
Problem

Research questions and friction points this paper is trying to address.

Surveying how Reinforcement Learning enhances Large Language Models' capabilities
Analyzing RL applications across the entire LLM lifecycle from pre-training to reasoning
Providing comprehensive overview of RL methods, datasets, tools and future challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

RLVR enhances LLMs with verifiable reward mechanisms
RL applied across LLM lifecycle from pre-training to reasoning
Reinforced reasoning phase pushes model reasoning to limits
👥 Authors
Keliang Liu
Fudan University, China
Dingkang Yang
ByteDance
Multimodal Learning, Generative AI, Embodied AI
Ziyun Qian
Fudan University, China
Weijie Yin
ByteDance
Vision Language Model, Deep Learning, AI4S
Yuchi Wang
CUHK MMLab; Peking University
Multimodality, VLM, Generative Models
Hongsheng Li
The Chinese University of Hong Kong, MMLab, China
Jun Liu
Lancaster University, UK
Peng Zhai
Fudan University, China
Yang Liu
Tongji University, China and The University of Toronto, Canada
Lihua Zhang
Wuhan University
computational biology, bioinformatics, data mining