A Survey of Reinforcement Learning from Human Feedback

📅 2023-12-22

🏛️ arXiv.org

📈 Citations: 263

✨ Influential: 11

🤖 AI Summary

Problem: Reinforcement learning (RL) often relies on hand-crafted reward functions, which struggle to capture complex, nuanced human values. Method: This work systematically reviews RL from human feedback (RLHF), drawing on reinforcement learning, Bayesian inference, preference modeling, reward modeling, and human-in-the-loop evaluation to cover heterogeneous, multi-source feedback. It offers a unified, cross-task and cross-modal analytical framework for RLHF that extends beyond the scope of traditional preference-based RL (PbRL). Contribution/Results: The framework provides a theoretical foundation and a practical roadmap for aligning AI systems with human values. It traces the technical evolution of the field, identifies core challenges (e.g., feedback sparsity, bias propagation, scalability), and proposes a taxonomy for organizing RLHF research, serving as a guide for algorithm design, ethical assessment, and real-world deployment.
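The preference- and reward-modeling step mentioned above is commonly instantiated with the Bradley-Terry model, in which the probability that a human prefers item a over item b is sigmoid(r(a) - r(b)), and the reward r is fit by maximizing the log-likelihood of recorded comparisons. The following is a minimal sketch under stated assumptions, not the survey's own method: the reward is a linear function of two hand-picked features (real systems use a neural network), the "human" is a simulated annotator that always prefers the item with the larger first feature, and all helper names (`sigmoid`, `reward`, `preference_prob`, `fit_reward_model`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def reward(weights, features):
    """Linear reward r(x) = w . phi(x), a toy stand-in for a neural reward model."""
    return sum(w * f for w, f in zip(weights, features))

def preference_prob(weights, preferred, rejected):
    """Bradley-Terry probability that `preferred` is chosen over `rejected`."""
    return sigmoid(reward(weights, preferred) - reward(weights, rejected))

def fit_reward_model(comparisons, dim, lr=0.5, epochs=200):
    """Maximize the Bradley-Terry log-likelihood of pairwise judgments by gradient ascent.

    comparisons: list of (preferred_features, rejected_features) pairs.
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for preferred, rejected in comparisons:
            p = preference_prob(weights, preferred, rejected)
            # d/dw log sigmoid(r(a) - r(b)) = (1 - p) * (phi(a) - phi(b))
            for i in range(dim):
                grad[i] += (1.0 - p) * (preferred[i] - rejected[i])
        weights = [w + lr * g / len(comparisons) for w, g in zip(weights, grad)]
    return weights

# Simulated annotator (an assumption): prefers the item with the larger first feature
# and ignores the second feature entirely.
rng = random.Random(0)
hidden_weights = [1.0, 0.0]
comparisons = []
for _ in range(100):
    a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    b = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    if reward(hidden_weights, a) >= reward(hidden_weights, b):
        comparisons.append((a, b))
    else:
        comparisons.append((b, a))

learned = fit_reward_model(comparisons, dim=2)
print(learned)
```

The learned weights are only identified up to the information in the comparisons: pairwise judgments constrain reward differences, so the model recovers which features matter and in which direction, not an absolute reward scale.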

📝 Abstract

Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning offers a promising avenue to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The training of large language models (LLMs) has impressively demonstrated this potential in recent years, where RLHF played a decisive role in directing the model's capabilities toward human objectives. This article provides a comprehensive overview of the fundamentals of RLHF, exploring the intricate dynamics between RL agents and human input. While recent focus has been on RLHF for LLMs, our survey adopts a broader perspective, examining the diverse applications and wide-ranging impact of the technique. We delve into the core principles that underpin RLHF, shedding light on the symbiotic relationship between algorithms and human feedback, and discuss the main research trends in the field. By synthesizing the current landscape of RLHF research, this article aims to provide researchers as well as practitioners with a comprehensive understanding of this rapidly growing field of research.
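In the LLM setting the abstract highlights, a widely used recipe (not necessarily the one this survey prescribes) optimizes the policy against a learned reward while penalizing divergence from a reference policy, maximizing E_pi[r] - beta * KL(pi || pi_ref). For a single prompt with a finite set of candidate responses, this objective has the known maximizer pi*(a) ∝ pi_ref(a) * exp(r(a) / beta). The sketch below is a toy under stated assumptions: three candidate responses, hypothetical reward values standing in for a trained reward model, and the exact gradient in place of the sampled policy-gradient estimates (e.g., PPO) that practical pipelines rely on. All function names are illustrative.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_regularized_objective(pi, rewards, ref, beta):
    """Expected learned reward minus beta * KL(pi || ref)."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * (math.log(p) - math.log(q)) for p, q in zip(pi, ref))
    return expected_reward - beta * kl

def optimize_policy(rewards, ref, beta, lr=1.0, steps=2000):
    """Exact gradient ascent on the logits of a softmax policy over discrete responses."""
    logits = [math.log(q) for q in ref]  # start at the reference policy
    for _ in range(steps):
        pi = softmax(logits)
        # Per-response signal f_a = r_a - beta * log(pi_a / ref_a)
        f = [r - beta * (math.log(p) - math.log(q)) for r, p, q in zip(rewards, pi, ref)]
        baseline = sum(p * fa for p, fa in zip(pi, f))
        # Softmax chain rule: dJ/dlogit_k = pi_k * (f_k - E_pi[f])
        logits = [z + lr * p * (fa - baseline) for z, p, fa in zip(logits, pi, f)]
    return softmax(logits)

def closed_form_optimum(rewards, ref, beta):
    """Known maximizer of the objective: pi*(a) proportional to ref(a) * exp(r(a) / beta)."""
    return softmax([math.log(q) + r / beta for r, q in zip(rewards, ref)])

# Hypothetical learned rewards for three candidate responses to one prompt.
rewards = [1.0, 0.5, 0.0]
ref = [0.2, 0.5, 0.3]  # reference policy, e.g. a supervised fine-tuned model (assumption)
beta = 0.5

pi = optimize_policy(rewards, ref, beta)
target = closed_form_optimum(rewards, ref, beta)
print([round(p, 3) for p in pi])
print([round(p, 3) for p in target])
```

Raising `beta` keeps the optimized policy closer to `ref`; lowering it lets the learned reward dominate, which can make the policy more prone to exploiting errors in the reward model.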

Problem

Research questions and friction points this paper is trying to address.

Surveying RLHF fundamentals and human-agent interaction

Exploring RLHF applications beyond large language models

Providing comprehensive coverage of RLHF in control, robotics, and LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses human feedback instead of engineered rewards

Applies technique across control, robotics, and language models

Combines algorithms with human feedback for alignment