🤖 AI Summary
This work addresses the lack of a unified theoretical foundation for Reinforcement Learning from Human Feedback (RLHF). We propose the first contextual preference bandit framework covering the entire RLHF lifecycle, from training to deployment. Methodologically, we model human preferences via a Bradley-Terry model with a linearly parameterized reward function and design multi-stage adaptive algorithms supporting both passive and active data collection. Crucially, our approach provides the first provable guarantees, covering both statistical convergence and computational efficiency, for the full RLHF pipeline. For empirical evaluation, we fine-tune Llama-3-8B-Instruct on the UltraFeedback-binarized dataset. Results demonstrate that our method significantly improves both training efficiency and deployment performance over existing approaches, achieving tighter statistical error bounds, lower computational overhead, and superior alignment quality.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is a widely used approach for aligning Large Language Models (LLMs) with human preferences. While recent advances have provided valuable insights into various stages and settings of RLHF, a comprehensive theoretical understanding of the entire RLHF pipeline remains lacking. Towards this end, we propose a unified framework for the RLHF pipeline from the view of contextual bandits and provide provable efficiency guarantees. In particular, we decompose the RLHF process into two distinct stages: (post-)training and deployment, exploring both passive and active data collection strategies during the training phase. By employing the Bradley-Terry preference model with a linearly parameterized reward function, we reformulate RLHF as a contextual preference bandit problem. We then develop novel algorithms for each stage, demonstrating significant improvements over existing approaches in both statistical and computational efficiency. Finally, we apply our method to train and deploy Llama-3-8B-Instruct on the UltraFeedback-binarized dataset, and empirical results confirm the effectiveness of our approach.
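To make the modeling assumption concrete, here is a minimal sketch of the Bradley-Terry preference model with a linearly parameterized reward, as described in the abstract. The feature map and parameter values below are illustrative assumptions, not the paper's actual setup: the reward of a (prompt, response) pair is the inner product of an unknown parameter `theta` with a feature embedding `phi(x, y)`, and the probability of preferring one response over another is a sigmoid of their reward gap.

```python
import numpy as np

def bt_preference_prob(theta, phi_1, phi_2):
    """P(response 1 is preferred over response 2) under a linear
    Bradley-Terry model: reward r(x, y) = <theta, phi(x, y)>, and
    P(y1 > y2 | x) = sigmoid(r(x, y1) - r(x, y2))."""
    return 1.0 / (1.0 + np.exp(-(phi_1 - phi_2) @ theta))

# Toy example with 3-dimensional feature embeddings (hypothetical values).
theta = np.array([1.0, -0.5, 2.0])   # unknown reward parameter (assumed here)
phi_a = np.array([0.9, 0.1, 0.8])    # features of (prompt, response A)
phi_b = np.array([0.2, 0.4, 0.3])    # features of (prompt, response B)

p = bt_preference_prob(theta, phi_a, phi_b)
# Response A has higher linear reward under this theta, so p > 0.5.
```

In the training stage, `theta` would be estimated from observed pairwise preference labels (e.g., by maximizing the Bradley-Terry log-likelihood), with the data collected either passively or actively as the abstract describes.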