Simplify RLHF as Reward-Weighted SFT: A Variational Method

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement Learning from Human Feedback (RLHF), while pivotal for aligning large language models (LLMs) with human values, suffers from high implementation complexity, training instability, and susceptibility to overfitting. To address these limitations, we propose VAR (Variational Alignment with Re-weighting), an alignment framework grounded in variational inference. VAR reformulates the RLHF objective as a reward-weighted supervised fine-tuning (SFT) problem: it directly minimizes the KL divergence between the policy distribution and the optimal solution of RLHF, thereby eliminating the reinforcement learning loop entirely. Through variational approximation, reward re-weighting, and a small adaptation of the SFT loss, VAR retains end-to-end differentiability while substantially improving training stability. Empirical results show that VAR matches or exceeds state-of-the-art methods, including DPO and A-LoL, on key metrics such as helpfulness and harmlessness, while converging more robustly and being significantly simpler to implement.
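As a rough illustration of the reward-weighted SFT idea described in the summary, the following PyTorch-style sketch shows what such a loss could look like in practice. The function name, the softmax reward weighting, and the per-sequence normalization are assumptions made for illustration; they are not taken from the paper's implementation.

import torch
import torch.nn.functional as F

def reward_weighted_sft_loss(logits, labels, rewards, beta=1.0, ignore_index=-100):
    # Illustrative sketch of a reward-weighted SFT loss (assumed form, not the paper's code).
    # logits:  (batch, seq_len, vocab) policy outputs
    # labels:  (batch, seq_len) target token ids; prompt/padding positions set to ignore_index
    # rewards: (batch,) scalar reward-model score for each response
    # beta:    temperature controlling how sharply rewards re-weight examples

    # Shift so that position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]

    # Per-token negative log-likelihood, keeping one value per token.
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).view(shift_labels.size())

    # Average over the response tokens of each sequence.
    mask = (shift_labels != ignore_index).float()
    seq_nll = (nll * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    # Reward-derived weights: higher-reward responses get a larger supervised weight.
    # Softmax over the batch is one simple normalization choice (an assumption here).
    weights = torch.softmax(rewards / beta, dim=0).detach()

    # The result is still an ordinary supervised loss, so no RL loop is needed.
    return (weights * seq_nll).sum()

In a training loop this would replace the usual mean cross-entropy over batches of (prompt, response, reward) triples drawn from an offline dataset, which is where the simplification over PPO-style RLHF comes from.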

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high implementation complexity and computational cost. Even with recent simplifications such as Direct Preference Optimization (DPO) and Advantage Leftover Lunch (A-LoL), over-fitting and training instability continue to keep the alignment process from reaching its expected optimal performance. To address these challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called $\textbf{V}$ariational $\textbf{A}$lignment with $\textbf{R}$e-weighting ($\textbf{VAR}$). More specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment objective into a reward-driven re-weighted supervised fine-tuning (SFT) form, which requires only a minor adjustment to the SFT loss to obtain a noticeable improvement in training stability and effectiveness. On comprehensive alignment and generation benchmarks, our VAR method achieves competitive performance in LLM alignment helpfulness and harmlessness.
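To make the "distribution gap" argument concrete, here is a short reconstruction of the standard KL-regularized RLHF derivation in conventional notation. The exact constants, normalization, and loss adaptation used in the paper may differ; this is a sketch of the general reasoning, not the paper's exact formulation. The KL-regularized RLHF objective

\max_{\pi}\ \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\,\mathrm{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)

has the closed-form optimum

\pi^{*}(y\mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big),

where $Z(x)$ is a partition function. Fitting the learnable policy $\pi_{\theta}$ to $\pi^{*}$ by minimizing a KL divergence then turns alignment into a supervised objective whose examples are re-weighted by $\exp(r(x,y)/\beta)$:

\min_{\theta}\ \mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\Big[\tfrac{1}{Z(x)}\,\exp\!\big(r(x,y)/\beta\big)\,\big(-\log\pi_{\theta}(y\mid x)\big)\Big].

This is a reward-weighted SFT loss that requires no sampling from the learning policy and no reinforcement learning loop.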
Problem

Research questions and friction points this paper is trying to address.

Simplify RLHF complexity
Address over-fitting issues
Improve training stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational inference simplifies RLHF
Reward-driven re-weighted SFT method
Enhances training stability and effectiveness
Authors
Yuhao Du
The Chinese University of Hong Kong, Shenzhen; Shenzhen Research Institute of Big Data
Zhuo Li
The Chinese University of Hong Kong, Shenzhen; Shenzhen Research Institute of Big Data
Pengyu Cheng
Alibaba Group
Machine Learning, Natural Language Processing
Zhihong Chen
Stanford University
Yuejiao Xie
The Chinese University of Hong Kong, Shenzhen
RL, LLM, Route Planning
Xiang Wan
Shenzhen Research Institute of Big Data
Bioinformatics, Data Mining, Big Data Analysis
Anningzhe Gao
ByteDance