Simplify RLHF as Reward-Weighted SFT: A Variational Method

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement Learning from Human Feedback (RLHF), while pivotal for aligning large language models (LLMs) with human values, suffers from high implementation complexity, training instability, and susceptibility to overfitting. To address these limitations, we propose VAR (Variational Alignment with Re-weighting), an alignment framework grounded in variational inference. VAR reformulates the RLHF objective as a reward-weighted supervised fine-tuning (SFT) problem: it directly minimizes the KL divergence between the policy distribution and the optimal solution of RLHF, thereby eliminating the reinforcement learning loop entirely. Through variational approximation, reward re-weighting, and a small adaptation of the SFT loss, VAR retains end-to-end differentiability while substantially improving training stability. Empirical results show that VAR matches or exceeds state-of-the-art methods, including DPO and A-LoL, on key metrics such as helpfulness and harmlessness, while converging more robustly and being significantly simpler to implement.
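As a rough illustration of the reward-weighted SFT idea described in the summary, the following PyTorch-style sketch shows what such a loss could look like in practice. The function name, the softmax reward weighting, and the per-sequence normalization are assumptions made for illustration; they are not taken from the paper's implementation.

import torch
import torch.nn.functional as F

def reward_weighted_sft_loss(logits, labels, rewards, beta=1.0, ignore_index=-100):
    # Illustrative sketch of a reward-weighted SFT loss (assumed form, not the paper's code).
    # logits:  (batch, seq_len, vocab) policy outputs
    # labels:  (batch, seq_len) target token ids; prompt/padding positions set to ignore_index
    # rewards: (batch,) scalar reward-model score for each response
    # beta:    temperature controlling how sharply rewards re-weight examples

    # Shift so that position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]

    # Per-token negative log-likelihood, keeping one value per token.
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).view(shift_labels.size())

    # Average over the response tokens of each sequence.
    mask = (shift_labels != ignore_index).float()
    seq_nll = (nll * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    # Reward-derived weights: higher-reward responses get a larger supervised weight.
    # Softmax over the batch is one simple normalization choice (an assumption here).
    weights = torch.softmax(rewards / beta, dim=0).detach()

    # The result is still an ordinary supervised loss, so no RL loop is needed.
    return (weights * seq_nll).sum()

In a training loop this would replace the usual mean cross-entropy over batches of (prompt, response, reward) triples drawn from an offline dataset, which is where the simplification over PPO-style RLHF comes from.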

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high implementation complexity and computational cost. Even with recent simplifications such as Direct Preference Optimization (DPO) and Advantage Leftover Lunch (A-LoL), over-fitting and training instability continue to keep the alignment process from reaching its expected optimal performance. To address these challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called $\textbf{V}$ariational $\textbf{A}$lignment with $\textbf{R}$e-weighting ($\textbf{VAR}$). More specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment objective into a reward-driven re-weighted supervised fine-tuning (SFT) form, which requires only a minor adjustment to the SFT loss to obtain a noticeable improvement in training stability and effectiveness. On comprehensive alignment and generation benchmarks, our VAR method achieves competitive performance in LLM alignment helpfulness and harmlessness.
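To make the "distribution gap" argument concrete, here is a short reconstruction of the standard KL-regularized RLHF derivation in conventional notation. The exact constants, normalization, and loss adaptation used in the paper may differ; this is a sketch of the general reasoning, not the paper's exact formulation. The KL-regularized RLHF objective

\max_{\pi}\ \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\,\mathrm{KL}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)

has the closed-form optimum

\pi^{*}(y\mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big),

where $Z(x)$ is a partition function. Fitting the learnable policy $\pi_{\theta}$ to $\pi^{*}$ by minimizing a KL divergence then turns alignment into a supervised objective whose examples are re-weighted by $\exp(r(x,y)/\beta)$:

\min_{\theta}\ \mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\Big[\tfrac{1}{Z(x)}\,\exp\!\big(r(x,y)/\beta\big)\,\big(-\log\pi_{\theta}(y\mid x)\big)\Big].

This is a reward-weighted SFT loss that requires no sampling from the learning policy and no reinforcement learning loop.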
Problem

Research questions and friction points this paper is trying to address.

Simplify RLHF complexity
Address over-fitting issues
Improve training stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational inference simplifies RLHF
Reward-driven re-weighted SFT method
Enhances training stability and effectiveness
Authors
Yuhao Du
The Chinese University of Hong Kong, Shenzhen; Shenzhen Research Institute of Big Data
Zhuo Li
The Chinese University of Hong Kong, Shenzhen; Shenzhen Research Institute of Big Data
Pengyu Cheng
Alibaba Group
Machine Learning, Natural Language Processing
Zhihong Chen
Stanford University
Yuejiao Xie
The Chinese University of Hong Kong, Shenzhen
RL, LLM, Route Planning
Xiang Wan
Shenzhen Research Institute of Big Data
Bioinformatics, Data Mining, Big Data Analysis
Anningzhe Gao
ByteDance