DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing human feedback–based alignment methods, such as reinforcement learning from human feedback (RLHF), which rely on large-scale preference data, incur high costs, suffer from training instability, and often degrade model generalization. To overcome these challenges, the authors propose DEFT, an efficient alignment framework that introduces a novel differential distributional reward mechanism. This mechanism quantifies the divergence between the language model’s output distribution and the distribution implied by preference data, enabling the selection of a high-quality, small-scale subset for training. DEFT then integrates supervised fine-tuning with contrastive learning to guide distributional alignment. Experimental results demonstrate that DEFT significantly reduces both data requirements and training time while simultaneously improving alignment performance and model generalization, outperforming current state-of-the-art approaches across the board.
📝 Abstract
Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or to integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance. DEFT computes a differential distribution reward from the output distribution of the language model and the discrepancy distribution of the preference data. This reward is used to filter a small yet high-quality subset from the raw data, which is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.
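The abstract describes filtering a small, high-quality training subset by scoring how far the model's output distribution diverges from the preferences in the data. The paper's exact reward formulation is not given here, so the sketch below is only a minimal illustration of the general idea: score each preference pair by the model's log-probability gap between the chosen and rejected responses, and keep the pairs where the model disagrees most with the human label (the function and field names are hypothetical, not from the paper).

```python
def seq_logprob(token_logprobs):
    """Length-normalized log-probability of a response (mean per-token log-prob)."""
    return sum(token_logprobs) / len(token_logprobs)

def differential_reward(pair):
    """Illustrative stand-in for DEFT's differential distribution reward:
    the model's log-prob gap between the human-chosen and human-rejected
    responses. A large positive value means the model already agrees with
    the preference label; a negative value marks a pair it has yet to learn."""
    return seq_logprob(pair["chosen"]) - seq_logprob(pair["rejected"])

def filter_subset(pairs, k):
    """Keep the k pairs where the model disagrees most with the human
    preference (lowest reward) -- assumed here to be the most informative."""
    return sorted(pairs, key=differential_reward)[:k]

# Toy data: per-token log-probs the policy assigned to each response.
data = [
    {"id": 0, "chosen": [-0.1, -0.2], "rejected": [-2.0, -1.5]},  # model agrees
    {"id": 1, "chosen": [-1.8, -2.1], "rejected": [-0.3, -0.4]},  # model disagrees
    {"id": 2, "chosen": [-0.9, -1.0], "rejected": [-1.0, -0.9]},  # near tie
]
subset = filter_subset(data, k=2)
print([p["id"] for p in subset])  # → [1, 2]
```

In this toy run the disagreeing pair (id 1) and the near-tie (id 2) are selected, while the pair the model already ranks correctly (id 0) is dropped; the selected subset would then be used for the SFT-plus-contrastive fine-tuning stage the abstract describes.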
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning from Human Feedback
Large Language Models
human alignment
generalization ability
preference data
Innovation

Methods, ideas, or system contributions that make the work stand out.

distribution-guided fine-tuning
differential distribution reward
data filtering
human alignment
efficient LLM alignment
🔎 Similar Papers
No similar papers found.
Liang Zhu
Southern University of Science and Technology
Feiteng Fang
University of Science and Technology of China
LLM, NLP
Yuelin Bai
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Longze Chen
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Natural Language Processing
Zhexiang Zhang
University of Copenhagen
Minghuan Tan
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Min Yang
Bytedance
Vision Language Model, Computer Vision, Video Understanding