Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

πŸ“… 2025-06-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Direct preference optimization (DPO) and similar alignment algorithms suffer from reward over-optimization, in which the policy diverges from the reference model and performance degrades. To address this, we propose IS-DAAsβ€”*Importance-Sampling-based Direct Alignment Algorithms*β€”the first offline direct alignment framework to incorporate clipped importance-sampling ratios. By jointly correcting bias and reducing variance through importance sampling, IS-DAAs significantly mitigate over-optimization even under low regularization strength. Crucially, they require no additional training objectives or hyperparameter tuning, preserving consistency with the reference policy while improving preference-alignment accuracy and generation quality. Extensive evaluation across diverse multitask benchmarks demonstrates that IS-DAAs consistently outperform existing over-optimization mitigation methods, validating their effectiveness and generalizability.

πŸ“ Abstract
Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values. However, these methods are more susceptible to over-optimization, in which the model drifts away from the reference policy, leading to degraded performance as training progresses. This paper proposes a novel importance-sampling approach, called IS-DAAs, to mitigate the over-optimization problem of offline DAAs. This approach multiplies the DAA objective by an importance ratio that accounts for the reference policy distribution. IS-DAAs additionally avoid the high-variance issue associated with importance sampling by clipping the importance ratio to a maximum value. Our extensive experiments demonstrate that IS-DAAs can effectively mitigate over-optimization, especially under low regularization strength, and achieve better performance than other methods designed to address this problem. Our implementations are provided publicly at this link.
Problem

Research questions and friction points this paper is trying to address.

Mitigates reward over-optimization in Direct Alignment Algorithms
Addresses model drift from reference policy in DAAs
Reduces high variance in importance sampling for DAAs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Importance sampling mitigates reward over-optimization
Clipped importance ratio reduces high variance
IS-DAAs outperform alternative mitigation methods under low regularization strength
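The core mechanics described above can be sketched in a few lines: reweight the per-pair DPO loss by an importance ratio involving the reference policy, clipped from above to bound variance. This is a toy single-pair illustration under stated assumptions, not the paper's released code; the function names, the exact form of the ratio (reference policy over the data-collection policy), and the clip threshold are assumptions for illustration.

```python
import math

def clipped_is_weight(ref_logp, data_logp, clip_max=2.0):
    # Importance ratio pi_ref(y|x) / pi_data(y|x), clipped from above
    # to avoid the high variance of unclipped importance sampling.
    return min(math.exp(ref_logp - data_logp), clip_max)

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO logistic loss on one preference pair
    # (w = chosen response, l = rejected response).
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # numerically stable -log(sigmoid(margin))

def is_dpo_loss(policy_logp_w, policy_logp_l,
                ref_logp_w, ref_logp_l,
                data_logp_w, data_logp_l,
                beta=0.1, clip_max=2.0):
    # Multiply the per-pair DPO objective by the clipped importance ratio of
    # the pair under the reference policy vs. the data-collection policy.
    w = clipped_is_weight(ref_logp_w + ref_logp_l,
                          data_logp_w + data_logp_l, clip_max)
    return w * dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta)
```

With `clip_max` reached, the weight saturates instead of growing exponentially with the log-probability gap, which is the variance-reduction effect the abstract describes.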
Phuc Minh Nguyen
VinUniversity
Ngoc-Hieu Nguyen
VinUniversity
Duy H. M. Nguyen
Max Planck Research School for Intelligent Systems (IMPRS-IS)
Anji Liu
Assistant Professor, National University of Singapore
Machine Learning, Generative Models, Probabilistic Circuits
An Mai
International University - VNUHCM
Binh T. Nguyen
VinUniversity
Statistics, Optimal Transport
Daniel Sonntag
DFKI and University of Oldenburg
Interactive Machine Learning, Intelligent User Interfaces, Multimodal Interaction
Khoa D. Doan
VinUni-Illinois Smart Health Center