Weights-Rotated Preference Optimization for Large Language Models

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Direct Preference Optimization (DPO) suffers from reward hacking rooted in neuron collapse, which manifests as verbose outputs, reduced diversity, and knowledge forgetting. Method: We propose a weight-rotation regularization mechanism that explicitly enforces multi-granularity orthogonality constraints on intermediate hidden states via learned orthogonal matrices, while implicitly regularizing the output-layer logits with the KL divergence inherited from DPO, thereby suppressing representational redundancy in parameter space and preventing the policy from deviating too far from the reference model. Integrated within the DPO framework, the approach enables stable preference learning with parameter-efficient adaptation. Contribution/Results: The method achieves up to a 3.27-point improvement on AlpacaEval 2 and surpasses the strongest baseline by 6.2 to 7.5 points on MT-Bench while training only 0.015% of the parameters. It substantially mitigates reward hacking while preserving model expressivity and generalization.
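
For context, the "implicit" constraint mentioned above is the KL regularization built into the standard DPO objective, which scores the policy's log-probabilities against a frozen reference model. A minimal sketch in PyTorch (function and argument names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective. The log-ratios against the frozen
    reference model act as an implicit KL constraint on the policy."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # beta scales how strongly deviation from the reference is penalized;
    # reward hacking appears when the rejected log-ratio is pushed far down.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```

RoPO keeps this objective and adds the explicit weight-rotation constraint described in the abstract below.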

📝 Abstract
Despite the efficacy of Direct Preference Optimization (DPO) in aligning Large Language Models (LLMs), reward hacking remains a pivotal challenge. This issue emerges when LLMs excessively reduce the probability of rejected completions to achieve high rewards without genuinely meeting their intended goals. The result is overly lengthy, low-diversity generation and catastrophic forgetting of knowledge. We trace the underlying cause to representation redundancy produced by neuron collapse in the parameter space. Hence, we propose a novel Weights-Rotated Preference Optimization (RoPO) algorithm, which implicitly constrains the output-layer logits with the KL divergence inherited from DPO and explicitly constrains the intermediate hidden states by fine-tuning on a multi-granularity orthogonal matrix. This design prevents the policy model from deviating too far from the reference model, thereby retaining the knowledge and expressive capabilities acquired during the pre-training and SFT stages. RoPO achieves up to a 3.27-point improvement on AlpacaEval 2 and surpasses the best baseline by 6.2 to 7.5 points on MT-Bench with merely 0.015% of the parameters trainable, demonstrating its effectiveness in alleviating the reward hacking problem of DPO.
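
The "multi-granularity orthogonal matrix" fine-tuning is in the spirit of orthogonal fine-tuning (OFT): pre-trained weights stay frozen and only a learned rotation applied to them is trained, so hidden-state norms and angles are preserved. A hedged sketch, assuming a Cayley-parameterized block-diagonal rotation (block size and layer placement are illustrative; the paper's exact granularity scheme may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotatedLinear(nn.Module):
    """Wraps a frozen pre-trained nn.Linear and learns a block-diagonal
    orthogonal rotation R applied to its weight matrix: W' = R @ W.
    Since R is orthogonal, it rotates rather than rescales the layer's
    outputs, limiting drift from the reference model."""

    def __init__(self, base: nn.Linear, block_size: int = 64):
        super().__init__()
        assert base.out_features % block_size == 0
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pre-trained weights stay frozen
        self.block_size = block_size
        n_blocks = base.out_features // block_size
        # Unconstrained parameters; skew-symmetry is imposed in rotation().
        # Zero init makes R start as the identity, i.e. W' = W at step 0.
        self.skew = nn.Parameter(torch.zeros(n_blocks, block_size, block_size))

    def rotation(self) -> torch.Tensor:
        # Cayley transform: R = (I + A)^{-1} (I - A) is orthogonal
        # whenever A is skew-symmetric (A^T = -A).
        A = self.skew - self.skew.transpose(-1, -2)
        I = torch.eye(self.block_size, device=A.device).expand_as(A)
        blocks = torch.linalg.solve(I + A, I - A)  # batched solve per block
        return torch.block_diag(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.rotation() @ self.base.weight  # rotate the frozen weights
        return F.linear(x, W, self.base.bias)
```

Because only the small skew-symmetric blocks are trained while W stays frozen, the trainable-parameter budget remains tiny, in line with the 0.015% figure reported above.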
Problem

Research questions and friction points this paper is trying to address.

Addresses reward hacking in DPO for LLM alignment
Mitigates representation redundancy from neuron collapse
Prevents catastrophic forgetting and maintains generation diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weights-Rotated Preference Optimization (RoPO) mitigates DPO reward hacking
Implicitly constrains output-layer logits (KL) and explicitly constrains intermediate hidden states
Fine-tunes multi-granularity orthogonal matrices to preserve expressivity and diversity
👥 Authors
Chenxu Yang
Institute of Information Engineering, Chinese Academy of Sciences
NLP · Dialogue Generation
Ruipeng Jia
Baidu Inc., Beijing, China
Mingyu Zheng
Institute of Information Engineering, Chinese Academy of Sciences
NLP · Table Understanding · LLMs
Naibin Gu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Zheng Lin
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Siyuan Chen
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Weichong Yin
Baidu Inc., Beijing, China
Hua Wu
Baidu Inc., Beijing, China
Weiping Wang
School of Information Science and Engineering, Central South University
Computer Network · Network Security