Weights-Rotated Preference Optimization for Large Language Models

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Direct Preference Optimization (DPO) suffers from reward hacking rooted in neuron collapse, which manifests as verbose outputs, reduced diversity, and knowledge forgetting. Method: We propose a weight-rotation regularization mechanism that explicitly enforces multi-granularity orthogonality constraints on intermediate hidden states via learned orthogonal matrices, while implicitly regularizing the output-layer logits with the KL divergence inherited from DPO, thereby suppressing representational redundancy in parameter space and preventing the policy from deviating too far from the reference model. Integrated within the DPO framework, the approach enables stable preference learning with parameter-efficient adaptation. Contribution/Results: The method achieves up to a 3.27-point improvement on AlpacaEval 2 and surpasses the strongest baseline by 6.2 to 7.5 points on MT-Bench while training only 0.015% of the parameters. It substantially mitigates reward hacking while preserving model expressivity and generalization.
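
For context, the "implicit" constraint mentioned above is the KL regularization built into the standard DPO objective, which scores the policy's log-probabilities against a frozen reference model. A minimal sketch in PyTorch (function and argument names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective. The log-ratios against the frozen
    reference model act as an implicit KL constraint on the policy."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # beta scales how strongly deviation from the reference is penalized;
    # reward hacking appears when the rejected log-ratio is pushed far down.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```

RoPO keeps this objective and adds the explicit weight-rotation constraint described in the abstract below.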

📝 Abstract
Despite the efficacy of Direct Preference Optimization (DPO) in aligning Large Language Models (LLMs), reward hacking remains a pivotal challenge. This issue emerges when LLMs excessively reduce the probability of rejected completions to achieve high rewards without genuinely meeting their intended goals. The result is overly lengthy, low-diversity generation and catastrophic forgetting of knowledge. We trace the underlying cause to representation redundancy produced by neuron collapse in the parameter space. Hence, we propose a novel Weights-Rotated Preference Optimization (RoPO) algorithm, which implicitly constrains the output-layer logits with the KL divergence inherited from DPO and explicitly constrains the intermediate hidden states by fine-tuning on a multi-granularity orthogonal matrix. This design prevents the policy model from deviating too far from the reference model, thereby retaining the knowledge and expressive capabilities acquired during the pre-training and SFT stages. RoPO achieves up to a 3.27-point improvement on AlpacaEval 2 and surpasses the best baseline by 6.2 to 7.5 points on MT-Bench with merely 0.015% of the parameters trainable, demonstrating its effectiveness in alleviating the reward hacking problem of DPO.
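
The "multi-granularity orthogonal matrix" fine-tuning is in the spirit of orthogonal fine-tuning (OFT): pre-trained weights stay frozen and only a learned rotation applied to them is trained, so hidden-state norms and angles are preserved. A hedged sketch, assuming a Cayley-parameterized block-diagonal rotation (block size and layer placement are illustrative; the paper's exact granularity scheme may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotatedLinear(nn.Module):
    """Wraps a frozen pre-trained nn.Linear and learns a block-diagonal
    orthogonal rotation R applied to its weight matrix: W' = R @ W.
    Since R is orthogonal, it rotates rather than rescales the layer's
    outputs, limiting drift from the reference model."""

    def __init__(self, base: nn.Linear, block_size: int = 64):
        super().__init__()
        assert base.out_features % block_size == 0
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pre-trained weights stay frozen
        self.block_size = block_size
        n_blocks = base.out_features // block_size
        # Unconstrained parameters; skew-symmetry is imposed in rotation().
        # Zero init makes R start as the identity, i.e. W' = W at step 0.
        self.skew = nn.Parameter(torch.zeros(n_blocks, block_size, block_size))

    def rotation(self) -> torch.Tensor:
        # Cayley transform: R = (I + A)^{-1} (I - A) is orthogonal
        # whenever A is skew-symmetric (A^T = -A).
        A = self.skew - self.skew.transpose(-1, -2)
        I = torch.eye(self.block_size, device=A.device).expand_as(A)
        blocks = torch.linalg.solve(I + A, I - A)  # batched solve per block
        return torch.block_diag(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.rotation() @ self.base.weight  # rotate the frozen weights
        return F.linear(x, W, self.base.bias)
```

Because only the small skew-symmetric blocks are trained while W stays frozen, the trainable-parameter budget remains tiny, in line with the 0.015% figure reported above.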
Problem

Research questions and friction points this paper is trying to address.

Addresses reward hacking in DPO for LLM alignment
Mitigates representation redundancy from neuron collapse
Prevents catastrophic forgetting and maintains generation diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weights-Rotated Preference Optimization (RoPO) mitigates DPO reward hacking
Implicitly constrains output-layer logits (KL) and explicitly constrains intermediate hidden states
Fine-tunes multi-granularity orthogonal matrices to preserve expressivity and diversity
👥 Authors
Chenxu Yang
Institute of Information Engineering, Chinese Academy of Sciences
NLP · Dialogue Generation
Ruipeng Jia
Baidu Inc., Beijing, China
Mingyu Zheng
Institute of Information Engineering, Chinese Academy of Sciences
NLP · Table Understanding · LLMs
Naibin Gu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Zheng Lin
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Siyuan Chen
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Weichong Yin
Baidu Inc., Beijing, China
Hua Wu
Baidu Inc., Beijing, China
Weiping Wang
School of Information Science and Engineering, Central South University
Computer Network · Network Security