Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
To address strong language models' heavy reliance on high-quality human preference data for alignment, this paper proposes Weak-to-Strong Preference Optimization (WSPO), which extends the weak-to-strong generalization paradigm to model alignment. WSPO transfers and amplifies alignment capability without new preference labels by modeling the distributional difference between the weak model before and after alignment, distilling that difference into a reward-like signal that guides preference optimization of the strong model. Applied to Qwen2-7B-Instruct, WSPO raises the Arena-Hard win rate by 9.9 percentage points (from 39.70 to 49.60), achieves a 47.04 length-controlled win rate on AlpacaEval 2, and scores 7.33 on MT-Bench. These results substantially reduce dependence on human annotation and support the central finding that alignment capability can be transferred across model scales.
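Read together with the title, the "stolen" reward is naturally interpreted as the weak pair's implicit (DPO-style) reward. The following is a hedged sketch of that signal, with β an assumed temperature and the subscripts denoting the weak model after and before alignment; it is not necessarily the paper's exact formulation:

$$r_w(x, y) = \beta \log \frac{\pi_{\text{weak-aligned}}(y \mid x)}{\pi_{\text{weak-base}}(y \mid x)}$$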

📝 Abstract
Aligning language models (LMs) with human preferences has become a key area of research, enabling these models to meet diverse user needs better. Inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor, we extend this idea to model alignment. In this work, we observe that the alignment behavior in weaker models can be effectively transferred to stronger models and even exhibit an amplification effect. Based on this insight, we propose a method called Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model. Experiments demonstrate that WSPO delivers outstanding performance, improving the win rate of Qwen2-7B-Instruct on Arena-Hard from 39.70 to 49.60, achieving a remarkable 47.04 length-controlled win rate on AlpacaEval 2, and scoring 7.33 on MT-bench. Our results suggest that using the weak model to elicit a strong model with a high alignment ability is feasible.
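As one concrete reading of "learning the distribution differences before and after the alignment of the weak model", here is a minimal PyTorch sketch: it takes the weak pair's log-probability gap on a response as a target and regresses the strong model's own gap (relative to its reference) onto it. The names (`wspo_style_loss`, `beta`) and the MSE matching are illustrative assumptions, not the paper's published objective.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of `labels` under `logits`."""
    logps = F.log_softmax(logits, dim=-1)                      # (batch, seq, vocab)
    token_logps = logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_logps.sum(dim=-1)                             # (batch,)

def wspo_style_loss(
    strong_logits: torch.Tensor,        # trainable strong policy
    strong_ref_logits: torch.Tensor,    # frozen strong reference (pre-alignment)
    weak_aligned_logits: torch.Tensor,  # frozen weak model after alignment
    weak_base_logits: torch.Tensor,     # frozen weak model before alignment
    labels: torch.Tensor,
    beta: float = 0.1,                  # assumed temperature, not from the paper
) -> torch.Tensor:
    # The weak pair's alignment delta: the "stolen" reward-like signal.
    weak_delta = (sequence_logprob(weak_aligned_logits, labels)
                  - sequence_logprob(weak_base_logits, labels))
    # The strong model's corresponding delta relative to its reference.
    strong_delta = (sequence_logprob(strong_logits, labels)
                    - sequence_logprob(strong_ref_logits, labels))
    # Nudge the strong model's delta toward the (detached) weak delta.
    return F.mse_loss(beta * strong_delta, beta * weak_delta.detach())
```

Only the strong policy receives gradients here; detaching the weak delta reflects that the weak pair serves purely as a fixed source of the alignment signal.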
Problem

Research questions and friction points this paper is trying to address.

Aligning language models with human preferences
Transferring alignment behavior from weak models to strong models
Reducing dependence on high-quality human preference annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weak-to-Strong Preference Optimization (WSPO) method
Transfer of alignment behavior from weak models to strong models
Amplification effect in model alignment performance
Wenhong Zhu
Shanghai Jiao Tong University
Natural Language Processing
Zhiwei He
Shanghai Jiao Tong University
Xiaofeng Wang
Shanghai Jiao Tong University
Pengfei Liu
Shanghai Jiao Tong University
Rui Wang
Shanghai Jiao Tong University