🤖 AI Summary
Offline alignment methods for large language models (LLMs) are prone to overoptimization: the trained policy drifts too far from the reference model, and sample quality degrades. To address this, we propose a dynamic Trust Region (TR) update paradigm, the first to adaptively update the reference policy during training, avoiding the performance degradation inherent in a static reference policy. Building on the DPO, IPO, and KTO frameworks, we introduce TR-DPO, TR-IPO, and TR-KTO, which retain the implicit trust-region constraint while adapting the reference model online. Extensive experiments show that these methods outperform their baselines on AlpacaEval 2 and Arena-Hard, while also improving generation quality and safety on dialogue alignment and summarization tasks.
📝 Abstract
Although offline alignment methods for Large Language Models (LLMs) do not require an explicit reward model, they remain susceptible to overoptimization. This issue arises when the trained model deviates excessively from the reference policy, leading to a decrease in sample quality. We propose a new paradigm of offline alignment methods, called Trust Region (with variants TR-DPO, TR-IPO, and TR-KTO), which dynamically updates the reference policy throughout training. Our results show that TR alignment methods effectively mitigate overoptimization, enabling models to maintain strong performance even when substantially deviating from the initial reference policy. We demonstrate the efficacy of these approaches not only through toy examples that exhibit reduced overoptimization, but also through direct, side-by-side comparisons on specific tasks such as helpful and harmless dialogue and summarization, where they surpass conventional methods. Additionally, we report significant improvements in general-purpose assistant setups with the Llama3 model on the AlpacaEval 2 and Arena-Hard benchmarks, highlighting the advantages of Trust Region methods over classical approaches.
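The dynamic reference-policy update described above is typically realized in one of two ways: a soft update that blends the reference toward the current policy, or a hard update that periodically copies the policy into the reference. The sketch below is illustrative only, assuming a soft-update weight `alpha` and a hard-update period `tau` as hypothetical hyperparameter names; toy lists of floats stand in for full model parameter tensors:

```python
# Hedged sketch of the two reference-policy update rules for Trust Region
# alignment (TR-DPO / TR-IPO / TR-KTO). Real implementations operate on
# full model state dicts; `alpha` and `tau` are illustrative names here.

def soft_update(ref_params, policy_params, alpha):
    """Blend the reference toward the policy: ref <- alpha*policy + (1-alpha)*ref."""
    return [alpha * p + (1.0 - alpha) * r
            for r, p in zip(ref_params, policy_params)]

def hard_update(ref_params, policy_params, step, tau):
    """Replace the reference with a copy of the policy every `tau` steps."""
    if step % tau == 0:
        return list(policy_params)
    return ref_params
```

Either rule keeps the trust region centered near the current policy, so the model can move far from the initial reference without the quality collapse seen with a fixed reference.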