Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization

📅 2024-05-26
🤖 AI Summary
Existing preference optimization methods such as DPO can degrade the reasoning ability of large language models and are sensitive to judgment noise in preference data and to training-set size. To address this, the authors propose Triple Preference Optimization (TPO), a preference learning method that strengthens both instruction following and reasoning through a single optimization step. TPO is compatible with both base and instruction-tuned models (e.g., Mistral, Llama 3). Across chat-based and reasoning benchmarks, TPO outperforms DPO and SimPO by up to 7.0 and 7.3 points on Arena-Hard, 12.2 and 13.3 points on MixEval-Hard, 10.4 and 10.1 points on MMLU-Pro, and 19.0 and 19.2 points on GSM8K, respectively, without substantially increasing response length and while requiring less training data than DPO.

📝 Abstract
Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome these shortcomings. While studies have shown that DPO improves instruction-following capabilities, it negatively impacts the reasoning ability of LLMs. Additionally, DPO is highly sensitive to judgment noise in preference datasets and the size of the training set. Although several modifications to DPO have been proposed, they still fail to fully resolve these issues. To address these limitations, we propose Triple Preference Optimization (TPO), a new preference learning method designed to enhance both reasoning and instruction-following abilities through one-step optimization. We compare TPO against DPO and its recent variants using state-of-the-art training setups, including both base and instruction-tuned models such as Mistral and Llama 3. Our evaluation covers a comprehensive range of chat-based and reasoning benchmarks. The results demonstrate that TPO achieves significant improvements over existing methods without substantially increasing response length across different dataset sizes. Specifically, TPO outperforms DPO and SimPO by up to 7.0% and 7.3% points on Arena-Hard, 12.2% and 13.3% points on MixEval-Hard, 10.4% and 10.1% points on MMLU-Pro, and 19.0% and 19.2% points on GSM8K, respectively. Furthermore, TPO achieves these improvements while requiring less data than DPO.
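The abstract does not spell out the TPO objective, but one plausible reading of a "triple" preference setup (a gold response plus a preferred/rejected pair) is a DPO-style contrastive term combined with a likelihood anchor on the gold response, optimized in one step. The sketch below is illustrative only, not the paper's exact loss; the function name `tpo_loss` and the hyperparameters `beta` and `alpha` are assumptions.

```python
import math

def tpo_loss(logp_gold, logp_w, logp_l,
             ref_logp_w, ref_logp_l,
             beta=0.1, alpha=1.0):
    """Illustrative triple-preference loss (not the paper's exact objective).

    logp_gold:  policy log-prob of a gold response (SFT-style anchor)
    logp_w/l:   policy log-probs of the preferred / rejected responses
    ref_logp_*: frozen reference-model log-probs of the same responses
    """
    # DPO-style implicit-reward margin between preferred and rejected responses
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written stably as log(1 + exp(-margin))
    preference_term = math.log1p(math.exp(-margin))
    # Likelihood anchor on the gold response; folding it into one combined loss
    # is what makes the optimization a single step rather than SFT-then-DPO
    sft_term = -logp_gold
    return alpha * sft_term + preference_term

# A higher-likelihood gold response lowers the loss, all else equal
assert tpo_loss(-0.5, -2.0, -3.0, -2.0, -3.0) < tpo_loss(-1.0, -2.0, -3.0, -2.0, -3.0)
# Widening the preferred-vs-rejected margin also lowers the loss
assert tpo_loss(-1.0, -1.0, -3.0, -2.0, -3.0) < tpo_loss(-1.0, -2.0, -3.0, -2.0, -3.0)
```

In practice the log-probabilities would be sequence-level sums from the policy and a frozen reference model; the scalar version above only shows how the two terms trade off.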
Problem

Research questions and friction points this paper is trying to address.

Enhancing both reasoning and instruction-following abilities in LLMs, which DPO-style tuning can degrade.
Reducing sensitivity to judgment noise in preference datasets and to training-set size.
Achieving better alignment with less training data than DPO.
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-step preference optimization for alignment
Improves both reasoning and instruction following
Requires less preference data than DPO