Bootstrapping Language Models with DPO Implicit Rewards

📅 2024-06-14
🏛️ arXiv.org
📈 Citations: 17
Influential: 2
🤖 AI Summary
Large language models (LLMs) struggle to sustainably improve human alignment without external human feedback. To address this, we propose DICE: a self-alignment framework leveraging the implicit reward model derived from Direct Preference Optimization (DPO). Its core idea is *implicit reward self-guidance*: exploiting the intrinsic reward signal embedded in a DPO-trained model to autonomously generate high-quality preference data, enabling iterative retraining for multi-round self-alignment. Methodologically, DICE introduces length-regularized reward shaping to mitigate length bias in generation and incorporates experience replay to enhance preference-data diversity and training stability. On AlpacaEval 2, DICE improves the length-controlled win rate by more than 8% with zero external feedback, with consistent gains across multiple base models. These results support its generalizability and advance unsupervised alignment paradigms.

📝 Abstract
Human alignment in large language models (LLMs) is an active area of research. A recent groundbreaking work, direct preference optimization (DPO), has greatly simplified the process from past work in reinforcement learning from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO, after training, provides an implicit reward model. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach is to use the rewards from a current LLM to construct a preference dataset, which is then used in subsequent DPO rounds. We incorporate two refinements to further improve our approach: 1) length-regularized reward shaping to make the preference dataset length-unbiased; 2) experience replay to enhance the quality of the preference dataset. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment. It achieves an increase of more than 8% in length-controlled win rate on AlpacaEval 2 for all the different base models that we tried, without relying on external feedback. Our code is available at https://github.com/sail-sg/dice.
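The implicit reward that DPO yields is the log-probability ratio between the trained policy and the reference model, scaled by the DPO temperature β. Below is a minimal sketch of that reward and of one plausible form of the paper's length-regularized shaping (subtracting a penalty proportional to response length); the function names and the values of `beta` and `alpha` are illustrative, not taken from the paper.

```python
def dpo_implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).

    logp_policy and logp_ref are the summed token log-probabilities of a
    full response y under the DPO-trained policy and the reference model.
    """
    return beta * (logp_policy - logp_ref)


def length_regularized_reward(logp_policy: float, logp_ref: float,
                              length: int, beta: float = 0.1,
                              alpha: float = 0.01) -> float:
    """Length-regularized shaping (sketch): penalize long responses so the
    constructed preference data is not biased toward verbosity.

    alpha is a hypothetical penalty coefficient for illustration.
    """
    return dpo_implicit_reward(logp_policy, logp_ref, beta) - alpha * length
```

For example, a response with policy log-prob −10.0 and reference log-prob −12.0 gets implicit reward 0.2 at β = 0.1; a longer response with the same log-ratio scores lower once the length penalty is applied.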
Problem

Research questions and friction points this paper is trying to address.

Improves human alignment in large language models using DPO implicit rewards.
Uses implicit rewards to bootstrap and refine language model alignment.
Enhances alignment without external feedback, increasing win rates significantly.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses DPO implicit rewards for bootstrapping LLMs
Implements length-regularized reward shaping
Applies experience replay to enhance dataset quality
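The bootstrapping loop described above can be sketched as follows: sample several responses per prompt, score each with the implicit reward, pair the best and worst as (chosen, rejected), and blend in pairs from earlier rounds as experience replay. All names and the replay fraction here are illustrative assumptions, not the paper's exact procedure.

```python
import random


def build_preference_pair(responses_with_rewards):
    """Rank sampled responses by implicit reward; the highest-scoring
    response becomes 'chosen' and the lowest 'rejected'."""
    ranked = sorted(responses_with_rewards, key=lambda rr: rr[1], reverse=True)
    return {"chosen": ranked[0][0], "rejected": ranked[-1][0]}


def mix_with_replay(new_pairs, replay_buffer, replay_fraction=0.25, seed=0):
    """Experience replay (sketch): blend a fraction of preference pairs
    from earlier rounds into the new dataset to improve diversity and
    training stability. replay_fraction is a hypothetical value."""
    rng = random.Random(seed)
    k = min(int(len(new_pairs) * replay_fraction), len(replay_buffer))
    return new_pairs + rng.sample(replay_buffer, k)
```

The dataset produced this way feeds the next DPO round, after which the new model's implicit rewards are used to build the following round's data.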