Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

📅 2024-10-11

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

career value

154K/year

🤖 AI Summary

This work identifies and systematically analyzes the “likelihood shift” phenomenon in Direct Preference Optimization (DPO), wherein models exhibit anomalous decreases—or even inversions—in the generation probability of preferred responses during training, sometimes producing semantically opposite or harmful outputs, thereby severely degrading safety refusal rates (e.g., dropping to 74.4% for Llama-3-8B-Instruct). We formally define this phenomenon and introduce the Centered Hidden-state Embedding Similarity (CHES) metric as a theoretically grounded, interpretable criterion for detecting and filtering likelihood-shifted samples. Leveraging a CHES-guided data curation and DPO training framework, we restore safety refusal performance to its pre-training baseline while preserving preference alignment, effectively mitigating unintended misalignment. Our core contributions are threefold: (1) formal characterization of likelihood shift as a novel failure mode in preference optimization; (2) development of CHES as a principled, interpretable diagnostic tool; and (3) provision of a practical, deployable intervention strategy grounded in empirical analysis and theoretical insight.

Technology Category

Application Category

📝 Abstract

Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $ exttt{No}$ over $ exttt{Never}$ can sharply increase the probability of $ exttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.

Problem

Research questions and friction points this paper is trying to address.

Analyzes likelihood displacement in DPO training.

Explores unintended unalignment in language models.

Proposes CHES score to mitigate training issues.

Innovation

Methods, ideas, or system contributions that make the work stand out.

CHES score identifies displacement sources

Filtering samples mitigates unalignment risks

Data curation ensures distinct preference embeddings

🔎 Similar Papers

The Crucial Role of Samplers in Online Direct Preference Optimization