Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

📅 2026-05-11

🤖 AI Summary

Preference optimization methods such as Direct Preference Optimization (DPO) are prone to learning spurious correlations, which leads to sycophancy, length bias, and poor generalization. This work gives a unified theoretical account of two channels through which preference learning picks up spurious features, mean spurious bias and causal–spurious correlation leakage, and shows that the resulting reliance is an irreducible vulnerability under distribution shift: more data from the same training distribution does not reduce it. To address this, the authors propose *tie training*, a data augmentation strategy that adds ties (equal-utility preference pairs) as a data-driven regularizer, selectively suppressing spurious learning while preserving causal learning. The theory is developed for log-linear policies, and experiments on neural networks and large language models indicate that the spurious learning mechanisms and the benefits of tie training persist beyond that setting.

📝 Abstract

Preference learning methods such as Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today's language models and potentially severe goal misgeneralization in future systems. In this work, we provide a unified theoretical analysis of this phenomenon, characterizing the mechanisms of spurious learning, its consequences on deployment, and a provable mitigation strategy. Focusing on log-linear policies, we show that standard preference-learning objectives induce reliance on spurious features at the population level through two channels: mean spurious bias and causal–spurious correlation leakage. We then show that this reliance creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model's dependence on spurious features. To address this, we propose tie training, a data augmentation strategy using ties (equal-utility preference pairs) to introduce data-driven regularization. We demonstrate that this approach selectively reduces spurious learning without degrading causal learning. Finally, we validate our theory on log-linear models and provide empirical evidence that both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models.
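
The setting in the abstract can be illustrated with a toy log-linear preference model. The sketch below is not the paper's implementation: it assumes a two-feature model (one causal, one spurious), a zero reference policy with unit temperature so the preference logit reduces to a dot product with the feature difference, and it encodes a tie as a soft target of 0.5 on a pair whose causal features match. The abstract does not say how ties enter the loss, so that encoding, along with the helper names `make_pairs`, `make_ties`, and `fit`, is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_pairs(n, rng, spurious_shift=1.0):
    """Synthetic preference pairs as feature differences (first minus second item).

    Column 0 is the causal feature, column 1 the spurious one. The label depends
    only on the causal feature, but the spurious difference is shifted toward the
    preferred item, so it is predictive on the training distribution.
    """
    causal = rng.normal(size=n)
    label = (rng.random(n) < sigmoid(2.0 * causal)).astype(float)
    spurious = rng.normal(size=n) + spurious_shift * (2.0 * label - 1.0)
    return np.column_stack([causal, spurious]), label

def make_ties(n, rng):
    """Assumed tie pairs: equal causal features, different spurious features, target 0.5."""
    spurious = 2.0 * rng.normal(size=n)
    return np.column_stack([np.zeros(n), spurious]), np.full(n, 0.5)

def fit(X, y, steps=2000, lr=0.5):
    """Gradient descent on the Bradley-Terry (logistic) loss with soft targets."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = make_pairs(2000, rng)
    X_tie, y_tie = make_ties(2000, rng)

    w_plain = fit(X, y)
    w_tie = fit(np.vstack([X, X_tie]), np.concatenate([y, y_tie]))

    print("weights [causal, spurious] without ties:", w_plain)
    print("weights [causal, spurious] with ties:   ", w_tie)
```

In this toy setup the tie pairs carry no causal signal, so their gradient only pulls the spurious weight toward zero, which is one simple way a tie-based regularizer could act selectively. Whether this matches the paper's construction or its guarantees is not something the abstract settles.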

Problem

Research questions and friction points this paper is trying to address.

spurious correlation

preference optimization

distribution shift

sycophancy

length bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

spurious correlation

preference optimization

tie training