Language Models Resist Alignment: Evidence From Data Compression

📅 2024-06-10
📈 Citations: 2
Influential: 1
🤖 AI Summary
This work investigates whether alignment fine-tuning fundamentally alters the intrinsic behavioral priors of large language models (LLMs) or merely induces superficial, easily reversible modifications. We find that aligned models exhibit pronounced *elasticity*: under subsequent fine-tuning, they rapidly revert to the behavior distribution formed during pre-training. Leveraging compression theory, we formally deduce that fine-tuning undermines alignment disproportionately more than it undermines pre-training, potentially by orders of magnitude. Experiments across models of varying types and scales show that elasticity is pervasive and follows a two-phase pattern: performance declines rapidly until the model has reverted to the pre-training distribution, after which the rate of decline drops significantly. Elasticity also correlates positively with model size and pre-training data volume. The core contribution is theoretical and empirical evidence that alignment is fragile, acting as a shallow, non-persistent perturbation on top of stable pre-training priors, which motivates the development of more robust alignment methods.

📝 Abstract
Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we present the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with model size and the amount of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.
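The compression view mentioned in the abstract rests on a standard fact: a model that assigns probability p to a token can encode it in roughly -log2(p) bits, so a better fit to a dataset means a smaller compressed size. The sketch below illustrates only that link. The function names, the per-token probabilities, and the 16-bit raw token size are hypothetical choices for illustration, not quantities or code from the paper.

```python
import math

def ideal_code_length_bits(token_probs):
    """Shannon code length: sum of -log2(p) over the probabilities
    a model assigned to each observed token."""
    return sum(-math.log2(p) for p in token_probs)

def compression_rate(token_probs, raw_bits_per_token):
    """Compressed size divided by raw size; lower means a better fit."""
    raw_bits = len(token_probs) * raw_bits_per_token
    return ideal_code_length_bits(token_probs) / raw_bits

# Hypothetical per-token probabilities from two models on the same text.
probs_well_fit = [0.5, 0.25, 0.5, 0.25]
probs_poor_fit = [0.125, 0.125, 0.0625, 0.125]
RAW_BITS = 16  # assumption: tokens stored uncompressed as 16-bit ids

rate_well = compression_rate(probs_well_fit, RAW_BITS)
rate_poor = compression_rate(probs_poor_fit, RAW_BITS)
print(f"well-fit rate: {rate_well:.4f}, poor-fit rate: {rate_poor:.4f}")
```

Under this framing, comparing how a model's compression rate on an alignment dataset and on pre-training data changes during further fine-tuning is one way to ask which of the two the model "holds on to" more strongly.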
Problem

Research questions and friction points this paper is trying to address.

Alignment fine-tuning effects on LLMs are superficial, not robust
Post-alignment models revert to pre-training behavior upon fine-tuning
Elasticity increases with model size and pre-training data
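Reversion toward pre-training behavior can be made concrete by measuring how far a model's behavior distribution sits from the pre-trained one at each stage. The sketch below uses KL divergence as the distance, which is an assumption chosen for illustration rather than the paper's stated metric. The three distributions are made-up numbers over three hypothetical response types, not experimental results.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical behavior distributions over three response types.
pretrained = [0.5, 0.3, 0.2]      # before alignment
aligned = [0.1, 0.2, 0.7]         # after alignment fine-tuning
after_finetune = [0.4, 0.3, 0.3]  # after further fine-tuning

shift_aligned = kl_divergence(aligned, pretrained)
shift_after = kl_divergence(after_finetune, pretrained)

# A smaller divergence after further fine-tuning would indicate that the
# model has moved back toward its pre-training distribution.
print(f"aligned vs pretrained:        {shift_aligned:.4f}")
print(f"after fine-tune vs pretrained: {shift_after:.4f}")
```

Tracking such a distance across fine-tuning steps, for models of different sizes, is one simple way the size and data-volume trends listed above could be examined.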
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging compression theory for alignment analysis
Demonstrating elasticity in post-alignment models
Correlating elasticity with model size and data
Jiaming Ji
PKU-Alignment Team, Peking University
Kaile Wang
Peking University
Tianyi Qiu
PKU-Alignment Team, Peking University
Boyuan Chen
PKU-Alignment Team, Peking University
Jiayi Zhou
PKU-Alignment Team, Peking University
Changye Li
PKU-Alignment Team, Peking University
Hantao Lou
Peking University
AI Alignment, AI Safety, Interpretability, Trustworthy AI
Josef Dai
Zhejiang University
Alignment
Yunhuai Liu
Yaodong Yang