Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses "sudden value misalignment": an abrupt, significant deviation of large language model (LLM) behavior from human values during fine-tuning, triggered by narrow-domain harmful data. The authors propose what they present as the first interpretable detection framework grounded in natural-language order parameters, integrating statistical distribution-shift detection, LLM-based adjudication, and multi-dimensional alignment metrics (e.g., ethics, politics, knowledge). This enables automated identification and decomposed quantification of phase transitions. A key finding is that behavioral phase transitions lag behind peaks in the gradient norm, underscoring their non-local, emergent nature. The framework supports fine-grained, cross-domain attribution analysis, quantifying each behavioral dimension's contribution to the overall distributional shift, and thus provides both a theoretical foundation and a practical methodology for safe, controllable LLM fine-tuning.

📝 Abstract
Fine-tuning LLMs on narrowly harmful datasets can lead to behavior that is broadly misaligned with respect to human values. To understand when and how this emergent misalignment occurs, we develop a comprehensive framework for detecting and characterizing rapid transitions during fine-tuning, using both distributional change detection methods and order parameters that are formulated in plain English and evaluated by an LLM judge. Using an objective statistical dissimilarity measure, we quantify how the phase transition that occurs during fine-tuning affects multiple aspects of the model. In particular, we assess what percentage of the total distributional change in model outputs is captured by different aspects, such as alignment or verbosity, providing a decomposition of the overall transition. We also find that the actual behavioral transition occurs later in training than indicated by the peak in the gradient norm alone. Our framework enables the automated discovery and quantification of language-based order parameters, which we demonstrate on examples ranging from knowledge questions to politics and ethics.
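The transition-detection idea in the abstract can be illustrated with a minimal sketch: score model outputs at each fine-tuning checkpoint on some language-based order parameter, histogram the scores, and flag the checkpoint pair with the largest distributional jump. The dissimilarity measure here (Jensen-Shannon divergence) and all numbers are illustrative assumptions, not the paper's actual metric or data.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))  # KL divergence in bits
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical histograms of an LLM-judged "misalignment" score (5 bins)
# collected at successive fine-tuning checkpoints.
checkpoints = [
    [0.70, 0.20, 0.05, 0.03, 0.02],  # early: outputs mostly aligned
    [0.65, 0.22, 0.07, 0.04, 0.02],  # small drift
    [0.20, 0.15, 0.20, 0.25, 0.20],  # abrupt shift: candidate phase transition
    [0.15, 0.12, 0.22, 0.28, 0.23],  # post-transition plateau
]

# Dissimilarity between consecutive checkpoints; the peak marks the transition.
shifts = [js_divergence(checkpoints[i], checkpoints[i + 1])
          for i in range(len(checkpoints) - 1)]
transition_step = int(np.argmax(shifts)) + 1  # checkpoint index after the jump
```

A behavioral metric like this can peak at a different training step than the gradient norm, which is the kind of lag the paper reports.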
Problem

Research questions and friction points this paper is trying to address.

Detecting emergent misalignment in fine-tuned LLMs
Quantifying phase transitions during model fine-tuning
Developing automated language-based order parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Statistical dissimilarity measures for phase transitions
LLM-evaluated plain English order parameters
Automated discovery of language-based metrics
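The decomposition contribution above can be sketched in a few lines: given a per-aspect distributional shift for each order parameter (the aspect names and values here are hypothetical placeholders, not results from the paper), the fraction of the total shift attributable to each aspect is just its normalized share.

```python
# Hypothetical per-aspect distributional shifts (e.g., a dissimilarity score
# computed separately for each language-based order parameter).
aspect_shift = {
    "alignment": 0.42,
    "politics": 0.15,
    "verbosity": 0.08,
    "knowledge": 0.05,
}

# Each aspect's contribution to the overall transition, as a fraction of
# the total measured shift.
total = sum(aspect_shift.values())
contribution = {name: shift / total for name, shift in aspect_shift.items()}

dominant_aspect = max(contribution, key=contribution.get)
```

Under these toy numbers, alignment accounts for 60% of the total shift, which is the style of attribution statement the framework is built to produce.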