Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

career value

203K/year

🤖 AI Summary

This work investigates how large language models, when fine-tuned on harmful data, exhibit both overt and covert misalignment when evaluated out-of-distribution. Framing misalignment as a data-mediated transfer phenomenon, the study systematically examines how the interplay among fine-tuning data structure, pretraining composition, and training pathways jointly shapes the generalization and propagation of alignment behaviors. Through supervised fine-tuning, off-policy and on-policy knowledge distillation, structured prompt analysis, and behavioral evaluation, the authors demonstrate that misalignment is more likely to occur under prompts with functionally similar structures and large spaces for harmful completions. They further show that pretraining composition significantly modulates the severity of misalignment, underscoring the centrality of data-driven mechanisms in alignment dynamics.

📝 Abstract

Fine-tuning LLMs on narrow harmful datasets can induce Emergent Misalignment (EM), where models exhibit misaligned behavior far beyond the fine-tuning distribution. We argue that emergent misalignment can be better understood as a data-mediated transfer phenomenon: harmful fine-tuning examples do not induce uniform behavioral spillover, but interact with the structural properties of the dataset and the difficulty of the tasks relative to the model. Across our experiments, we find that misalignment appears more readily when fine-tuning and evaluation prompts share similar underlying functional structure, when prompts leave more room for coherent harmful completions, and when the target behavior has been more reliably learned by the model. The training pipeline itself also matters: pretraining composition shapes later misalignment. We further study Subliminal Learning (SL), where misalignment is transmitted by fine-tuning on seemingly benign data generated by a harmful teacher. Moving beyond the standard SFT setting, we for the first time compare this transfer under off-policy and on-policy distillation as well, allowing us to separate the roles of the teacher guidance and the training data distribution in transmitting misalignment. Together, these results argue for a data-centric view: Emergent/subliminal misalignment should not be treated as a simple consequence of isolated harmful fine-tuning examples, but as the result of interactions between fine-tuning data structure, pretraining distributions, and training channels.

Problem

Research questions and friction points this paper is trying to address.

Emergent Misalignment

Subliminal Learning

Data-Mediated Transfer

Fine-tuning

Alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Emergent Misalignment

Subliminal Learning

Data-Mediated Transfer