π€ AI Summary
This study investigates whether large language models exhibit emergent misalignment (EM)βa sudden degradation in alignment on unrelated domainsβafter task-specific fine-tuning. Through systematic experiments spanning 12 open-source models and GPT-4o, leveraging over one million model responses, multi-model family fine-tuning, checkpoint-level training trajectory analysis, and cross-domain evaluation (code and medical domains), the work demonstrates that EM is not a universal phenomenon but strongly dependent on model scale and occurs only during overtraining beyond main-task convergence. The authors propose mitigation strategies based on early stopping and learning rate scheduling: EM stably manifests in only 17% of open-source models; early stopping eliminates EM while preserving 93% of primary task performance, and prevents 67% of spurious generalization cases in the medical domain.
π Abstract
Emergent misalignment (EM), where fine-tuning on a narrow task (like insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most comprehensive EM study to date, reproducing the original GPT-4o finding and expanding to 12 open-source models across 4 families (Llama, Qwen, DeepSeek, GPT-OSS) ranging from 8B to 671B parameters, evaluating over one million model responses with multiple random seeds. We find that EM replicates in GPT-4o but is far from universal: only 2 of 12 open-source models (17%) exhibit consistent EM across seeds, with a significant correlation between model size and EM susceptibility. Through checkpoint-level analysis during fine-tuning, we demonstrate that EM emerges late in training, distinct from and subsequent to near convergence of the primary task, suggesting EM emerges from continued training past task convergence. This yields practical mitigations: early stopping eliminates EM while retaining an average of 93% of task performance, and careful learning rate selection further minimizes risk. Cross-domain validation on medical fine-tuning confirms these patterns generalize: the size-EM correlation strengthens (r = 0.90), and overgeneralization to untruthfulness remains avoidable via early stopping in 67% of cases, though semantically proximate training domains produce less separable misalignment. As LLMs become increasingly integrated into real-world systems, fine-tuning and reinforcement learning remain the primary methods for adapting model behavior. Our findings demonstrate that with proper training practices, EM can be avoided, reframing it from an unforeseen fine-tuning risk to an avoidable training artifact.