🤖 AI Summary
This study addresses the vulnerability of large language models (LLMs) to malicious fine-tuning that can induce misalignment and compromise safety. The authors systematically evaluate four supervised fine-tuning (SFT) and two preference-based fine-tuning (PFT) methods, both for inducing misalignment and for restoring alignment, across four widely used safety-aligned LLMs. They find that Odds Ratio Preference Optimization (ORPO) is the most effective method for inducing misalignment, whereas Direct Preference Optimization (DPO) is the most effective for realignment. The work further uncovers key asymmetries between attack and defense dynamics, persistent residual effects from multi-round adversarial interactions, and model-specific resistance patterns. These findings underscore the need to tailor alignment strategies to individual model characteristics and to strengthen pre-deployment safeguards, particularly for open-source models, in order to mitigate alignment-erosion risks.
📝 Abstract
The deployment of large language models (LLMs) raises significant ethical and safety concerns. While LLM alignment techniques are adopted to improve model safety and trustworthiness, adversaries can exploit these techniques to undermine safety for malicious purposes, resulting in *misalignment*. Misaligned LLMs may be published on open platforms to magnify harm. To address this, additional safety alignment, referred to as *realignment*, is necessary before deploying untrusted third-party LLMs. This study explores the efficacy of fine-tuning methods in terms of misalignment, realignment, and the effects of their interplay. By evaluating four Supervised Fine-Tuning (SFT) and two Preference Fine-Tuning (PFT) methods across four popular safety-aligned LLMs, we reveal a mechanism asymmetry between attack and defense. While Odds Ratio Preference Optimization (ORPO) is most effective for misalignment, Direct Preference Optimization (DPO) excels in realignment, albeit at the expense of model utility. Additionally, we identify model-specific resistance, residual effects of multi-round adversarial dynamics, and other noteworthy findings. These findings highlight the need for robust safeguards and customized safety alignment strategies to mitigate potential risks in the deployment of LLMs. Our code is available at https://github.com/zhangrui4041/The-Art-of-Mis-alignment.
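For readers unfamiliar with the preference-optimization objective the abstract credits for realignment, the standard DPO loss can be sketched as below. This is a minimal, generic illustration of the published DPO objective for a single preference pair, not code from the paper's repository; the function name and toy log-probability values are hypothetical.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    logp_* are response log-probabilities under the policy being trained;
    ref_logp_* are the same quantities under the frozen reference model.
    beta scales the implicit reward (the policy/reference log-ratio).
    """
    # Implicit rewards: beta * log-ratio between policy and reference model.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    # Logistic loss on the reward margin: minimizing it pushes the policy
    # to prefer the chosen response relative to the reference model.
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy matches the reference model the margin is zero and the loss equals ln 2; any update that raises the chosen response's log-ratio over the rejected one's drives the loss below that baseline. For realignment, "chosen" would be safe refusals and "rejected" harmful completions; an attacker inducing misalignment simply swaps the pair.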