🤖 AI Summary
Fine-tuning large language models (LLMs) often degrades alignment, leading to harmful outputs. To address this, we propose a gradient-driven selective parameter rollback method that, for the first time, identifies and leverages an intrinsic dual-stream structure ("alignment direction" vs. "harmful direction") within aligned models, enabling precise detection and restoration of weakened harmful-direction suppression. Our method selectively rolls back only a small subset of critical parameters, preserving downstream task performance while repairing alignment. Experiments across 125 fine-tuned LLMs show a dramatic reduction in harmful response rate, from 33.25% to 1.74%, with negligible degradation in downstream accuracy, outperforming existing alignment repair techniques. Key contributions include: (i) direction-aware alignment modeling, (ii) a lightweight gradient-guided rollback mechanism, and (iii) dynamically constrained rollback optimization.
📝 Abstract
Large language models (LLMs) have demonstrated revolutionary capabilities in understanding complex contexts and performing a wide range of tasks. However, LLMs can also answer questions that are unethical or harmful, raising concerns about their applications. To regulate LLMs' responses to such questions, a training strategy called *alignment* can help. Yet, alignment can be unexpectedly compromised when fine-tuning an LLM for downstream tasks. This paper focuses on recovering the alignment lost during fine-tuning. We observe that there are two distinct directions inherent in an aligned LLM: the *aligned direction* and the *harmful direction*. An LLM is inclined to answer questions in the aligned direction while refusing queries in the harmful direction. Therefore, we propose to recover the harmful direction of the fine-tuned model that has been compromised. Specifically, we restore a small subset of the fine-tuned model's weight parameters from the original aligned model using gradient descent. We also introduce a rollback mechanism to avoid aggressive recovery and maintain downstream task performance. Our evaluation on 125 fine-tuned LLMs demonstrates that our method can reduce their harmful rate (the percentage of harmful questions answered) from 33.25% to 1.74%, without substantially sacrificing task performance. In contrast, existing methods either reduce the harmful rate only to a limited extent or significantly impair normal functionality. Our code is available at https://github.com/kangyangWHU/LLMAlignment
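The core idea of selectively restoring a small subset of harm-relevant parameters can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: it assumes parameters are flattened into NumPy arrays, that a gradient of some harmfulness loss with respect to the fine-tuned weights (`grad_harm`) is available, and that gradient magnitude serves as the importance score; the function name `selective_rollback` and the `rollback_frac` parameter are hypothetical.

```python
import numpy as np

def selective_rollback(theta_ft, theta_aligned, grad_harm, rollback_frac=0.01):
    """Restore the most harm-relevant fraction of fine-tuned weights
    (theta_ft) to their values in the original aligned model (theta_aligned).

    grad_harm: gradient of a harmfulness loss w.r.t. theta_ft (assumed given).
    rollback_frac: fraction of parameters to roll back (kept small to
    preserve downstream task performance).
    """
    scores = np.abs(grad_harm).ravel()          # larger gradient => more harm-relevant
    k = max(1, int(rollback_frac * theta_ft.size))
    top_idx = np.argpartition(scores, -k)[-k:]  # indices of the top-k parameters

    repaired = theta_ft.copy().ravel()
    repaired[top_idx] = theta_aligned.ravel()[top_idx]  # roll back only this subset
    return repaired.reshape(theta_ft.shape)
```

In practice one would iterate this with the paper's dynamically constrained optimization, checking downstream accuracy after each rollback step so the repair stays non-aggressive; the sketch above shows only a single selection-and-restore pass.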