🤖 AI Summary
Fine-tuning large language models (LLMs) often degrades alignment, leading to harmful outputs. To address this, we propose a gradient-driven selective parameter rollback method that, for the first time, identifies and leverages an intrinsic dual-stream structure ("alignment direction" vs. "harmful direction") within aligned models, enabling precise detection and restoration of weakened harmful-direction suppression. Our method selectively rolls back only a small subset of critical parameters, preserving downstream task performance while repairing alignment. Experiments across 125 fine-tuned LLMs show a dramatic reduction in harmful response rate, from 33.25% to 1.74%, with negligible degradation in downstream accuracy, outperforming existing alignment repair techniques. Key contributions include: (i) direction-aware alignment modeling, (ii) a lightweight gradient-guided rollback mechanism, and (iii) dynamically constrained rollback optimization.
📝 Abstract
Large language models (LLMs) have demonstrated revolutionary capabilities in understanding complex contexts and performing a wide range of tasks. However, LLMs can also answer questions that are unethical or harmful, raising concerns about their applications. To regulate LLMs' responses to such questions, a training strategy called *alignment* can help. Yet, alignment can be unexpectedly compromised when fine-tuning an LLM for downstream tasks. This paper focuses on recovering the alignment lost during fine-tuning. We observe that there are two distinct directions inherent in an aligned LLM: the *aligned direction* and the *harmful direction*. An LLM is inclined to answer questions in the aligned direction while refusing queries in the harmful direction. Therefore, we propose to recover the harmful direction of the fine-tuned model that has been compromised. Specifically, we restore a small subset of the fine-tuned model's weight parameters from the original aligned model using gradient descent. We also introduce a rollback mechanism to avoid aggressive recovery and maintain downstream task performance. Our evaluation on 125 fine-tuned LLMs demonstrates that our method can reduce their harmful rate (the percentage of harmful questions answered) from 33.25% to 1.74%, without substantially sacrificing task performance. In contrast, existing methods either reduce the harmful rate only to a limited extent or significantly impair normal functionality. Our code is available at https://github.com/kangyangWHU/LLMAlignment
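The core idea of selectively restoring a small subset of harm-relevant parameters can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: it assumes parameters are flattened into NumPy arrays, that a gradient of some harmfulness loss with respect to the fine-tuned weights (`grad_harm`) is available, and that gradient magnitude serves as the importance score; the function name `selective_rollback` and the `rollback_frac` parameter are hypothetical.

```python
import numpy as np

def selective_rollback(theta_ft, theta_aligned, grad_harm, rollback_frac=0.01):
    """Restore the most harm-relevant fraction of fine-tuned weights
    (theta_ft) to their values in the original aligned model (theta_aligned).

    grad_harm: gradient of a harmfulness loss w.r.t. theta_ft (assumed given).
    rollback_frac: fraction of parameters to roll back (kept small to
    preserve downstream task performance).
    """
    scores = np.abs(grad_harm).ravel()          # larger gradient => more harm-relevant
    k = max(1, int(rollback_frac * theta_ft.size))
    top_idx = np.argpartition(scores, -k)[-k:]  # indices of the top-k parameters

    repaired = theta_ft.copy().ravel()
    repaired[top_idx] = theta_aligned.ravel()[top_idx]  # roll back only this subset
    return repaired.reshape(theta_ft.shape)
```

In practice one would iterate this with the paper's dynamically constrained optimization, checking downstream accuracy after each rollback step so the repair stays non-aggressive; the sketch above shows only a single selection-and-restore pass.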