🤖 AI Summary
Harmful fine-tuning of large language models (LLMs) poses a critical security threat: fine-tuning on user data that contains harmful samples can break a model's safety alignment. Method: This work identifies "harmful perturbation" of the model weights as the root cause of broken alignment under harmful fine-tuning and proposes Booster, a defense applied during the alignment stage, before any user fine-tuning takes place. Booster adds a regularizer to the alignment loss that attenuates the reduction in harmful loss produced by a simulated harmful perturbation. Contribution/Results: Extensive experiments across multiple LLMs (Llama-2/3, Qwen) and benchmarks (AlpacaEval, MT-Bench) demonstrate that Booster significantly reduces harmful output scores (average reduction of 38.7%) while preserving downstream task performance.
📝 Abstract
The harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for fine-tuning-as-a-service of large language models (LLMs). While existing defenses \citep{huang2024vaccine,rosati2024representation} have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. For the first time in the literature, we show in this paper that \textit{harmful perturbation} of the model weights is the root cause of the broken alignment caused by harmful fine-tuning. To attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer to the alignment stage's optimization. The regularizer ensures that the reduction in the model's harmful loss after a simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster effectively reduces the harmful score of the fine-tuned models while maintaining performance on downstream tasks. Our code is available at \url{https://github.com/git-disl/Booster}.
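
To make the regularizer concrete, the following is a minimal sketch on a toy problem, not the authors' implementation. The abstract does not state the exact formula, so the sketch assumes one plausible form: the alignment loss plus `lam` times the drop in harmful loss after one normalized gradient step of size `alpha` on the harmful loss. The quadratic toy losses, the two-dimensional weights, and all names (`W_ALIGNED`, `W_HARMFUL`, `lam`, `alpha`) are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy setup: 2-D "weights" and quadratic losses.
W_ALIGNED = [1.0, 0.0]  # assumed minimizer of the toy alignment loss
W_HARMFUL = [0.0, 2.0]  # assumed minimizer of the toy harmful loss


def sq_dist_loss(w, target):
    """Toy loss: half the squared distance from w to target."""
    return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def sq_dist_grad(w, target):
    """Analytic gradient of sq_dist_loss with respect to w."""
    return [wi - ti for wi, ti in zip(w, target)]


def harmful_loss_reduction(w, alpha):
    """Drop in harmful loss after one simulated harmful perturbation.

    The perturbation is assumed to be a normalized gradient-descent step
    of size alpha on the harmful loss.
    """
    grad = sq_dist_grad(w, W_HARMFUL)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return 0.0  # already at the harmful minimum; no step to simulate
    w_perturbed = [wi - alpha * g / norm for wi, g in zip(w, grad)]
    return sq_dist_loss(w, W_HARMFUL) - sq_dist_loss(w_perturbed, W_HARMFUL)


def regularized_objective(w, lam, alpha):
    """Alignment loss plus lam times the simulated harmful loss reduction."""
    return sq_dist_loss(w, W_ALIGNED) + lam * harmful_loss_reduction(w, alpha)


if __name__ == "__main__":
    w = [1.0, 0.0]
    print("harmful loss reduction:", harmful_loss_reduction(w, alpha=0.1))
    print("regularized objective:", regularized_objective(w, lam=5.0, alpha=0.1))
```

Under these assumptions, minimizing `regularized_objective` favors weights where a single harmful step buys little harmful-loss reduction. In a real model the two losses would be computed on alignment and harmful datasets, with gradients obtained by automatic differentiation.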