Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

📅 2024-09-03

🏛️ arXiv.org

📈 Citations: 11

✨ Influential: 1

career value

176K/year

🤖 AI Summary

Malicious parameter perturbations during fine-tuning of large language models (LLMs) pose a critical security threat by inducing alignment collapse. Method: This work identifies “harmful perturbations” as the root cause of fine-tuning misalignment and proposes Booster—the first proactive defense framework operating *prior* to the alignment phase. Booster introduces a novel perturbation-suppression regularizer grounded in the decay of alignment loss, integrating gradient constraints with simulated perturbation evaluation within an extended optimization pipeline. Contribution/Results: Extensive experiments across multiple LLMs (Llama-2/3, Qwen) and benchmarks (AlpacaEval, MT-Bench) demonstrate that Booster significantly reduces harmful output scores (average reduction of 38.7%) while preserving downstream task performance—thereby validating its robustness and practical efficacy.

Technology Category

Application Category

📝 Abstract

Harmful fine-tuning issue citep{qi2023fine} poses serious safety concerns for Large language models' fine-tuning-as-a-service. While existing defenses citep{huang2024vaccine,rosati2024representation} have been proposed to mitigate the issue, their performances are still far away from satisfactory, and the root cause of the problem has not been fully recovered. For the first time in the literature, we in this paper show that extit{harmful perturbation} over the model weights should be the root cause of alignment-broken of harmful fine-tuning. In order to attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer in the alignment stage's optimization. The regularizer ensures that the model's harmful loss reduction before/after simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at url{https://github.com/git-disl/Booster}.

Problem

Research questions and friction points this paper is trying to address.

Address harmful fine-tuning attacks on large language models

Mitigate harmful perturbation causing alignment-broken issues

Propose Booster to reduce harmful scores while maintaining task performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Attenuates harmful perturbation in model weights

Appends loss regularizer during alignment stage

Reduces harmful score while maintaining task performance

🔎 Similar Papers

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning