Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink

📅 2026-02-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the critical issue that harmful fine-tuning can compromise the safety alignment of large language models, introducing severe security risks. The authors propose the "separable sink-divergence hypothesis," which, for the first time, establishes a direct link between the sign of an attention head's sink divergence and its propensity for harmful behavior. Building on this insight, they introduce a defense mechanism grounded in the attention sink principle: by incorporating a regularization term that steers attention heads toward negative sink divergence, the method actively suppresses the acquisition of harmful patterns during fine-tuning. Combining sink-divergence statistical analysis, attention modulation, and regularized optimization, the approach achieves notable improvements in defense performance of 5.90%, 11.25%, and 9.55% on the BeaverTails, HarmBench, and SorryBench benchmarks, respectively.

๐Ÿ“ Abstract
Harmful fine-tuning can invalidate the safety alignment of large language models, exposing significant safety risks. In this paper, we utilize the attention sink mechanism to mitigate harmful fine-tuning. Specifically, we first measure a statistic named sink divergence for each attention head and observe that different attention heads exhibit two different signs of sink divergence. To understand its safety implications, we conduct experiments and find that the number of attention heads with positive sink divergence grows as the model's harmfulness increases under harmful fine-tuning. Based on this finding, we propose a separable sink divergence hypothesis: attention heads associated with learning harmful patterns during fine-tuning are separable by the sign of their sink divergence. Building on this hypothesis, we propose a fine-tuning-stage defense, dubbed Surgery. Surgery utilizes a regularizer for sink divergence suppression, which steers attention heads toward the negative sink divergence group, thereby reducing the model's tendency to learn and amplify harmful patterns. Extensive experiments demonstrate that Surgery improves defense performance by 5.90%, 11.25%, and 9.55% on the BeaverTails, HarmBench, and SorryBench benchmarks, respectively. Source code is available at https://github.com/Lslland/Surgery.
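The abstract describes measuring a per-head sink-divergence statistic and adding a regularizer that pushes heads toward the negative-divergence group. The paper's exact definition of sink divergence is not given here, so the sketch below uses an assumed placeholder definition (attention mass on the first "sink" token relative to a uniform baseline) and a hinge-style penalty on positive-divergence heads; both choices are illustrative, not the authors' formulation.

```python
import numpy as np

def sink_divergence(attn, baseline=None):
    """Per-head sink divergence (assumed definition for illustration):
    the mean attention mass each head places on the first 'sink' token,
    minus a baseline.

    attn: array of shape (heads, queries, keys), rows softmax-normalized.
    """
    # Attention mass on the sink (first) token, averaged over query positions.
    sink_mass = attn[:, :, 0].mean(axis=1)      # shape: (heads,)
    if baseline is None:
        baseline = 1.0 / attn.shape[-1]         # uniform-attention baseline
    return sink_mass - baseline

def sink_suppression_penalty(divergence):
    """Hinge-style regularizer: penalize only positive-divergence heads,
    steering them toward the negative sink-divergence group."""
    return np.maximum(divergence, 0.0).sum()
```

In a training loop, this penalty (scaled by a coefficient) would be added to the fine-tuning loss; heads already in the negative group contribute nothing to the gradient.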
Problem

Research questions and friction points this paper is trying to address.

harmful fine-tuning
safety alignment
large language models
attention sink
sink divergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

attention sink
sink divergence
harmful fine-tuning
safety alignment
Surgery
Guozhi Liu
School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China.
Weiwei Lin
School of Physics, Southeast University
Condensed matter physics, material science, nanotechnology, magnetism, spintronics
Tiansheng Huang
Georgia Institute of Technology
Parallel and Distributed Computing, Distributed machine learning, LLM safety
Ruichao Mo
School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China.
Qi Mu
School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China.
Xiumin Wang
South China University of Technology
Federated learning, Edge intelligent computing, Internet of things, Mobile Crowdsensing
Li Shen
Associate Professor, Sun Yat-sen University
Machine Learning, Optimization