Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

📅 2026-05-07
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the vulnerability of safety-aligned large language models to harmful fine-tuning (HFT) attacks. Conventional defenses constrain parameters, gradients, or internal representations, but the redundancy of the high-dimensional parameter space lets persistent attackers circumvent them. The authors propose Safety Bottleneck Regularization (SBR), which concentrates protection on the unembedding (output embedding) layer, a geometric bottleneck in the model architecture. SBR anchors the final hidden states of harmful queries to those of the safety-aligned model, so the model keeps producing safe responses while retaining competitive performance on benign downstream tasks. Experiments show that a single safety anchor is enough to reduce the Harmful Score to below 10 under persistent HFT.
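The bottleneck idea can be illustrated with a toy example. This is a minimal sketch, not the paper's code: the two-parameter "network", the matrix `W_U`, and all function names are hypothetical. It shows that next-token probabilities depend on the parameters only through the final hidden state, so two different parameter settings that yield the same final hidden state yield the same output distribution.

```python
import math

def final_hidden(params, x):
    """Toy two-parameter 'network' (hypothetical): scale the input by a, then by b."""
    a, b = params
    return [b * (a * xi) for xi in x]

def unembed(W_U, h):
    """Unembedding step: logits[i] is the dot product of row i of W_U with h."""
    return [sum(w * hi for w, hi in zip(row, h)) for row in W_U]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(params, W_U, x):
    """Output distribution: a function of the final hidden state only."""
    return softmax(unembed(W_U, final_hidden(params, x)))

# Assumed toy sizes: hidden dimension 2, vocabulary of 3 tokens.
W_U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [0.5, -1.0]

# Two different points in parameter space (redundancy: a * b is the same).
params_reference = (2.0, 3.0)
params_shifted = (6.0, 1.0)

same_hidden = final_hidden(params_reference, x) == final_hidden(params_shifted, x)
same_probs = next_token_probs(params_reference, W_U, x) == next_token_probs(params_shifted, W_U, x)
print(same_hidden, same_probs)  # prints: True True
```

The same redundancy works in the other direction: a constraint that only limits how far parameters move says little about what happens at the output, which is why constraining the final hidden state directly is the more targeted choice in this toy setting.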
📝 Abstract
The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while deceptively adhering to safety restrictions. To address this, we propose Safety Bottleneck Regularization (SBR). SBR shifts the defensive focus from the redundant parameter space to the unembedding layer, which serves as a geometric bottleneck. By anchoring the final hidden states of harmful queries to those of the safety-aligned model, SBR enables the model to maintain safe responses even under persistent HFT. Extensive experiments confirm SBR's effectiveness, demonstrating that utilizing just a single safety anchor is sufficient to reduce the Harmful Score to $<$10 while preserving competitive performance on benign downstream tasks.
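The anchoring step described in the abstract can be sketched as a regularized objective. The abstract does not give the loss formula, so the squared Euclidean distance, the weight `lam`, and every function name below are assumptions made for illustration rather than the authors' implementation.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def anchor_penalty(h_finetuned, h_anchor):
    """Mean squared distance between the fine-tuned model's final hidden states
    and the safety-aligned reference states, over the harmful anchor queries.
    (Assumed distance measure; the source does not specify one.)"""
    if len(h_finetuned) != len(h_anchor):
        raise ValueError("need one reference state per anchor query")
    total = sum(squared_distance(h, r) for h, r in zip(h_finetuned, h_anchor))
    return total / len(h_anchor)

def sbr_style_objective(task_loss, h_finetuned, h_anchor, lam=1.0):
    """Fine-tuning loss plus a weighted anchoring penalty (lam is hypothetical)."""
    return task_loss + lam * anchor_penalty(h_finetuned, h_anchor)

# A single safety anchor, as in the abstract's headline result.
h_anchor = [[1.0, 0.0, -1.0]]    # final hidden state from the aligned model
h_no_drift = [[1.0, 0.0, -1.0]]  # fine-tuned model still matches the anchor
h_drifted = [[1.0, 2.0, -1.0]]   # fine-tuning has moved the hidden state

print(sbr_style_objective(0.5, h_no_drift, h_anchor, lam=0.5))  # 0.5
print(sbr_style_objective(0.5, h_drifted, h_anchor, lam=0.5))   # 2.5
```

In this sketch the penalty is zero while the final hidden state on the harmful query stays at the anchor and grows as fine-tuning pulls it away, regardless of which upstream parameters caused the drift.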
Problem

Research questions and friction points this paper is trying to address.

Safety Alignment
Harmful Fine-tuning
Large Language Models
Defense Vulnerability
Geometric Bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Safety Bottleneck Regularization
Harmful Fine-tuning
Geometric Bottleneck
Unembedding Layer
Safety Alignment