🤖 AI Summary
Fine-tuning large language models (LLMs) can degrade safety alignment even after minimal exposure to malicious, or benign but misaligned, data. This paper introduces the concept of a "narrow safety basin," which characterizes the geometric relationship between the alignment direction and safety robustness. Building on this insight, the authors propose a safety-aware fine-tuning paradigm anchored on the alignment direction: they construct an alignment-direction vector as the weight difference between aligned and unaligned models, and add a direction-aware regularization term that suppresses harmful parameter updates while remaining fully compatible with parameter-efficient fine-tuning methods such as LoRA. Evaluation across multiple benchmarks shows that the method reduces harmful behavior by 7.60% and improves task performance by 3.44% compared to Safe LoRA, while improving generalization and robustness against diverse safety threats.
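To make the geometry concrete, here is a minimal sketch (not the authors' released code; function names and shapes are illustrative) of the two ingredients the summary describes: an alignment-direction vector built from the weight difference between an aligned and an unaligned checkpoint, and the decomposition of a parameter update into its component along that direction and the orthogonal remainder.

```python
import torch

def alignment_direction(w_aligned: torch.Tensor, w_unaligned: torch.Tensor) -> torch.Tensor:
    """Unit vector pointing from the unaligned weights toward the aligned ones."""
    d = (w_aligned - w_unaligned).flatten()
    return d / d.norm()

def decompose_update(delta_w: torch.Tensor, d_unit: torch.Tensor):
    """Split a flattened update into alignment-parallel and orthogonal parts."""
    delta = delta_w.flatten()
    parallel = (delta @ d_unit) * d_unit  # projection onto the alignment direction
    orthogonal = delta - parallel         # component that can leave the safety basin
    return parallel, orthogonal
```

Per the paper's observation, moving along `parallel` preserves safety, while the `orthogonal` component is what the regularizer should suppress.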
📝 Abstract
Large language models (LLMs) are vulnerable to safety risks during fine-tuning, where even small amounts of malicious or seemingly harmless data can compromise existing safeguards. Building on the concept of the alignment direction, defined by the weight difference between aligned and unaligned models, we observe that perturbations along this direction preserve model safety, whereas perturbations along orthogonal directions rapidly degrade it; the safe region of parameter space thus forms a narrow safety basin. Based on this insight, we propose AsFT (Anchoring Safety in Fine-Tuning), a safety fine-tuning method that integrates a regularization term into the training objective. This term uses the alignment direction as an anchor to suppress updates in harmful directions, constraining fine-tuning within the narrow safety basin. Extensive experiments on multiple datasets show that AsFT outperforms Safe LoRA, reducing harmful behavior by 7.60%, improving model performance by 3.44%, and maintaining robust performance across various experimental settings. Code is available at https://github.com/PKU-YuanGroup/AsFT.
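A hedged sketch of the regularized objective the abstract describes, assuming a per-layer LoRA update ΔW = B·A; `lambda_reg`, `d_unit`, and `asft_loss` are illustrative names, not from the paper's released code, and the squared-norm penalty on the orthogonal component is one plausible instantiation of "suppressing updates in harmful directions."

```python
import torch

def asft_loss(task_loss: torch.Tensor,
              lora_B: torch.Tensor, lora_A: torch.Tensor,
              d_unit: torch.Tensor, lambda_reg: float = 1.0) -> torch.Tensor:
    """Task loss plus a penalty on the update component orthogonal to d_unit.

    d_unit: unit-norm flattened alignment direction, same size as B @ A.
    """
    delta = (lora_B @ lora_A).flatten()   # low-rank fine-tuning update, ΔW = B A
    parallel = (delta @ d_unit) * d_unit  # safe component along the alignment direction
    orthogonal = delta - parallel         # potentially harmful component
    return task_loss + lambda_reg * orthogonal.pow(2).sum()
```

Because the penalty only touches the LoRA factors, the anchor constraint adds negligible overhead and slots into a standard parameter-efficient training loop.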