Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance

📅 2026-01-05

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work addresses the problem that fine-tuning safety-aligned large language models (LLMs) often substantially degrades their safety. Existing remediation approaches typically require many safety samples or calibration sets, which incurs significant computational overhead during realignment and noticeably degrades model utility. The authors propose a repair method that uses only a single safety example to fully recover safety alignment within a few training epochs, without sacrificing utility. They further show that the safety gradient has a low-rank structure, which explains why such efficient single-example correction is possible. Experiments across five safety-aligned LLMs and multiple datasets show that recovery is effective regardless of model size or the number of harmful examples used in fine-tuning.

📝 Abstract

Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also lead to noticeable degradation in model utility. Contrary to this belief, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.
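The abstract's low-rank claim can be made concrete with a small numerical check. The sketch below is not the paper's procedure or data: it builds a synthetic matrix standing in for a gradient (an assumption made purely for illustration) and measures how much of its squared singular-value mass sits in the top few directions. The helper names `top_k_energy` and `synthetic_gradient` are hypothetical.

```python
import numpy as np


def top_k_energy(matrix: np.ndarray, k: int) -> float:
    """Return the fraction of squared singular-value mass in the top-k singular values."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)  # sorted descending
    energy = singular_values ** 2
    total = energy.sum()
    if total == 0.0:
        return 0.0
    return float(energy[:k].sum() / total)


def synthetic_gradient(rows: int, cols: int, rank: int,
                       noise_scale: float, seed: int = 0) -> np.ndarray:
    """Hypothetical stand-in for a gradient: a rank-`rank` matrix plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    left = rng.standard_normal((rows, rank))
    right = rng.standard_normal((rank, cols))
    noise = noise_scale * rng.standard_normal((rows, cols))
    return left @ right + noise


if __name__ == "__main__":
    low_rank = synthetic_gradient(64, 64, rank=2, noise_scale=0.01)
    dense = np.random.default_rng(1).standard_normal((64, 64))
    print("top-2 energy, low-rank stand-in:", top_k_energy(low_rank, 2))
    print("top-2 energy, dense random matrix:", top_k_energy(dense, 2))
```

A ratio close to 1 for small `k` means a few directions capture almost all of the matrix, which is the sense of "low-rank" used here. Running the same measurement on real per-layer gradients would require the model and safety example from the paper, which this sketch does not assume.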

Problem

Research questions and friction points this paper is trying to address.

safety alignment

fine-tuning

large language models

computational overhead

model utility

Innovation

Methods, ideas, or system contributions that make the work stand out.

one-shot safety alignment

low-rank gradient

fine-tuned LLMs

safety recovery

minimal calibration

🔎 Similar Papers

Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models

2024-05-27 | arXiv.org | Citations: 15

Safety Layers in Aligned Large Language Models: The Key to LLM Security

2024-08-30 | Citations: 5