Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a central risk of Fine-tuning-as-a-Service (FaaS): fine-tuning a large language model on user data can weaken its safety alignment and induce harmful behaviors. The authors propose a Buffer-and-Reinforce fine-tuning framework that temporarily jailbreaks the model during adaptation to suppress harmful gradient updates, then restores its refusal capability afterward. The framework introduces a removable BufferLoRA adapter and a ReinforceLoRA module, and integrates ReinforceLoRA with the user's adapter (UserLoRA) through QR decomposition-based merging. It requires no additional safety data during user fine-tuning. The study also gives a gradient-level account of why temporary jailbreaking is protective: it saturates safety-degrading gradients while preserving benign, task-relevant ones. The reported experiments show a better balance of safety and task utility than existing approaches, at minimal computational overhead.
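The gradient-level intuition can be illustrated with a toy calculation. The sketch below is not the paper's analysis; it assumes a standard softmax cross-entropy loss and a hypothetical two-way choice between "refuse" and "comply", with made-up logits. Under those assumptions, a model that already complies receives an almost-zero gradient from a harmful "comply" target, while an aligned model receives a large one.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad_wrt_logits(logits, target):
    # Gradient of cross-entropy w.r.t. the logits: softmax(logits) - one_hot(target).
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def grad_norm(logits, target):
    # Euclidean norm of the logit gradient.
    return math.sqrt(sum(g * g for g in ce_grad_wrt_logits(logits, target)))

# Hypothetical toy setup: index 0 = "refuse", index 1 = "comply".
COMPLY = 1
aligned_logits = [4.0, -4.0]     # assumed: aligned model strongly prefers refusing
jailbroken_logits = [-4.0, 4.0]  # assumed: temporarily jailbroken model already complies

# A harmful training sample pushes the model toward "comply".
aligned_norm = grad_norm(aligned_logits, COMPLY)
jailbroken_norm = grad_norm(jailbroken_logits, COMPLY)

print(f"aligned model gradient norm:    {aligned_norm:.4f}")
print(f"jailbroken model gradient norm: {jailbroken_norm:.6f}")
```

The toy model covers only the harmful-sample side of the argument; whether benign task gradients are preserved depends on details that the summary does not specify.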

📝 Abstract

Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety alignment under harmful fine-tuning attacks. Recent work has shown that activating harmful-behavior modules during fine-tuning can prevent models from learning undesired behaviors, but the mechanism behind this effect remains unclear. In this paper, we revisit temporary jailbreaking as a defense against harmful fine-tuning and provide a gradient-level analysis showing that it saturates safety-degrading gradients while preserving benign task-relevant gradients. Based on this insight, we propose a Buffer-and-Reinforce fine-tuning framework that buffers harmful updates during user fine-tuning and reinforces safety after adaptation. Specifically, BufferLoRA, a removable adapter, induces temporary jailbreaking to reduce harmful updates during user fine-tuning. After adaptation, ReinforceLoRA, which is trained to recover refusal behavior under the temporarily jailbroken state, is integrated with UserLoRA via QR decomposition-based merging to reinforce safety while preserving user-task performance. Extensive experiments show that our framework achieves superior safety and utility with no additional safety data during user fine-tuning and minimal computational cost.
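The abstract does not spell out how QR decomposition enters the merging step, so the following is only a generic sketch of one way two LoRA updates can be combined with a QR factorization, not the authors' procedure. It assumes each adapter is a low-rank product `B @ A`; the shapes, ranks, random factors, and the helper `merge_lora_qr` are all hypothetical. Concatenating the two adapters and factoring the stacked `B` matrix gives a single adapter with an orthonormal basis whose product equals the sum of the two updates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 16, 12          # hypothetical layer dimensions
r_user, r_reinf = 2, 3        # hypothetical adapter ranks

# Hypothetical LoRA factors; each adapter contributes delta_W = B @ A.
B_user = rng.normal(size=(d_out, r_user))
A_user = rng.normal(size=(r_user, d_in))
B_reinf = rng.normal(size=(d_out, r_reinf))
A_reinf = rng.normal(size=(r_reinf, d_in))

def merge_lora_qr(B1, A1, B2, A2):
    """Merge two LoRA updates into one factored update via reduced QR.

    B1 @ A1 + B2 @ A2 == [B1 B2] @ [A1; A2] == Q @ (R @ [A1; A2]),
    where Q has orthonormal columns.
    """
    B_cat = np.concatenate([B1, B2], axis=1)   # (d_out, r1 + r2)
    A_cat = np.concatenate([A1, A2], axis=0)   # (r1 + r2, d_in)
    Q, R = np.linalg.qr(B_cat)                 # reduced QR of the stacked B
    return Q, R @ A_cat

B_merged, A_merged = merge_lora_qr(B_user, A_user, B_reinf, A_reinf)

# The merged adapter reproduces the sum of the two original updates.
expected = B_user @ A_user + B_reinf @ A_reinf
print(np.allclose(B_merged @ A_merged, expected))
```

In this formulation the merged adapter has rank at most `r_user + r_reinf`, and the orthonormal columns of `B_merged` make it straightforward to inspect or project out individual directions, which is one plausible reason to prefer a QR-based form over adding the dense updates directly.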

Problem

Research questions and friction points this paper is trying to address.

harmful fine-tuning

safety alignment

large language models

Fine-tuning-as-a-Service

jailbreaking

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporary jailbreaking

Buffer-and-Reinforce

safety alignment