🤖 AI Summary
This work addresses the lack of effective defenses against security risks—such as backdoor attacks and jailbreaking—introduced during instruction tuning of large language models (LLMs). We propose SWAT, a secure fine-tuning strategy. Methodologically, SWAT conducts the first systematic, module-level analysis of parameter influence on safety-feature-space drift and introduces an “early-stage learning load transfer” mechanism. It warm-starts with a robust module subset (Mods_Rob) to capture foundational safety features, then performs full-parameter fine-tuning while dynamically freezing non-robust modules to suppress vulnerabilities. SWAT forms a plug-and-play, pretraining- and post-training-compatible security fine-tuning framework. Experiments across multiple datasets, models, and attack scenarios demonstrate that SWAT significantly reduces attack success rates, incurs near-zero task performance degradation, and synergistically enhances existing defense methods.
📝 Abstract
Instruction fine-tuning has emerged as a critical technique for customizing Large Language Models (LLMs) to specific applications. However, recent studies have highlighted significant security vulnerabilities in fine-tuned LLMs. Existing defense efforts focus largely on pre-training and post-training methods, while in-training methods remain underexplored. To fill this gap, we introduce a novel secure-tuning strategy called SWAT. By analyzing how module-level parameters (e.g., Q/K/V/O) affect drift in the security feature space, we identify a robust subset of modules, termed Mods_Rob. Our SWAT strategy begins by warming up Mods_Rob to capture low-level features with minimal security risk, then trains all parameters to achieve optimal task performance. Essentially, this strategy shifts the early learning burden from the global parameters to Mods_Rob, reducing the update magnitudes of the non-robust subset. Across various datasets, scenarios, and LLMs, our strategy mitigates security risks significantly while preserving task performance. Importantly, it can be seamlessly integrated with pre-training and post-training methods, yielding further improvements.
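The two-phase schedule described above (warm up Mods_Rob, then train everything) can be sketched as a simple step-dependent trainability rule. This is a minimal illustrative sketch, not the paper's implementation: the module names, the choice of which modules form Mods_Rob, and the warm-up length are all assumptions made for the example.

```python
# Hedged sketch of SWAT's two-phase training schedule.
# ROBUST_MODULES stands in for Mods_Rob; the specific names and the
# warm-up length are illustrative assumptions, not from the paper.

ROBUST_MODULES = frozenset({"q_proj", "k_proj"})        # hypothetical Mods_Rob
ALL_MODULES = frozenset({"q_proj", "k_proj", "v_proj", "o_proj"})


def trainable_modules(step: int, warmup_steps: int) -> frozenset:
    """Return the set of modules that receive gradient updates at `step`.

    Phase 1 (step < warmup_steps): only the robust subset Mods_Rob is
    trained, so it absorbs the early learning load while the non-robust
    modules stay frozen. Phase 2: all parameters are trained; because
    Mods_Rob already captured the low-level features, the updates to the
    non-robust subset are smaller in magnitude.
    """
    if step < warmup_steps:
        return ROBUST_MODULES
    return ALL_MODULES
```

In a real fine-tuning loop this rule would translate to toggling each parameter's `requires_grad` flag (in PyTorch terms) at the phase boundary, rather than returning a set of names.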