🤖 AI Summary
This work addresses the lack of effective defenses against security risks—such as backdoor attacks and jailbreaking—introduced during instruction tuning of large language models (LLMs). We propose SWAT, a secure fine-tuning strategy. Methodologically, SWAT conducts the first systematic, module-level analysis of parameter influence on safety-feature-space drift and introduces an “early-stage learning load transfer” mechanism. It warm-starts with a robust module subset (Mods_Rob) to capture foundational safety features, then performs full-parameter fine-tuning while dynamically freezing non-robust modules to suppress vulnerabilities. SWAT forms a plug-and-play, pretraining- and post-training-compatible security fine-tuning framework. Experiments across multiple datasets, models, and attack scenarios demonstrate that SWAT significantly reduces attack success rates, incurs near-zero task performance degradation, and synergistically enhances existing defense methods.
📝 Abstract
Instruction fine-tuning has emerged as a critical technique for customizing Large Language Models (LLMs) to specific applications. However, recent studies have highlighted significant security vulnerabilities in fine-tuned LLMs. Existing defense efforts focus largely on pre-training and post-training methods, while in-training methods remain underexplored. To fill this gap, we introduce a novel secure-tuning strategy called SWAT. By analyzing how module-level parameters (e.g., Q/K/V/O) affect drift in the security feature space, we identify a robust subset of modules, termed Mods_Rob. Our SWAT strategy begins by warming up Mods_Rob to capture low-level features with minimal security risk, then trains all parameters to achieve optimal task performance. Essentially, this strategy shifts the early learning burden from the global parameters to Mods_Rob, reducing the update magnitudes of the non-robust subset. Across various datasets, scenarios, and LLMs, our strategy mitigates security risks significantly while preserving task performance. Importantly, it can be seamlessly integrated with pre-training and post-training methods, yielding further improvements.
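The two-phase schedule described above (warm up Mods_Rob, then train everything) can be sketched as a simple step-dependent trainability rule. This is a minimal illustrative sketch, not the paper's implementation: the module names, the choice of which modules form Mods_Rob, and the warm-up length are all assumptions made for the example.

```python
# Hedged sketch of SWAT's two-phase training schedule.
# ROBUST_MODULES stands in for Mods_Rob; the specific names and the
# warm-up length are illustrative assumptions, not from the paper.

ROBUST_MODULES = frozenset({"q_proj", "k_proj"})        # hypothetical Mods_Rob
ALL_MODULES = frozenset({"q_proj", "k_proj", "v_proj", "o_proj"})


def trainable_modules(step: int, warmup_steps: int) -> frozenset:
    """Return the set of modules that receive gradient updates at `step`.

    Phase 1 (step < warmup_steps): only the robust subset Mods_Rob is
    trained, so it absorbs the early learning load while the non-robust
    modules stay frozen. Phase 2: all parameters are trained; because
    Mods_Rob already captured the low-level features, the updates to the
    non-robust subset are smaller in magnitude.
    """
    if step < warmup_steps:
        return ROBUST_MODULES
    return ALL_MODULES
```

In a real fine-tuning loop this rule would translate to toggling each parameter's `requires_grad` flag (in PyTorch terms) at the phase boundary, rather than returning a set of names.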