🤖 AI Summary
Open-source large language models (LLMs) are vulnerable to malicious fine-tuning that circumvents their safety alignment. Method: The paper theoretically traces the success of malicious fine-tuning to the fragility of alignment mechanisms and proposes Self-Degraded Defense (SDD), a framework built on the paradigm of "controllable capability degradation": an SDD-aligned model answers harmful prompts with syntactically correct and semantically coherent yet task-irrelevant responses, and if attackers maliciously fine-tune it, its general capability drops sharply, leaving it unable to follow harmful instructions while normal use is unaffected. Contribution/Results: Experiments across diverse malicious fine-tuning attacks show that SDD substantially lowers harmful-instruction execution rates (average reduction of 76.3%) while preserving original task performance (<1.2% degradation), achieving an effective trade-off between security and usability.
📝 Abstract
Open-source Large Language Models (LLMs) often employ safety alignment methods to resist harmful instructions. However, recent research shows that maliciously fine-tuning these LLMs on harmful data can easily bypass such safeguards. To counter this, we theoretically uncover why malicious fine-tuning succeeds and identify potential defense strategies. Building on this analysis, we introduce the Self-Degraded Defense (SDD) framework. SDD encourages LLMs to produce high-quality but irrelevant responses to harmful prompts. When attackers attempt malicious fine-tuning, the general capability of an SDD-aligned LLM decreases significantly, rendering it incapable of following harmful instructions. Our experimental results confirm SDD's effectiveness against such attacks.
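The core idea, pairing harmful prompts with fluent but off-topic targets during alignment, can be sketched as a simple data-construction step. This is a hypothetical illustration, not the paper's implementation; the function and response pool below are invented for clarity.

```python
# Hypothetical sketch of SDD-style training-pair construction.
# Idea: instead of refusals, bind each harmful prompt to a coherent
# but task-irrelevant response, so that alignment ties harmful inputs
# to off-topic outputs (names below are illustrative, not from the paper).

IRRELEVANT_RESPONSES = [
    "The water cycle moves moisture between oceans, atmosphere, and land "
    "through evaporation, condensation, and precipitation.",
    "Photosynthesis converts sunlight, water, and carbon dioxide into "
    "glucose and oxygen inside plant chloroplasts.",
]

def build_sdd_pairs(harmful_prompts):
    """Map each harmful prompt to a fluent yet unrelated target response."""
    pairs = []
    for i, prompt in enumerate(harmful_prompts):
        # Cycle through the pool of high-quality, task-irrelevant responses.
        response = IRRELEVANT_RESPONSES[i % len(IRRELEVANT_RESPONSES)]
        pairs.append({"prompt": prompt, "response": response})
    return pairs

if __name__ == "__main__":
    demo = build_sdd_pairs(["<harmful instruction A>", "<harmful instruction B>"])
    for ex in demo:
        print(ex["prompt"], "->", ex["response"][:30])
```

The resulting pairs would then be used as supervised fine-tuning targets, so later malicious fine-tuning on harmful data degrades general capability rather than unlocking harmful behavior.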