🤖 AI Summary
This work addresses the vulnerability of large language models to harmful fine-tuning attacks, which compromise safety alignment and elicit unsafe outputs. To counter this threat, the authors propose SPARD, a defense framework that combines explicit safety projection with a determinantal point process (DPP)-based strategy for selecting safe training data that is both diverse and task-relevant. During optimization, SPARD's SPAG algorithm alternates between utility updates that preserve task performance and explicit safety projections that enforce safety constraints on the selected safe data. In experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks, SPARD achieves the lowest average attack success rates, outperforming existing defenses while maintaining high task accuracy.
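The alternating pattern described above can be illustrated with a toy problem. The sketch below is not the SPAG algorithm: it assumes a convex quadratic utility loss and a single linear safety constraint `a @ theta <= b`, both chosen purely for illustration, and all function names are hypothetical.

```python
import numpy as np


def utility_step(theta, target, lr=0.1):
    # Assumed toy utility loss: 0.5 * ||theta - target||^2, whose gradient is (theta - target).
    return theta - lr * (theta - target)


def safety_projection(theta, a, b):
    # Assumed toy safety constraint: a @ theta <= b.
    # Euclidean projection onto this half-space; a no-op when already feasible.
    violation = a @ theta - b
    if violation <= 0:
        return theta
    return theta - (violation / (a @ a)) * a


def alternating_optimize(theta0, target, a, b, steps=200, lr=0.1):
    # Alternate one utility update with one safety projection per iteration.
    theta = theta0.astype(float)
    for _ in range(steps):
        theta = utility_step(theta, target, lr)
        theta = safety_projection(theta, a, b)
    return theta


a = np.array([1.0, 0.0])  # hypothetical constraint direction
b = 1.0                   # hypothetical constraint bound
target = np.array([2.0, 2.0])
theta = alternating_optimize(np.zeros(2), target, a, b)
```

In this toy setting the unconstrained coordinate moves toward the utility optimum while the constrained coordinate stays on the boundary of the safe region. In a real fine-tuning setup the projection would presumably be defined with respect to a loss on safe data rather than a fixed half-space.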
📝 Abstract
Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections on a set of safe data to enforce safety constraints. To curate this safe data, we introduce a Relevance-Diversity Determinantal Point Process that selects a compact set of safe examples, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
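To make the relevance-diversity trade-off concrete, the following is a minimal sketch of greedy MAP selection under a standard quality-diversity DPP kernel. It is an illustration only, not the selection procedure from the SPARD repository: the kernel form `L = diag(q) S diag(q)`, the cosine similarity `S`, the relevance scores `q`, and all function names are assumptions.

```python
import numpy as np


def build_kernel(embeddings, relevance):
    # Assumed kernel: L[i, j] = q_i * cos_sim(x_i, x_j) * q_j.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    return relevance[:, None] * similarity * relevance[None, :]


def greedy_dpp(kernel, k):
    # Greedily add the item that maximizes log det of the selected submatrix.
    selected = []
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for i in range(kernel.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:
            break  # no remaining item gives a positive determinant
        selected.append(best_item)
    return selected


# Hypothetical data: items 0 and 1 are near-duplicates, item 2 is distinct but less relevant.
embeddings = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
relevance = np.array([1.0, 0.95, 0.6])
selected = greedy_dpp(build_kernel(embeddings, relevance), k=2)
```

Here the most relevant item is picked first, and the second pick skips its near-duplicate in favor of the less relevant but dissimilar item, because the determinant penalizes redundancy among selected examples.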