GradShield: Alignment Preserving Finetuning

📅 2026-05-13
📈 Citations: 0
Influential: 0
📄 PDF

career value

194K/year
🤖 AI Summary
This work addresses the vulnerability of large language models to implicit or explicit harmful data during fine-tuning, which can compromise safety alignment. To mitigate this risk, the authors propose GradShield, a novel method that introduces the first gradient-based mechanism for assessing implicit harmfulness. GradShield computes a Fine-tuning Implicit Harmfulness Score (FIHS) and employs an adaptive thresholding algorithm to dynamically filter potentially harmful samples prior to fine-tuning. This approach significantly enhances alignment robustness while preserving model utility. Experimental results demonstrate that GradShield consistently suppresses attack success rates (ASR) below 6% across diverse fine-tuning tasks without degrading model performance, outperforming all existing baseline methods.
📝 Abstract
Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.
Problem

Research questions and friction points this paper is trying to address.

safety alignment
harmful data
fine-tuning
large language models
misalignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

GradShield
alignment preservation
implicit harmfulness
adaptive thresholding
safety finetuning
🔎 Similar Papers