GradShield: Alignment Preserving Finetuning

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

career value

184K/year

🤖 AI Summary

This work addresses the vulnerability of large language models to implicit or explicit harmful data during fine-tuning, which can compromise safety alignment. To mitigate this risk, the authors propose GradShield, a novel method that introduces the first gradient-based mechanism for assessing implicit harmfulness. GradShield computes a Fine-tuning Implicit Harmfulness Score (FIHS) and employs an adaptive thresholding algorithm to dynamically filter potentially harmful samples prior to fine-tuning. This approach significantly enhances alignment robustness while preserving model utility. Experimental results demonstrate that GradShield consistently suppresses attack success rates (ASR) below 6% across diverse fine-tuning tasks without degrading model performance, outperforming all existing baseline methods.

📝 Abstract

Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.

Problem

Research questions and friction points this paper is trying to address.

safety alignment

harmful data

fine-tuning

large language models

misalignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

GradShield

alignment preservation

implicit harmfulness

adaptive thresholding

safety finetuning

🔎 Similar Papers

Gradient Alignment Improves Test-Time Adaptation for Medical Image Segmentation

2024-08-14arXiv.orgCitations: 0

Enhancing Large Language Model Performance with Gradient-Based Parameter Selection

2024-06-21Citations: 1

Prompt-aligned Gradient for Prompt Tuning

2022-05-30IEEE International Conference on Computer VisionCitations: 320