🤖 AI Summary
This work identifies a fundamental cause of safety guardrail degradation during downstream fine-tuning of large language models (LLMs): excessive representational similarity between upstream alignment data and downstream task data significantly undermines model safety. We systematically investigate alignment degradation through the novel lens of cross-dataset representational similarity—marking the first such analysis—and propose the paradigm "upstream data design determines guardrail persistence," shifting from passive defenses (e.g., gradient correction or continual alignment) to proactive, data-centric safety engineering. Leveraging cosine similarity–based representation analysis, multi-dimensional safety evaluation (e.g., harmfulness scores), and controlled comparative fine-tuning experiments, we validate our findings across mainstream models including Llama-3 and Qwen. Results show that low-similarity upstream–downstream configurations reduce the harmfulness score by up to 10.33% and substantially improve jailbreak resistance—yielding actionable, empirically grounded guidelines for safe fine-tuning data selection.
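The cosine similarity–based representation analysis mentioned above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual pipeline: the embedding step is mocked with synthetic vectors (in practice one would extract hidden-state representations of both datasets from the model being fine-tuned), and the mean-pooling choice is an assumption.

```python
import numpy as np

def dataset_similarity(upstream_emb: np.ndarray, downstream_emb: np.ndarray) -> float:
    """Cosine similarity between mean-pooled dataset representations."""
    u = upstream_emb.mean(axis=0)   # centroid of upstream alignment data
    d = downstream_emb.mean(axis=0) # centroid of downstream task data
    return float(u @ d / (np.linalg.norm(u) * np.linalg.norm(d)))

# Synthetic stand-ins for per-example embeddings (dim 768, 100 examples each).
rng = np.random.default_rng(0)
base = rng.normal(size=768)
upstream = base + 0.1 * rng.normal(size=(100, 768))     # alignment dataset
similar_task = base + 0.1 * rng.normal(size=(100, 768)) # high-overlap downstream task
dissimilar_task = rng.normal(size=(100, 768))           # unrelated downstream task

print(dataset_similarity(upstream, similar_task))     # near 1: risky configuration
print(dataset_similarity(upstream, dissimilar_task))  # near 0: safer configuration
```

Under the paper's finding, the first (high-similarity) configuration would be expected to weaken safety guardrails during fine-tuning, while the second would preserve them.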
📝 Abstract
Recent advancements in large language models (LLMs) have underscored their vulnerability to jailbreaks of safety alignment, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models and thus reduces the harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in building durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.