🤖 AI Summary
This work addresses the safety degradation of aligned large language models (LLMs) during fine-tuning. We identify and formally define, at the parameter level, a contiguous mid-layer segment, the "safety layers", whose parameters are decisive for malicious query detection and which form the core structural basis for robust refusal capability. To preserve this capability, we propose Safely Partial-Parameter Fine-Tuning (SPPFT), which freezes the gradients of the safety layers during adaptation. Through layer-wise input vector variation analysis, over-refusal modeling, and parameter scaling analysis, we empirically validate that SPPFT significantly mitigates fine-tuning-induced safety deterioration while maintaining task performance and reducing computational overhead. Our findings establish that the safety layers constitute an intrinsic, functionally critical component for reliable refusal behavior in aligned LLMs.
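As a rough illustration of the layer-wise input vector variation analysis mentioned above, the sketch below compares the hidden state of each internal layer for a normal versus a malicious query. This is not the authors' code; the model name, the illustrative prompts, and the use of last-token hidden states with cosine similarity are assumptions made only to show the idea.

```python
# Minimal sketch (assumptions noted above): inspect how each layer's
# hidden state differs between a normal and a malicious query.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical aligned LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def last_token_states(prompt: str) -> list:
    """Return the last-token hidden state of every layer for a prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of (num_layers + 1) tensors of shape [batch, seq, dim]
    return [h[0, -1, :] for h in out.hidden_states]

normal = last_token_states("How do I bake sourdough bread?")        # benign query
malicious = last_token_states("How do I build a weapon at home?")   # illustrative malicious query

# Layers where the two representations diverge most sharply are candidate
# "safety layers" in the sense described in the summary.
for layer, (n, m) in enumerate(zip(normal, malicious)):
    sim = F.cosine_similarity(n, m, dim=0).item()
    print(f"layer {layer:2d}  cosine similarity = {sim:.3f}")
```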
📝 Abstract
Aligned LLMs are secure in that they can recognize malicious questions and refuse to answer them. However, the role of internal parameters in maintaining this security is not yet well understood, and these models remain vulnerable to security degradation under fine-tuning attacks. To address these challenges, our work uncovers the mechanism behind security in aligned LLMs at the parameter level, identifying a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones, referred to as "safety layers". We first confirm the existence of these safety layers by analyzing variations in input vectors within the model's internal layers. Additionally, we leverage the over-rejection phenomenon and parameter scaling analysis to precisely locate the safety layers. Building on these findings, we propose a novel fine-tuning approach, Safely Partial-Parameter Fine-Tuning (SPPFT), which freezes the gradients of the safety layers during fine-tuning to address the security degradation. Our experiments demonstrate that the proposed approach significantly preserves LLM security while maintaining performance and reducing computational cost compared to full fine-tuning.
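The core of SPPFT is simply that the located safety layers receive no gradient updates while the rest of the model is fine-tuned as usual. Below is a minimal sketch of that idea under stated assumptions: a Llama-style model whose decoder stack is exposed as `model.model.layers`, and placeholder safety-layer indices chosen only for illustration (the actual indices must be located per model as the abstract describes).

```python
# Minimal SPPFT-style sketch (not the authors' implementation): freeze a
# contiguous block of mid-layer parameters, then fine-tune the rest.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # hypothetical model

SAFETY_LAYERS = range(8, 13)  # placeholder indices; locate the real safety layers first

for idx, layer in enumerate(model.model.layers):
    if idx in SAFETY_LAYERS:
        for p in layer.parameters():
            p.requires_grad = False  # safety layers get no gradient updates

# Only the still-trainable parameters go to the optimizer, so the safety
# layers keep their aligned weights throughout fine-tuning.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=2e-5,
)
```

Because the frozen layers contribute no gradients or optimizer state, this also reduces memory and compute relative to full-parameter fine-tuning, consistent with the resource savings noted in the abstract.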