Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current safety mechanisms for large language models (LLMs) suffer from poor interpretability and struggle to enable precise interventions across diverse tasks. This work proposes the Expected Safety Impact (ESI) framework, which systematically uncovers, for the first time, the distribution patterns of safety-critical parameters across LLMs with different architectures. Building on these insights, the authors introduce two efficient intervention paradigms: Safety Enhancement Tuning (SET) to improve the safety of unaligned models, and Safety Preserving Adaptation (SPA) to maintain the safety of aligned models during adaptation. Experiments demonstrate that SET reduces attack success rates by over 50% with updates to only 1% of parameters in just 100 iterations, while SPA constrains safety degradation to less than 1% even after 1,000 instruction-tuning steps.
📝 Abstract
Ensuring Large Language Model (LLM) safety is crucial, yet the lack of a clear understanding of safety mechanisms hinders the development of precise and reliable methodologies for safety intervention across diverse tasks. To better understand and control LLM safety, we propose the Expected Safety Impact (ESI) framework for quantifying how different parameters affect LLM safety. Based on ESI, we reveal distinct safety-critical patterns across different LLM architectures: in dense LLMs, many safety-critical parameters are located in the value matrices (V) and MLPs of the middle layers, whereas in Mixture-of-Experts (MoE) models, they shift to the late-layer MLPs. Leveraging ESI, we further introduce two targeted intervention paradigms for safety enhancement and preservation, i.e., Safety Enhancement Tuning (SET) and Safety Preserving Adaptation (SPA). SET can align unsafe LLMs by updating only a few safety-critical parameters, effectively enhancing safety while preserving original performance. SPA safeguards well-aligned LLMs during capability-oriented intervention (e.g., instruction tuning) by preventing disruption of safety-critical weights, allowing the LLM to acquire new abilities while maintaining its safety capabilities. Extensive evaluations on different LLMs demonstrate that SET can reduce the attack success rates of unaligned LLMs by over 50% with only a 100-iteration update on 1% of model weights, and that SPA can limit the safety degradation of aligned LLMs to within 1% after 1,000 iterations of instruction fine-tuning on different tasks. Our code is available at: https://github.com/ZJU-LLM-Safety/SafeWeights-ACL.
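The abstract's two paradigms reduce to a selection-and-masking scheme: rank parameters by an ESI score, take the top ~1% as safety-critical, then either update only those parameters (SET) or freeze exactly those parameters during instruction tuning (SPA). The paper does not publish the ESI formula here, so the sketch below uses fabricated per-parameter scores purely to illustrate the selection and masking logic; `select_safety_critical` and `masked_update` are hypothetical names, not the authors' API.

```python
# Hedged sketch of SET/SPA-style interventions. ESI scores are fabricated;
# in the actual framework they would be computed from the model and data.

def select_safety_critical(esi_scores, fraction=0.01):
    """Return indices of the top `fraction` of parameters by ESI score."""
    k = max(1, int(len(esi_scores) * fraction))
    ranked = sorted(range(len(esi_scores)),
                    key=lambda i: esi_scores[i], reverse=True)
    return set(ranked[:k])

def masked_update(params, grads, critical, lr=0.1, mode="SPA"):
    """One gradient step with a safety mask.

    SPA: freeze safety-critical parameters (preserve alignment while tuning).
    SET: update ONLY safety-critical parameters (align cheaply, touch ~1%).
    """
    out = []
    for i, (p, g) in enumerate(zip(params, grads)):
        if mode == "SPA":
            step = 0.0 if i in critical else lr * g
        else:  # SET
            step = lr * g if i in critical else 0.0
        out.append(p - step)
    return out

# Toy example: 200 parameters with made-up ESI scores.
esi = [(i * 37) % 200 / 200.0 for i in range(200)]
critical = select_safety_critical(esi, fraction=0.01)  # top 1% -> 2 indices
params = [1.0] * 200
grads = [0.5] * 200

spa_params = masked_update(params, grads, critical, mode="SPA")
set_params = masked_update(params, grads, critical, mode="SET")
```

In a real PyTorch training loop the same masking could be achieved by zeroing the relevant gradient entries (or setting `requires_grad=False` on whole tensors) before the optimizer step, but the index-level logic is the same as above.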
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Safety Mechanisms
Safety Intervention
Model Alignment
Parameter Safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expected Safety Impact
Safety-Critical Parameters
Safety Enhancement Tuning
Safety Preserving Adaptation
Large Language Model Safety