Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

📅 2026-04-14

🤖 AI Summary

This work addresses the degradation of safety alignment in large language models (LLMs) during fine-tuning, where even benign adaptation can weaken refusal behavior and lead to harmful outputs. The authors first show theoretically that constraining either weights or activations in isolation is insufficient to preserve safety. They then propose Coupled Weight and Activation Constraints (CWAC), which restricts weight updates to a precomputed safety subspace while regularizing the safety-critical features identified by sparse autoencoders. Experiments across four widely used LLMs and diverse downstream tasks show that CWAC achieves the lowest harmful scores among the compared baselines with minimal impact on fine-tuning accuracy, including under high harmful data ratios.

📝 Abstract

Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning, where even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations in isolation, without considering their coupled effects on safety. In this paper, we first theoretically demonstrate that constraining either weights or activations alone is insufficient for safety preservation. To robustly preserve safety alignment, we propose Coupled Weight and Activation Constraints (CWAC), a novel approach that simultaneously enforces a precomputed safety subspace on weight updates and applies targeted regularization to safety-critical features identified by sparse autoencoders. Extensive experiments across four widely used LLMs and diverse downstream tasks show that CWAC consistently achieves the lowest harmful scores with minimal impact on fine-tuning accuracy, substantially outperforming strong baselines even under high harmful data ratios.
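The two constraints named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the orthonormal basis, the ReLU encoder form of the sparse autoencoder, the squared-difference penalty against a reference model, and all function names (`project_onto_subspace`, `sae_features`, `safety_feature_penalty`) are assumptions made for illustration only.

```python
# Toy sketch of a coupled weight/activation constraint (all names hypothetical).

def project_onto_subspace(update, basis):
    """Project a flat weight update onto span(basis).

    Assumption: `basis` holds orthonormal vectors spanning the
    precomputed safety subspace, so the projection is sum_k <u, b_k> b_k.
    """
    projected = [0.0] * len(update)
    for b in basis:
        coeff = sum(u * v for u, v in zip(update, b))
        for i, v in enumerate(b):
            projected[i] += coeff * v
    return projected


def sae_features(activation, encoder_rows, bias):
    """Encode an activation with a ReLU sparse-autoencoder encoder (assumed form)."""
    return [
        max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(encoder_rows, bias)
    ]


def safety_feature_penalty(act_new, act_ref, encoder_rows, bias, safety_idx):
    """Squared drift of the safety-critical SAE features.

    Assumption: the regularizer compares the fine-tuned model's features
    with those of the original aligned model on the same input.
    """
    f_new = sae_features(act_new, encoder_rows, bias)
    f_ref = sae_features(act_ref, encoder_rows, bias)
    return sum((f_new[i] - f_ref[i]) ** 2 for i in safety_idx)


# Weight side: only the component inside the subspace survives.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
constrained_update = project_onto_subspace([0.5, -0.2, 0.9], basis)

# Activation side: penalize drift in feature 1, treated here as safety-critical.
encoder_rows = [[1.0, 0.0], [0.0, 1.0]]
penalty = safety_feature_penalty(
    [1.0, 2.0], [1.0, 0.5], encoder_rows, [0.0, 0.0], safety_idx=[1]
)
```

In a training loop under these assumptions, the projected update would replace the raw gradient step, and the penalty would be added to the task loss with a weighting coefficient; how the paper actually combines the two is not specified in the abstract.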

Problem

Research questions and friction points this paper is trying to address.

Safety Alignment

Large Language Models

Fine-tuning

Safety Drift

Harmful Responses

Innovation

Methods, ideas, or system contributions that make the work stand out.

Coupled Weight and Activation Constraints

Safety Alignment

Sparse Autoencoders