Noise Injection Systemically Degrades Large Language Model Safety Guardrails

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes a fundamental vulnerability of large language models (LLMs) under non-adversarial perturbations: injecting Gaussian noise (σ = 0.1–0.5) into hidden-layer activations increases harmful output rates by up to 27% (p < 0.001). Contrary to expectations, deep safety fine-tuning methods—including RLHF and DPO—fail to improve noise robustness. Crucially, while reasoning capabilities (e.g., chain-of-thought) remain intact, safety degrades sharply, revealing an intrinsic misalignment between safety mechanisms and internal representations. Experiments span major open-source models (Llama-3, Qwen, Phi-3) and multiple benchmarks (ToxiGen, SafeBench), confirming the phenomenon’s cross-model and cross-benchmark generality. These findings demonstrate that current safety alignment techniques lack generalizable robustness against benign distributional shifts, underscoring an urgent need for novel alignment paradigms explicitly designed for intrinsic robustness.

📝 Abstract
Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs, yet their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates (p < 0.001) by up to 27%, (2) deeper safety fine-tuning affords no extra protection, and (3) chain-of-thought reasoning remains largely intact. These findings reveal critical vulnerabilities in current safety alignment techniques and highlight reasoning-based and reinforcement learning approaches as promising directions for developing more robust AI safety systems. The results have important implications for real-world deployment of LLMs in safety-critical applications, as they imply that widely deployed safety-tuning methods can fail even without adversarial prompts.
Problem

Research questions and friction points this paper is trying to address.

Noise injection degrades LLM safety guardrails significantly
Current safety fine-tuning lacks robustness against perturbations
Existing safety alignment methods fail without adversarial prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inject Gaussian noise into model activations
Test safety fine-tuning robustness systematically
Explore reasoning-based reinforcement learning solutions
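The core perturbation described above can be sketched with PyTorch forward hooks, which add Gaussian noise to a layer's activations at inference time. This is a minimal illustration, not the paper's actual code: the toy model, layer choice, and helper names are assumptions; only the σ range (0.1–0.5) comes from the summary.

```python
import torch
import torch.nn as nn

def make_noise_hook(sigma: float):
    """Return a forward hook adding N(0, sigma^2) noise to a layer's output.

    In the paper's setting the hook would target a transformer block's
    hidden-state output; here it is attached to a toy linear layer.
    """
    def hook(module, inputs, output):
        return output + sigma * torch.randn_like(output)
    return hook

# Hypothetical stand-in for a stack of transformer blocks.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
model.eval()

# Inject noise into the first hidden layer; sigma within the reported range.
handle = model[0].register_forward_hook(make_noise_hook(sigma=0.3))

x = torch.randn(4, 8)
noisy_out = model(x)

handle.remove()          # detach the hook to restore clean behavior
clean_out = model(x)
```

In an actual experiment the hook would be registered on chosen decoder layers of an open-weight model (e.g. via `model.model.layers[i].register_forward_hook(...)` in Hugging Face Transformers, whose exact module path varies by architecture), and harmful-output rates would be compared with and without the hook attached.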