Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

📅 2026-01-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to jailbreak attacks, highlighting that existing attribution methods fail to accurately identify critical safety modules due to their neglect of inter-component interactions. The authors propose the first global optimization framework that simultaneously analyzes all attention heads, integrating harmful patching with zero-ablation strategies. This approach reveals two distinct, spatially separated safety vectors with minimal overlap: one suppressing malicious injections and the other activating safe responses, thereby uncovering a dual-path mechanism underlying model safety. Experiments demonstrate that perturbing only about 30% of attention heads can fully compromise the safety mechanism. Furthermore, the designed white-box jailbreak attack significantly outperforms existing methods across multiple mainstream LLMs, validating both the effectiveness and the novel insights of the proposed framework.

📝 Abstract
While Large Language Models (LLMs) are aligned to mitigate risks, their safety guardrails remain fragile against jailbreak attacks. This reveals limited understanding of components governing safety. Existing methods rely on local, greedy attribution that assumes independent component contributions. However, they overlook the cooperative interactions between different components in LLMs, such as attention heads, which jointly contribute to safety mechanisms. We propose Global Optimization for Safety Vector Extraction (GOSV), a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously. We employ two complementary activation repatching strategies: Harmful Patching and Zero Ablation. These strategies identify two spatially distinct sets of safety vectors with consistently low overlap, termed Malicious Injection Vectors and Safety Suppression Vectors, demonstrating that aligned LLMs maintain separate functional pathways for safety purposes. Through systematic analyses, we find that complete safety breakdown occurs when approximately 30% of total heads are repatched across all models. Building on these insights, we develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching. Our attack substantially outperforms existing white-box attacks across all test models, providing strong evidence for the effectiveness of the proposed GOSV framework on LLM safety interpretability.
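The two repatching strategies in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes per-head attention outputs are available as a NumPy array of shape `(num_heads, d_head)`, and all names (`zero_ablation`, `harmful_patching`, the toy head indices) are illustrative.

```python
import numpy as np

def zero_ablation(head_outputs, head_mask):
    """Zero out the outputs of the selected attention heads
    (the Zero Ablation strategy)."""
    patched = head_outputs.copy()
    patched[head_mask] = 0.0
    return patched

def harmful_patching(head_outputs, harmful_outputs, head_mask):
    """Replace selected head outputs with activations cached from a
    harmful prompt (the Harmful Patching strategy)."""
    patched = head_outputs.copy()
    patched[head_mask] = harmful_outputs[head_mask]
    return patched

# Toy example: 8 heads with 4-dim outputs; pretend heads 2 and 5
# were identified as safety-critical by the global optimization.
rng = np.random.default_rng(0)
benign = rng.normal(size=(8, 4))    # activations on a benign prompt
harmful = rng.normal(size=(8, 4))   # cached activations on a harmful prompt
mask = np.zeros(8, dtype=bool)
mask[[2, 5]] = True

ablated = zero_ablation(benign, mask)
patched = harmful_patching(benign, harmful, mask)
```

In practice such patches would be applied via forward hooks inside the model at inference time; the key point the sketch captures is that only the masked heads change, which is why the paper can measure safety breakdown as a function of the fraction of heads repatched.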
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
safety mechanisms
jailbreak attacks
attention heads
component interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global Optimization
Safety Vectors
Activation Repatching
Attention Heads
Jailbreak Attack