Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation

📅 2025-07-07

📈 Citations: 0

✨ Influential: 0

career value

216K/year

🤖 AI Summary

This work uncovers a critical vulnerability in large language models’ (LLMs) safety alignment mechanisms: adversaries can bypass toxicity detection by perturbing input semantic representations in the embedding space—without modifying the model or accessing training data. To exploit this, we propose ETTA, an embedding-layer adversarial attack framework that requires no fine-tuning and operates without labeled data. ETTA identifies toxicity-sensitive dimensions in the embedding space via linear transformation and attenuates them to achieve high-fidelity, high-success-rate alignment evasion. Evaluated on five mainstream open-source LLMs, ETTA achieves an average attack success rate of 88.61%, outperforming the best baseline by 11.34 percentage points. Notably, it maintains a 77.39% success rate even against models equipped with enhanced safety mitigations. This study is the first to systematically characterize the intrinsic mechanism of toxicity transfer in embedding space and expose fundamental limitations of current alignment strategies.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. However, this openness also introduces significant security risks, particularly through embedding space poisoning, which is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood. Consequently, more targeted and accurate adversarial perturbation techniques, which pose significant threats, have not been adequately studied. In this work, we propose ETTA (Embedding Transformation Toxicity Attenuation), a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs using the AdvBench benchmark, ETTA achieves a high average attack success rate of 88.61%, outperforming the best baseline by 11.34%, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and underscore the need for embedding-aware defenses.

Problem

Research questions and friction points this paper is trying to address.

Bypassing safety alignment in LLMs via embedding manipulation

Understanding embedding-level vulnerabilities in LLM safety mechanisms

Developing effective defenses against adversarial embedding perturbations

Innovation

Methods, ideas, or system contributions that make the work stand out.

ETTA framework attenuates toxicity-sensitive embedding dimensions

Linear transformations bypass safety without fine-tuning

Achieves high attack success rate on multiple LLMs

🔎 Similar Papers

Finding Safety Neurons in Large Language Models