Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the robustness of large language model (LLM) alignment against jailbreak attacks, showing for the first time that an extractable, latent safety classifier is embedded within the alignment mechanism. We propose a surrogate-classifier construction method based on sub-architecture search: by slicing the model at the layer or module level, identifying safety-critical subnetworks, and distilling their classification behavior, we localize the key safety substructures in aligned LLMs and build lightweight surrogate models. Experiments show that a surrogate classifier built from as little as 20% of the model architecture achieves an F1 score above 80% on safety classification; on Llama 2, a surrogate using 50% of the model raises the jailbreak attack success rate (via transfer) to 70%, substantially outperforming direct attacks on the LLM (22%). This work establishes a new paradigm for interpreting the internal structure of alignment mechanisms, evaluating safety vulnerabilities, and defending against jailbreak attacks.

📝 Abstract
Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we present and evaluate a method to assess the robustness of LLM alignment. We observe that alignment embeds a safety classifier in the target model that is responsible for deciding between refusal and compliance. We seek to extract an approximation of this classifier, called a surrogate classifier, from the LLM. We develop an algorithm for identifying candidate classifiers from subsets of the LLM model. We evaluate the degree to which the candidate classifiers approximate the model's embedded classifier in benign (F1 score) and adversarial (using surrogates in a white-box attack) settings. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find attacks mounted on the surrogate models can be transferred with high accuracy. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70%, a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is a viable (and highly effective) means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks.
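The transfer attack the abstract evaluates (craft a jailbreak against the extracted surrogate in a white-box setting, then replay it on the full model and measure attack success rate) can be illustrated with a toy sketch. The linear "models", the greedy perturbation loop, and the step/budget values below are all invented stand-ins, not the paper's attack.

```python
def full_refuses(x):
    # Stand-in for the aligned LLM's embedded safety decision.
    return 1.2 * x - 0.06 > 0.5

def surrogate_refuses(x):
    # Stand-in for the extracted surrogate classifier.
    return 1.5 * x > 0.5

def attack_via_surrogate(x, step=0.05, budget=20):
    # Greedy white-box attack on the surrogate: nudge the input until
    # the surrogate stops refusing (a toy proxy for gradient-based
    # jailbreak optimization against the surrogate).
    for _ in range(budget):
        if not surrogate_refuses(x):
            return x
        x -= step
    return x

# Transfer: craft adversarial inputs on the surrogate, then check how
# many also bypass the full model (attack success rate, ASR).
refused = [x / 10 for x in range(5, 11)]  # inputs the full model refuses
adversarial = [attack_via_surrogate(x) for x in refused]
asr = sum(not full_refuses(a) for a in adversarial) / len(adversarial)
print(round(asr, 2))
```

Because the toy surrogate's decision boundary sits close to the full model's, every input crafted against it transfers; the paper's 70% vs. 22% comparison is the realistic analogue of this gap between attacking the surrogate and attacking the LLM directly.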
Problem

Research questions and pain points this paper aims to address.

large language models
surrogate classifiers
jailbreak attack robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Surrogate Classifiers
Security Enhancement
Large Language Models