Robust LLM safeguarding via refusal feature adversarial training

📅 2024-09-30

🏛️ arXiv.org

📈 Citations: 4

✨ Influential: 0

career value

189K/year

🤖 AI Summary

Large language models (LLMs) are vulnerable to adversarial attacks that elicit harmful outputs, yet existing defenses suffer from opacity and high computational overhead. This work identifies and formalizes “Refusal Feature Erasure” (RFE) as a general attack mechanism: adversaries perturb safety-critical features in the residual stream—specifically those encoding refusal intent—to bypass alignment safeguards. Building on this insight, we propose ReFAT, an efficient adversarial training framework that operates via latent-space, feature-level perturbations—transforming input-level adversarial training into targeted manipulation within the refusal feature subspace. Evaluated on three mainstream LLMs, ReFAT improves average defense success rate by 27% while reducing training cost by ~65% compared to conventional adversarial training. Our approach establishes a new paradigm for robust LLM alignment: interpretable, feature-grounded, and computationally lightweight.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation of offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods.

Problem

Research questions and friction points this paper is trying to address.

Defends LLMs against adversarial attacks causing harmful responses.

Addresses opacity of jailbreaking mechanisms and high training costs.

Improves LLM robustness with efficient adversarial training algorithm.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial training via refusal feature ablation

Simulates input-level attacks efficiently

Reduces computational overhead significantly

🔎 Similar Papers

Safety Layers in Aligned Large Language Models: The Key to LLM Security