SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks

📅 2025-08-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to jailbreak attacks and prone to generating harmful or biased content, posing critical safety risks. Method: We propose the first targeted unlearning framework for jailbreak defense. It employs a hybrid dynamic detection mechanism that combines external classifiers with internal activation analysis; identifies token-level harmful generation pathways via feed-forward network (FFN) activation tracing; and enforces irreversible suppression of harmful knowledge through constrained optimization. Contribution/Results: This work pioneers the application of targeted unlearning to jailbreak defense while preserving general capabilities. Experiments on Vicuna, LLaMA, and GPT-J demonstrate substantial reductions in attack success rates across diverse jailbreak attacks. Moreover, our method maintains general performance better than fine-tuning and preference-optimization baselines, achieving enhanced robustness and practicality without compromising utility.
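The hybrid detection stage lends itself to a quick illustration. The sketch below is an assumption-laden toy rather than the authors' implementation: it fuses an external harmfulness score (from any off-the-shelf safety classifier) with a score from a linear probe over the model's hidden activations. The `InternalProbe` module, mean-pooling, weighting, and threshold are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

class InternalProbe(nn.Module):
    """Linear probe that maps a pooled hidden state to P(harmful). Illustrative only."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (seq_len, hidden_size); mean-pool over tokens, then score.
        pooled = hidden_states.mean(dim=0)
        return torch.sigmoid(self.head(pooled))

def is_unsafe(external_score: float, hidden_states: torch.Tensor,
              probe: InternalProbe, weight: float = 0.5, threshold: float = 0.6) -> bool:
    """Fuse the external classifier score with the internal activation score (assumed rule)."""
    internal_score = probe(hidden_states).item()
    fused = weight * external_score + (1.0 - weight) * internal_score
    return fused >= threshold

# Toy usage: random activations stand in for the LLM's real hidden states.
probe = InternalProbe(hidden_size=4096)
fake_hidden = torch.randn(32, 4096)  # (seq_len, hidden_size)
print(is_unsafe(external_score=0.9, hidden_states=fake_hidden, probe=probe))
```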

📝 Abstract
Jailbreak attacks pose a serious threat to the safety of Large Language Models (LLMs) by crafting adversarial prompts that bypass alignment mechanisms, causing the models to produce harmful, restricted, or biased content. In this paper, we propose SafeLLM, a novel unlearning-based defense framework that unlearns harmful knowledge from LLMs while preserving linguistic fluency and general capabilities. SafeLLM employs a three-stage pipeline: (1) dynamic unsafe output detection using a hybrid approach that integrates external classifiers with model-internal evaluations; (2) token-level harmful content tracing through feedforward network (FFN) activations to localize harmful knowledge; and (3) constrained optimization to suppress unsafe behavior without degrading overall model quality. SafeLLM achieves targeted and irreversible forgetting by identifying and neutralizing FFN substructures responsible for harmful generation pathways. Extensive experiments on prominent LLMs (Vicuna, LLaMA, and GPT-J) across multiple jailbreak benchmarks show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance. Compared to standard defense methods such as supervised fine-tuning and direct preference optimization, SafeLLM offers stronger safety guarantees, more precise control over harmful behavior, and greater robustness to unseen attacks. Moreover, SafeLLM maintains its general performance after the harmful knowledge has been unlearned. These results highlight unlearning as a promising direction for scalable and effective LLM safety.
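Stage (2), FFN activation tracing, can be sketched concretely. The toy below is a guess at the general recipe rather than the paper's exact localization criterion: it hooks the FFN's post-activation values, compares mean activation magnitudes on tokens flagged as harmful versus benign tokens, and returns the most harm-associated units. The `ToyFFN` module, scoring rule, and top-k selection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyFFN(nn.Module):
    """Stand-in for one transformer FFN block (up-projection, activation, down-projection)."""
    def __init__(self, d_model: int = 64, d_ff: int = 256):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.act = nn.GELU()
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(self.act(self.up(x)))

def trace_ffn_units(ffn: ToyFFN, hidden: torch.Tensor, harmful_mask: torch.Tensor,
                    top_k: int = 8) -> torch.Tensor:
    """Return indices of FFN units most associated with harmful tokens (assumed criterion).

    hidden: (seq_len, d_model) residual-stream inputs to the FFN.
    harmful_mask: (seq_len,) bool, True where the generated token was flagged harmful.
    """
    captured = {}
    def hook(_module, _inputs, output):
        captured["acts"] = output.detach()           # (seq_len, d_ff) post-activation values
    handle = ffn.act.register_forward_hook(hook)
    ffn(hidden)
    handle.remove()

    acts = captured["acts"].abs()
    harmful_mean = acts[harmful_mask].mean(dim=0)    # mean activation on harmful tokens
    benign_mean = acts[~harmful_mask].mean(dim=0)    # mean activation on benign tokens
    score = harmful_mean - benign_mean               # higher = more harm-specific unit
    return torch.topk(score, k=top_k).indices

# Toy usage: random states, with the last 4 tokens marked as harmful.
ffn = ToyFFN()
hidden = torch.randn(16, 64)
mask = torch.zeros(16, dtype=torch.bool)
mask[-4:] = True
print(trace_ffn_units(ffn, hidden, mask))
```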
Problem

Research questions and friction points this paper is trying to address.

Defending LLMs against jailbreak attacks producing harmful outputs
Unlearning harmful knowledge while preserving model capabilities
Localizing and neutralizing harmful generation pathways in models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unlearning-based defense framework against jailbreak attacks
Hybrid detection with external classifiers and internal evaluations
Constrained optimization to suppress unsafe behavior without degrading utility (see the sketch after this list)
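As a rough illustration of the constrained-optimization idea, the toy objective below pairs a "forget" term (gradient ascent on the likelihood of flagged harmful continuations) with a soft retention constraint (KL divergence to a frozen reference model on benign data). This is an assumed formulation: the paper may enforce the constraint differently, for example by restricting updates to the localized FFN substructures, and the function names and λ weighting are hypothetical.

```python
import torch
import torch.nn.functional as F

def unlearning_loss(harmful_logits, harmful_labels,
                    benign_logits, benign_ref_logits,
                    lam: float = 1.0) -> torch.Tensor:
    """Combined objective: forget harmful targets, stay close to the reference on benign data."""
    vocab = harmful_logits.size(-1)
    # "Forget" term: negated cross-entropy, so minimizing it lowers the likelihood
    # of the flagged harmful continuations (gradient ascent on their NLL).
    forget = -F.cross_entropy(harmful_logits.view(-1, vocab), harmful_labels.view(-1))
    # Soft retention constraint: KL between the current model and a frozen
    # reference model on benign inputs, to preserve general capabilities.
    retain = F.kl_div(F.log_softmax(benign_logits, dim=-1),
                      F.softmax(benign_ref_logits, dim=-1),
                      reduction="batchmean")
    return forget + lam * retain

# Toy shapes: batch of 2 sequences, length 5, vocabulary of 100 tokens.
V = 100
h_logits = torch.randn(2, 5, V)
b_logits = torch.randn(2, 5, V)
b_ref_logits = torch.randn(2, 5, V)
h_labels = torch.randint(0, V, (2, 5))
print(unlearning_loss(h_logits, h_labels, b_logits, b_ref_logits).item())
```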