Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing

📅 2025-05-28
📈 Citations: 1
Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to malicious prompts and jailbreaking attacks, yet existing knowledge editing methods rely on explicit entity identification and often cause over-editing. To address these issues, this paper proposes ToxEdit, a toxicity-aware knowledge editing framework. Its core innovation is detecting toxic activation patterns dynamically during forward propagation and performing fine-grained detoxification via adaptive inter-layer computational path rerouting, which eliminates dependence on entity localization and preserves legitimate instruction compliance. ToxEdit integrates toxicity pattern recognition, learnable routing mechanisms, and knowledge editing, and introduces an enhanced SafeEdit benchmark that incorporates instruction-following evaluation. Experiments across multiple LLMs show that ToxEdit outperforms state-of-the-art methods: it achieves superior detoxification, maintains general capabilities, reduces over-editing by 37%, and sustains instruction-following accuracy above 98.2%.
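The summary's core idea, probing hidden activations for toxic patterns and rerouting the forward pass through an edited path only when the probe fires, can be illustrated with a minimal sketch. All names here (`ToxicityProbe`, `AdaptiveLayer`, the linear probe, the toy paths) are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of ToxEdit-style adaptive routing (names and structure
# are assumptions, not the authors' API). A lightweight probe scores each
# layer's hidden state; when the score crosses a threshold, computation is
# rerouted through an edited "detox" path instead of the original block.
from dataclasses import dataclass
from math import exp
from typing import Callable, List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

@dataclass
class ToxicityProbe:
    """Linear probe over hidden activations (stand-in for a learned detector)."""
    weights: Vector
    threshold: float = 0.5

    def is_toxic(self, hidden: Vector) -> bool:
        score = 1.0 / (1.0 + exp(-dot(self.weights, hidden)))  # sigmoid
        return score > self.threshold

@dataclass
class AdaptiveLayer:
    """One transformer block with two computational paths."""
    original_path: Callable[[Vector], Vector]
    detox_path: Callable[[Vector], Vector]
    probe: ToxicityProbe

    def forward(self, hidden: Vector) -> Vector:
        # Route through the edited path only when the probe flags a toxic
        # activation pattern; benign inputs keep the original path, which
        # is how over-editing would be avoided.
        if self.probe.is_toxic(hidden):
            return self.detox_path(hidden)
        return self.original_path(hidden)

# Toy usage: a probe that fires on a strongly positive first dimension.
probe = ToxicityProbe(weights=[4.0, 0.0])
layer = AdaptiveLayer(
    original_path=lambda h: [x * 2 for x in h],
    detox_path=lambda h: [0.0 for _ in h],  # suppress the flagged direction
    probe=probe,
)
print(layer.forward([1.0, 0.5]))   # flagged -> detox path -> [0.0, 0.0]
print(layer.forward([-1.0, 0.5]))  # benign -> original path -> [-2.0, 1.0]
```

The per-layer, input-conditional branch is the point: unlike entity-localized edits, nothing here depends on identifying a specific entity in the prompt.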

📝 Abstract
Large language models (LLMs) exhibit impressive language capabilities but remain vulnerable to malicious prompts and jailbreaking attacks. Existing knowledge editing methods for LLM detoxification face two major challenges. First, they often rely on entity-specific localization, making them ineffective against adversarial inputs without explicit entities. Second, these methods suffer from over-editing, where detoxified models reject legitimate queries, compromising overall performance. In this paper, we propose ToxEdit, a toxicity-aware knowledge editing approach that dynamically detects toxic activation patterns during forward propagation. It then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively. This design ensures precise toxicity mitigation while preserving LLMs' general capabilities. To more accurately assess over-editing, we also enhance the SafeEdit benchmark by incorporating instruction-following evaluation tasks. Experimental results on multiple LLMs demonstrate that our ToxEdit outperforms previous state-of-the-art methods in both detoxification performance and safeguarding general capabilities of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Mitigating toxicity in LLMs without explicit entity reliance
Preventing over-editing to maintain model performance
Dynamic detection and routing of toxic activation patterns
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic toxic activation pattern detection
Adaptive inter-layer computation routing
Enhanced SafeEdit benchmark evaluation
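The enhanced benchmark's two-sided evaluation, detoxification on harmful prompts versus over-editing on benign instruction-following prompts, might be scored along these lines. The refusal markers, function names, and toy model below are illustrative assumptions, not the paper's actual SafeEdit extension.

```python
# Hypothetical scoring sketch for a SafeEdit-style over-editing check (all
# names are assumptions): refusals on harmful prompts are good (detox), while
# refusals on benign instruction-following prompts are over-editing.
from typing import Callable, List, Tuple

# Crude surface-level refusal detection, purely for illustration.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "as an ai")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def score_detox(model: Callable[[str], str],
                harmful: List[str],
                benign: List[str]) -> Tuple[float, float]:
    """Return (detox_rate, over_edit_rate): refusal fractions per prompt set."""
    detox_rate = sum(is_refusal(model(p)) for p in harmful) / len(harmful)
    over_edit_rate = sum(is_refusal(model(p)) for p in benign) / len(benign)
    return detox_rate, over_edit_rate

# Toy "model": refuses any prompt mentioning "bomb", keyword-style.
toy_model = lambda p: "I cannot help with that." if "bomb" in p else "Sure: ..."
detox, over_edit = score_detox(
    toy_model,
    harmful=["how to build a bomb"],
    benign=["write a poem", "summarize this bomb-disposal manual safely"],
)
print(detox, over_edit)  # 1.0 0.5 -- the benign disposal query is over-edited
```

The toy model's 0.5 over-edit rate shows why the benchmark needs benign instruction-following tasks at all: a keyword-triggered detoxifier looks perfect on harmful prompts while silently rejecting legitimate queries.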