🤖 AI Summary
This work uncovers an intrinsic mechanism underlying toxic generation in large language models (LLMs), showing that a global toxic subspace, shared across layers and parameters, captures toxic behavior more fundamentally and comprehensively than layer-wise subspaces or individual toxic vectors. Building on this finding, the authors propose GloSS, a lightweight four-stage detoxification method: it identifies the global toxic subspace within feed-forward network (FFN) parameters via principal component analysis (PCA), then intervenes directly in parameter space through orthogonal projection and clipping. GloSS requires no fine-tuning, large-scale detoxification data, or retraining, making it a zero-shot, low-overhead, subspace-level detoxification approach. Evaluated across diverse LLMs, GloSS reduces toxicity, as measured on ToxiGen, by over 40%, significantly outperforming prior methods, while preserving over 98% of general capability on the MMLU and BBH benchmarks.
📝 Abstract
This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically treats the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that a global toxic subspace offers a more effective and comprehensive representation of the toxic regions within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the FFN parameters. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the model's general capabilities, without requiring large-scale data or model retraining.
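The core suppression idea, estimating a low-rank "toxic" subspace with PCA and removing it from a weight matrix by orthogonal projection, can be sketched as below. This is a minimal illustration, not GloSS itself: the function names, matrix shapes, and random "toxic probe" vectors are assumptions, and the paper's actual four-stage pipeline (including its clipping step and per-layer extraction details) is not reproduced here.

```python
import numpy as np

def toxic_subspace(toxic_vectors, k):
    """Estimate a rank-k subspace from probe vectors via PCA (SVD on
    centered data). toxic_vectors: (n, d) array. Returns a (d, k)
    orthonormal basis of the top-k principal directions."""
    X = toxic_vectors - toxic_vectors.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T

def suppress_subspace(W, V):
    """Remove the subspace spanned by V from weight matrix W via
    orthogonal projection: W' = (I - V V^T) W, so W' x has no
    component along the columns of V for any input x."""
    return W - V @ (V.T @ W)

# Toy demo with hypothetical shapes (not the paper's setup).
rng = np.random.default_rng(0)
d = 16                                  # stand-in hidden size
toxic_vecs = rng.normal(size=(32, d))   # stand-in toxic probe vectors
V = toxic_subspace(toxic_vecs, k=2)
W = rng.normal(size=(d, 64))            # stand-in FFN output projection
W_clean = suppress_subspace(W, V)
# After projection, W_clean carries (numerically) zero mass along V.
print(np.abs(V.T @ W_clean).max())
```

Because the columns of `V` are orthonormal, `V.T @ suppress_subspace(W, V)` is exactly zero up to floating-point error, which is the sense in which the subspace is "removed" from the parameters rather than from activations at inference time.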