🤖 AI Summary
This work uncovers an intrinsic mechanism underlying toxic generation in large language models (LLMs), showing that a global toxic subspace, shared across layers and parameters, captures toxic behavior more fundamentally and comprehensively than layer-wise subspaces or individual toxic vectors. Building on this finding, the authors propose GloSS, a lightweight four-stage detoxification method: it identifies the global toxic subspace within feed-forward network (FFN) parameters via principal component analysis (PCA), then intervenes directly in parameter space through orthogonal projection and clipping. GloSS requires no fine-tuning, large-scale detoxification data, or retraining, making it a zero-shot, low-overhead, subspace-level detoxification approach. Evaluated across diverse LLMs, GloSS reduces toxicity, as measured on ToxiGen, by over 40%, significantly outperforming prior methods, while preserving over 98% of general capability on the MMLU and BBH benchmarks.
📝 Abstract
This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically treats the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that a global toxic subspace offers a more effective and comprehensive representation of the toxic regions within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the FFN parameters. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the model's general capabilities, without requiring large-scale data or model retraining.
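The core suppression idea, estimating a low-rank "toxic" subspace with PCA and removing it from a weight matrix by orthogonal projection, can be sketched as below. This is a minimal illustration, not GloSS itself: the function names, matrix shapes, and random "toxic probe" vectors are assumptions, and the paper's actual four-stage pipeline (including its clipping step and per-layer extraction details) is not reproduced here.

```python
import numpy as np

def toxic_subspace(toxic_vectors, k):
    """Estimate a rank-k subspace from probe vectors via PCA (SVD on
    centered data). toxic_vectors: (n, d) array. Returns a (d, k)
    orthonormal basis of the top-k principal directions."""
    X = toxic_vectors - toxic_vectors.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T

def suppress_subspace(W, V):
    """Remove the subspace spanned by V from weight matrix W via
    orthogonal projection: W' = (I - V V^T) W, so W' x has no
    component along the columns of V for any input x."""
    return W - V @ (V.T @ W)

# Toy demo with hypothetical shapes (not the paper's setup).
rng = np.random.default_rng(0)
d = 16                                  # stand-in hidden size
toxic_vecs = rng.normal(size=(32, d))   # stand-in toxic probe vectors
V = toxic_subspace(toxic_vecs, k=2)
W = rng.normal(size=(d, 64))            # stand-in FFN output projection
W_clean = suppress_subspace(W, V)
# After projection, W_clean carries (numerically) zero mass along V.
print(np.abs(V.T @ W_clean).max())
```

Because the columns of `V` are orthonormal, `V.T @ suppress_subspace(W, V)` is exactly zero up to floating-point error, which is the sense in which the subspace is "removed" from the parameters rather than from activations at inference time.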