GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace

📅 2025-05-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work uncovers the intrinsic mechanism underlying toxic generation in large language models (LLMs), revealing that a global toxic subspace—spanning layers and parameters—is more fundamental and comprehensive than layer-wise or vector-level representations. To address this, we propose GloSS, a lightweight, four-stage detoxification method: it identifies the global toxic subspace within feed-forward network (FFN) parameters via principal component analysis (PCA), then performs a parameter-space intervention through orthogonal projection and clipping. GloSS requires no fine-tuning, large-scale detoxification data, or retraining, establishing a zero-shot, low-overhead, subspace-level detoxification paradigm. Evaluated across diverse LLMs, GloSS reduces toxicity detection accuracy on ToxiGen by over 40%—significantly outperforming prior methods—while preserving over 98% of general capabilities, as measured by the MMLU and BBH benchmarks.
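The core mechanics described above are: pool candidate toxic directions from the FFN parameters, estimate a low-rank global subspace with PCA, and project that subspace out of the weights. The sketch below illustrates this idea in NumPy; the function names, the rank k, and the assumption that the relevant FFN weight rows live in hidden space are illustrative assumptions, not the paper's exact four-stage procedure (which also includes a clipping step).

```python
import numpy as np

def estimate_toxic_subspace(toxic_vectors: np.ndarray, k: int = 8) -> np.ndarray:
    """Estimate a rank-k 'global toxic subspace' from candidate toxic directions.

    toxic_vectors: (n, hidden_dim) array of hidden-space directions associated
    with toxic generation (e.g., FFN value vectors that promote toxic tokens),
    pooled across all layers. Returns an orthonormal basis U of shape (hidden_dim, k).
    """
    centered = toxic_vectors - toxic_vectors.mean(axis=0, keepdims=True)
    # PCA via SVD: the leading right singular vectors are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T  # (hidden_dim, k); rows of vt are orthonormal

def suppress_subspace(W: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Remove the component of each row of W that lies in span(U).

    W: an FFN weight matrix whose rows live in hidden space,
       e.g. a down-projection of shape (d_ff, hidden_dim).
    """
    P = np.eye(U.shape[0]) - U @ U.T  # projector onto the orthogonal complement of span(U)
    return W @ P
```

After this edit, any output the FFN writes along the estimated toxic directions is zeroed out, while components orthogonal to the subspace are left untouched, which is why general capabilities are largely preserved.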

📝 Abstract
This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically considers the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that the global toxic subspace offers a more effective and comprehensive representation of toxic regions within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the FFN parameters. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the model's general capabilities, without requiring large-scale data or model retraining.
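To make the parameter-space intervention concrete, here is a minimal sketch of how such a projection could be applied to a Hugging Face causal LM. The module path `model.model.layers[i].mlp.down_proj` assumes a Llama-style architecture (other models expose the FFN output matrix under different names), and this is a simplified stand-in for the full GloSS pipeline, not the authors' released implementation.

```python
import torch
from transformers import AutoModelForCausalLM

def apply_subspace_suppression(model_name: str, U: torch.Tensor):
    """Edit every FFN down-projection so its output has no component in span(U).

    U: (hidden_dim, k) orthonormal basis of the estimated global toxic subspace.
    Assumes a Llama-style module layout: model.model.layers[i].mlp.down_proj.
    """
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
    hidden_dim = U.shape[0]
    # Projector onto the orthogonal complement of the toxic subspace.
    P = torch.eye(hidden_dim, dtype=torch.float32) - U @ U.T
    with torch.no_grad():
        for layer in model.model.layers:
            W = layer.mlp.down_proj.weight      # (hidden_dim, d_ff): columns live in hidden space
            W.copy_(P.to(W.dtype) @ W)          # remove the toxic component from the FFN output
    return model

# Example usage (model name is illustrative):
# detoxified = apply_subspace_suppression("meta-llama/Llama-2-7b-hf", U)
```

Because the edit is a one-time linear transformation of existing weights, it adds no inference-time overhead and requires no gradient updates, which is consistent with the "no retraining" claim in the abstract.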
Problem

Research questions and friction points this paper is trying to address.

Investigates toxicity generation mechanisms in LLMs
Proposes global toxic subspace for detoxification
Introduces GloSS method to suppress toxicity effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies global toxic subspace in LLMs
Proposes lightweight four-stage GloSS method
Removes toxic subspace without retraining
Authors

Zenghao Duan
CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS
Zhiyi Yin
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Zhichao Shi
School of Advanced Interdisciplinary; Institute of Computing Technology, Chinese Academy of Sciences
Liang Pang
Associate Professor, Institute of Computing Technology, Chinese Academy of Sciences
Shaoling Jing
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Jiayi Wu
Dalian University of Technology, Liaoning, China
Yu Yan
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Huawei Shen
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xueqi Cheng
Ph.D. student, Florida State University