🤖 AI Summary
Large language models (LLMs) often generate harmful content, posing serious risks to AI safety and public trust. Existing neuron-level intervention methods suffer from poor stability, strong context dependency, and unintended degradation of linguistic capability. To address these limitations, we propose EigenShift, a fine-tuning-free, low-overhead, layer-wise intervention framework. EigenShift decouples the toxicity generation mechanism via inter-layer feature aggregation and output-layer feature decomposition, separating toxicity detection and toxicity generation into distinct expert modules and enabling structured intervention through eigenvalue decomposition and generation-alignment analysis. Evaluated on the Jigsaw and ToxiCN benchmarks, EigenShift achieves robust suppression of toxic outputs while preserving language-modeling performance. The method offers strong interpretability, cross-dataset generalization, and deployment efficiency, making it a practical, principled solution for safe LLM inference.
📝 Abstract
Large Language Models have demonstrated impressive fluency across diverse tasks, yet their tendency to produce toxic content remains a critical challenge for AI safety and public trust. Existing toxicity mitigation approaches primarily manipulate individual neuron activations, but such methods suffer from instability and context dependence, and they often compromise the model's core language abilities. To address these shortcomings, we investigate three questions: the stability of neuron-level toxicity indicators, the advantages of structural (layer-wise) representations, and the interpretability of the mechanisms driving toxic generation. Through extensive experiments on the Jigsaw and ToxiCN datasets, we show that aggregated layer-wise features provide more robust signals than single neurons. We also identify a conceptual limitation in prior work: neuron-based interventions conflate toxicity detection experts with toxicity generation experts. To address this, we propose EigenShift, a principled intervention technique based on eigen-decomposition of the language model's final output layer. By selectively targeting generation-aligned components, it enables precise toxicity suppression without impairing linguistic competence. Our method requires no additional training or fine-tuning, incurs minimal computational cost, and is grounded in rigorous theoretical analysis.
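The core idea described above (decomposing the final output projection and attenuating its generation-aligned spectral components) can be illustrated with a minimal NumPy sketch. Everything here is an illustrative assumption: the toy weight shapes, the hypothetical `toxic_dir` vector, the number of attenuated components `k`, and the damping factor are not from the paper, and the paper's actual eigenvalue decomposition and generation-alignment analysis may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (shapes are illustrative, not from the paper):
# W is the final output (unembedding) projection, vocab_size x hidden_dim.
vocab_size, hidden_dim = 50, 16
W = rng.standard_normal((vocab_size, hidden_dim))

# Hypothetical "toxicity direction" in hidden space, e.g. estimated
# offline from aggregated layer-wise features on labeled data.
toxic_dir = rng.standard_normal(hidden_dim)
toxic_dir /= np.linalg.norm(toxic_dir)

# Spectral decomposition of the output layer. W is rectangular, so this
# sketch uses SVD; the right singular vectors span the hidden space.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Score each spectral component by its alignment with the toxic direction.
alignment = np.abs(Vt @ toxic_dir)

# Attenuate only the k most generation-aligned components (k and the
# damping factor 0.1 are arbitrary choices for this sketch).
k = 2
scale = np.ones_like(S)
scale[np.argsort(-alignment)[:k]] = 0.1
W_edited = U @ np.diag(S * scale) @ Vt

# The edited projection replaces W at inference time; no fine-tuning.
# Logits along the toxic direction shrink, other directions are untouched.
print(np.linalg.norm(W @ toxic_dir), np.linalg.norm(W_edited @ toxic_dir))
```

Because every singular value is scaled by a factor of at most one, the edit can only shrink the output energy along the targeted components, which is one way to make the "suppression without impairing linguistic competence" claim concrete.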