🤖 AI Summary
Existing toxicity mitigation methods predominantly rely on a single probe vector, which limits fine-grained identification and intervention across diverse toxicity categories such as bias, insult, and illegality. To address this, we propose a domain-adaptive, category-specific toxicity probing and intervention framework: it trains linear probes to derive subcategory-specific probe vectors, then combines dynamic context matching, vector-space projection-based intervention, and real-time gradient correction to achieve context-aware, targeted toxicity suppression. This approach overcomes the fundamental limitation of single-vector representations in capturing multidimensional toxicity, enabling the first fine-grained, configurable toxicity control. Evaluated on standard benchmarks, the method achieves up to a 78.52% reduction in toxicity while leaving fluency nearly unchanged (a drop of only 0.052%), significantly outperforming existing baselines in both efficacy and fidelity.
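The probe-training step can be pictured as fitting one linear classifier per toxicity subcategory on model hidden states and keeping its weight direction as the probe vector. The sketch below is a minimal illustration under that reading; the logistic-regression choice, the function name, and the data layout are assumptions, not the paper's actual implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_category_probes(hidden_states, labels_by_category):
    """Fit one linear probe per toxicity subcategory (illustrative sketch).

    hidden_states: (n_samples, d_model) array of LM activations.
    labels_by_category: dict mapping a category name (e.g. "insult")
        to a binary (n_samples,) label array marking toxic examples.
    Returns a dict of unit-norm probe vectors, one per category.
    """
    probes = {}
    for category, labels in labels_by_category.items():
        # A logistic-regression weight vector is one common choice of
        # linear probe; the paper may use a different linear classifier.
        clf = LogisticRegression(max_iter=1000).fit(hidden_states, labels)
        w = clf.coef_[0]                           # direction separating toxic from non-toxic
        probes[category] = w / np.linalg.norm(w)   # normalize to a unit probe vector
    return probes
```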
📝 Abstract
There have been attempts to utilize linear probes for detoxification, with existing studies relying on a single toxicity probe vector to reduce toxicity. However, toxicity can be divided into fine-grained subcategories, making it difficult to remove certain types of toxicity with a single toxicity probe vector. To address this limitation, we propose a category-specific toxicity probe vector approach. First, we train multiple toxicity probe vectors for different toxicity categories. During generation, we dynamically select the most relevant toxicity probe vector based on the current context. Finally, the selected vector is dynamically scaled and subtracted from the model's hidden states. Our method successfully mitigates toxicity in categories that the single-probe-vector approach fails to detoxify. Experiments demonstrate that our approach achieves up to a 78.52% reduction in toxicity on the evaluation dataset, while fluency remains nearly unchanged, with only a 0.052% drop compared to the unsteered model.
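As a rough illustration of the generation-time step, the following sketch selects the probe vector most aligned with the current hidden state and subtracts its projection, scaled by the alignment strength. All names, the dot-product selection rule, and the threshold/scaling scheme here are assumptions for illustration; the paper's actual matching and scaling rules may differ.

```python
import numpy as np

def steer_hidden_state(h, probes, alpha=1.0, threshold=0.0):
    """Select the most context-relevant probe and subtract its projection.

    h: (d_model,) hidden state at the current generation step.
    probes: dict of unit-norm category probe vectors (see training sketch).
    alpha: global steering strength (hyperparameter, assumed here).
    threshold: minimum alignment required before intervening at all.
    """
    # Dynamic context matching: score each category by alignment with h.
    scores = {c: float(h @ v) for c, v in probes.items()}
    category = max(scores, key=scores.get)
    score = scores[category]
    if score <= threshold:   # context looks non-toxic for every category; leave h untouched
        return h, None
    # Dynamic scaling: remove the component of h along the selected probe,
    # scaled by both the alignment strength and the global coefficient.
    v = probes[category]
    return h - alpha * score * v, category
```

Gating the intervention on an alignment threshold is one simple way to keep the edit targeted, so that steps with no detected toxicity leave the hidden state, and hence fluency, untouched.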