DAPI: Domain Adaptive Toxicity Probe Vector Intervention for Fine-Grained Detoxification

📅 2025-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing toxicity mitigation methods predominantly rely on a single probe vector, which limits fine-grained identification and intervention across diverse toxicity categories such as bias, insult, and illegality. To address this, we propose a domain-adaptive, category-specific toxicity probing and intervention framework: it trains linear probes to derive subcategory-specific probe vectors, then integrates dynamic context matching, vector-space projection-based intervention, and real-time gradient correction to achieve context-aware, targeted toxicity suppression. This approach overcomes the limitation of single-vector representations in capturing multidimensional toxicity, enabling fine-grained, configurable toxicity control. Evaluated on standard benchmarks, our method achieves up to a 78.52% toxicity reduction while degrading language fluency by only 0.052%. It significantly outperforms existing baselines in both efficacy and fidelity.

📝 Abstract
Prior work has used linear probes for detoxification, relying on a single toxicity probe vector to reduce toxicity. However, toxicity spans multiple fine-grained subcategories, and a single toxicity probe vector struggles to remove some of them. To address this limitation, we propose a category-specific toxicity probe vector approach. First, we train multiple toxicity probe vectors for different toxicity categories. During generation, we dynamically select the most relevant toxicity probe vector based on the current context. Finally, the selected vector is dynamically scaled and subtracted from the model. Our method mitigates toxicity in categories that the single probe vector approach fails to detoxify. Experiments demonstrate that our approach achieves up to a 78.52% reduction in toxicity on the evaluation dataset, while fluency remains nearly unchanged, dropping by only 0.052% compared to the unsteered model.
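The first step in the abstract, training one probe vector per toxicity category, might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the probe is assumed to be a logistic-regression classifier, its weight vector is taken as the probe vector, and the two-dimensional "hidden states", the category names, and the helpers `train_probe` and `train_category_probes` are all hypothetical.

```python
import math


def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe by gradient descent.

    Returns the weight vector, used here as the probe vector (assumption).
    """
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of the category
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


def train_category_probes(features, category_labels):
    """Train one probe per category from 0/1 labels aligned with `features`."""
    return {cat: train_probe(features, labels)
            for cat, labels in category_labels.items()}


# Toy stand-ins for hidden states (hypothetical data, two dimensions only).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 0.0], [0.1, 0.1]]
category_labels = {
    "insult": [1, 1, 0, 0, 0, 0],
    "bias": [0, 0, 1, 1, 0, 0],
}
probes = train_category_probes(features, category_labels)
```

On this toy data each probe ends up weighting the dimension that separates its own category most heavily, which is the property the later selection step relies on.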
Problem

Research questions and friction points this paper is trying to address.

Fine-grained detoxification using multiple toxicity probe vectors
Dynamic selection and scaling of relevant toxicity probe vectors
Reduction of specific toxicity categories without losing text fluency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple category-specific toxicity probe vectors
Dynamic selection of relevant probe vectors
Dynamic scaling and subtraction from model
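The selection and scaling steps listed above can be sketched as follows. The source does not specify the selection or scaling rules, so this sketch assumes both: the probe with the largest projection onto the current hidden state is selected, and the subtracted amount is proportional to that projection. The function `steer`, the strength parameter `alpha`, and the example vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steer(hidden, probes, alpha=1.0):
    """Select the most relevant probe and subtract a scaled copy of it.

    hidden: hidden state at one position (list of floats)
    probes: dict mapping category name -> probe vector of the same length
    alpha:  hypothetical strength; 1.0 removes the full projection
    """
    units = {name: normalize(v) for name, v in probes.items()}
    # Assumed selection rule: largest projection of the hidden state onto a unit probe.
    scores = {name: dot(hidden, u) for name, u in units.items()}
    best = max(scores, key=scores.get)
    # Assumed scaling rule: proportional to the positive projection, so a state
    # that does not point along the probe direction is left unchanged.
    scale = alpha * max(0.0, scores[best])
    steered = [h - scale * u for h, u in zip(hidden, units[best])]
    return best, scale, steered


# Hypothetical probe vectors and hidden state.
example_probes = {"insult": [1.0, 0.0, 0.0], "bias": [0.0, 2.0, 0.0]}
category, scale, steered = steer([0.2, 3.0, 1.0], example_probes)
```

Here the hidden state projects most strongly onto the "bias" probe, so only that component is removed and the other coordinates are untouched. In a real model the same operation would be applied to activations during generation, with the layer and strength chosen empirically.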