DAPI: Domain Adaptive Toxicity Probe Vector Intervention for Fine-Grained Detoxification

📅 2025-03-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing toxicity mitigation methods predominantly rely on a single probe vector, which limits fine-grained identification and intervention across diverse toxicity categories such as bias, insult, and illegality. To address this, we propose a domain-adaptive, category-specific toxicity probing and intervention framework: it trains linear probes to derive subcategory-specific probe vectors, then integrates dynamic context matching, vector-space projection-based intervention, and real-time gradient correction to achieve context-aware, targeted toxicity suppression. This approach addresses the limitation of single-vector representations in capturing multidimensional toxicity and enables fine-grained, configurable toxicity control. On the evaluation dataset, our method achieves up to a 78.52% toxicity reduction while fluency drops by only 0.052% compared to the unsteered model. It significantly outperforms existing baselines in both efficacy and fidelity.

📝 Abstract
Prior work has used linear probes for detoxification, with existing studies relying on a single toxicity probe vector to reduce toxicity. However, toxicity can be divided into fine-grained subcategories, which makes it difficult to remove certain types of toxicity with a single toxicity probe vector. To address this limitation, we propose a category-specific toxicity probe vector approach. First, we train multiple toxicity probe vectors for different toxicity categories. During generation, we dynamically select the most relevant toxicity probe vector based on the current context. Finally, the selected vector is dynamically scaled and subtracted from the model. Our method mitigates toxicity in categories that the single probe vector approach failed to detoxify. Experiments demonstrate that our approach achieves up to a 78.52% reduction in toxicity on the evaluation dataset, while fluency remains nearly unchanged, with only a 0.052% drop compared to the unsteered model.
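As an illustration of the first step (training one probe vector per toxicity category), here is a minimal sketch. It is not the authors' implementation: the abstract does not say how the probes are trained, so this assumes each probe is a logistic-regression classifier over hidden states and that its normalized weight vector serves as the probe vector. The category names, the synthetic "hidden states", and the helper `train_probe` are all hypothetical.

```python
import numpy as np

def train_probe(hidden, labels, lr=0.5, steps=300):
    """Fit a logistic-regression probe by gradient descent.

    Returns the unit-norm weight vector, used here as the probe vector
    (an assumption; the abstract does not specify the training recipe).
    """
    n, d = hidden.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))
        grad = probs - labels              # gradient of the log loss w.r.t. the logits
        w -= lr * (hidden.T @ grad) / n
        b -= lr * grad.mean()
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
d = 16                                     # toy hidden size
categories = ["insult", "bias"]            # hypothetical category names
# Synthetic stand-in for real activations: each category shifts one axis.
true_dirs = {c: np.eye(d)[i] for i, c in enumerate(categories)}

probes = {}
for c in categories:
    labels = rng.integers(0, 2, size=400).astype(float)   # 1 = toxic in category c
    hidden = rng.normal(size=(400, d)) + 3.0 * labels[:, None] * true_dirs[c]
    probes[c] = train_probe(hidden, labels)

for c in categories:
    print(c, "cosine with planted direction:", float(probes[c] @ true_dirs[c]))
```

In a real setting the synthetic arrays would be replaced by hidden states extracted from the language model on examples labeled per toxicity category.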
Problem

Research questions and friction points this paper is trying to address.

Fine-grained detoxification using multiple toxicity probe vectors
Dynamic selection and scaling of relevant toxicity probe vectors
Reduction of specific toxicity categories without losing text fluency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple category-specific toxicity probe vectors
Dynamic selection of relevant probe vectors
Dynamic scaling and subtraction from the model
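The selection, scaling, and subtraction steps listed above can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's algorithm: it assumes the "most relevant" probe is the one with the largest dot product against the current hidden state, and that the scale is proportional to that projection. The function `steer`, the parameter `alpha`, and the toy vectors are hypothetical.

```python
import numpy as np

def steer(hidden, probes, alpha=1.0):
    """Select the best-matching probe and subtract a scaled copy of it.

    hidden : 1-D hidden state for the current generation step
    probes : dict mapping category name -> unit-norm probe vector
    alpha  : hypothetical strength knob (1.0 removes the full projection)
    """
    names = list(probes)
    scores = np.array([hidden @ probes[n] for n in names])
    best = int(np.argmax(scores))
    # Assumed scaling rule: steer in proportion to the positive projection,
    # and leave the state untouched when no probe fires.
    scale = alpha * max(float(scores[best]), 0.0)
    return hidden - scale * probes[names[best]], names[best], scale

# Toy example with two orthogonal probe directions.
probes = {
    "insult": np.array([1.0, 0.0, 0.0, 0.0]),
    "bias": np.array([0.0, 1.0, 0.0, 0.0]),
}
hidden = np.array([0.2, 1.5, -0.3, 0.7])
steered, chosen, scale = steer(hidden, probes)
print(chosen, scale, steered)
```

With unit-norm probes and `alpha=1.0`, this removes exactly the component of the hidden state along the selected direction and leaves the other components alone, which is one plausible reason fluency could stay nearly unchanged.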