🤖 AI Summary
This work addresses critical limitations in current knowledge-editing-based detoxification methods, which rely on automatic toxicity classifiers that fail to accurately measure the suppression of harmful behaviors and lack systematic robustness evaluation. The authors propose a multidimensional evaluation framework—assessing optimization stability, compositional generalization, and cross-lingual transfer—and reveal, for the first time, a “pseudo-detoxification” phenomenon: apparent reductions in toxicity scores stem not from genuine content purification but from degraded generation quality. Through comprehensive experiments, they demonstrate that existing approaches are effective only under narrow conditions—specific models, limited editing targets, and select languages—thereby delineating clear boundaries and limitations. This study establishes a reliable benchmark and offers principled guidance for future detoxification research.
📝 Abstract
Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.
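The pseudo-detoxification failure mode described above can be sketched as a simple diagnostic: a drop in classifier toxicity only counts as genuine detoxification if generation quality is preserved after editing. The function name, scores, and threshold below are hypothetical illustrations for exposition, not the paper's actual metrics or implementation.

```python
def is_pseudo_detoxification(tox_before: float, tox_after: float,
                             quality_before: float, quality_after: float,
                             max_quality_drop: float = 0.2) -> bool:
    """Flag cases where the toxicity score falls but generation quality
    (e.g. a fluency score in [0, 1]) collapses alongside it.

    All scores here are hypothetical placeholders; in practice one would
    pair an automatic toxicity classifier with an independent quality
    measure such as perplexity or human fluency judgements.
    """
    toxicity_reduced = tox_after < tox_before
    quality_degraded = (quality_before - quality_after) > max_quality_drop
    return toxicity_reduced and quality_degraded


# Large toxicity drop paired with collapsed quality: likely pseudo-detoxification.
print(is_pseudo_detoxification(0.8, 0.1, 0.9, 0.3))   # True
# Toxicity drop with quality preserved: consistent with genuine suppression.
print(is_pseudo_detoxification(0.8, 0.1, 0.9, 0.85))  # False
```

The key design point, following the abstract's argument, is that a single classifier-based metric is insufficient: the diagnostic only fires when both signals (toxicity reduction and quality degradation) are read together.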