Mitigating Text Toxicity with Counterfactual Generation

📅 2024-05-16
🏛️ arXiv.org
📈 Citations: 0 (influential: 0)
🤖 AI Summary
This work addresses text detoxification: rephrasing text to remove offensive or harmful content while preserving its original non-toxic meaning. The paper proposes a detoxification method based on counterfactual generation and presents itself as the first work to bridge counterfactual generation, a technique from explainable AI (XAI), and text detoxification. The approach applies local feature importance and counterfactual generation methods to a classifier that distinguishes toxic from non-toxic texts. Experiments on three datasets compare the approach with three competing methods. Automatic and human evaluations indicate that recently developed NLP counterfactual generators mitigate toxicity accurately while preserving the meaning of the original text better than classical detoxification methods. The paper also discusses how to manage the polysemous nature of toxicity and the risk of malicious use of detoxification tools.

📝 Abstract
Toxicity mitigation consists in rephrasing text in order to remove offensive or harmful meaning. Neural natural language processing (NLP) models have been widely used to target and mitigate textual toxicity. However, existing methods fail to detoxify text while preserving the initial non-toxic meaning at the same time. In this work, we propose to apply counterfactual generation methods from the eXplainable AI (XAI) field to target and mitigate textual toxicity. In particular, we perform text detoxification by applying local feature importance and counterfactual generation methods to a toxicity classifier distinguishing between toxic and non-toxic texts. We carry out text detoxification through counterfactual generation on three datasets and compare our approach to three competitors. Automatic and human evaluations show that recently developed NLP counterfactual generators can mitigate toxicity accurately while better preserving the meaning of the initial text as compared to classical detoxification methods. Finally, we take a step back from using automated detoxification tools, and discuss how to manage the polysemous nature of toxicity and the risk of malicious use of detoxification tools. This work is the first to bridge the gap between counterfactual generation and text detoxification and paves the way towards more practical application of XAI methods.
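The abstract describes the pipeline only at a high level: score local feature importance against a toxicity classifier, then edit the text until the classifier's decision flips. As a rough illustration of that idea, and not the authors' implementation, the sketch below uses a toy lexicon-based scorer in place of a trained neural classifier, leave-one-out occlusion as the importance method, and a fixed substitution table in place of a learned counterfactual generator. All names, weights, and the 0.5 threshold (`TOXIC_WEIGHTS`, `SUBSTITUTES`, `detoxify`, and so on) are hypothetical.

```python
from typing import Dict, List

# Hypothetical toy lexicon standing in for a trained toxicity classifier.
TOXIC_WEIGHTS: Dict[str, float] = {"stupid": 0.8, "idiot": 0.9, "hate": 0.3}
# Hypothetical neutral replacements; tokens without an entry are deleted.
SUBSTITUTES: Dict[str, str] = {"stupid": "unconvincing", "idiot": "person"}


def _key(token: str) -> str:
    """Normalize a token for lexicon lookup."""
    return token.lower().strip(".,!?")


def toxicity_score(tokens: List[str]) -> float:
    """Toy classifier: noisy-OR combination of per-token weights, in [0, 1]."""
    p_clean = 1.0
    for tok in tokens:
        p_clean *= 1.0 - TOXIC_WEIGHTS.get(_key(tok), 0.0)
    return 1.0 - p_clean


def token_importance(tokens: List[str]) -> List[float]:
    """Leave-one-out importance: score drop when each token is removed."""
    base = toxicity_score(tokens)
    return [
        base - toxicity_score(tokens[:i] + tokens[i + 1:])
        for i in range(len(tokens))
    ]


def detoxify(text: str, threshold: float = 0.5) -> str:
    """Edit the most important tokens until the score falls below threshold."""
    tokens = text.split()
    for _ in range(len(tokens)):  # at most one edit per original token
        if toxicity_score(tokens) < threshold:
            break  # the classifier decision has flipped: counterfactual found
        scores = token_importance(tokens)
        top = max(range(len(tokens)), key=scores.__getitem__)
        if scores[top] <= 0.0:
            break  # no single token lowers the score any further
        substitute = SUBSTITUTES.get(_key(tokens[top]))
        tokens[top:top + 1] = [substitute] if substitute else []
    return " ".join(tokens)


print(detoxify("Your idea is stupid and I hate it"))
# prints: Your idea is unconvincing and I hate it
```

The loop stops as soon as the score drops below the threshold, so the low-weight token "hate" is left untouched. This mirrors the goal stated in the abstract: change as little as possible so that the non-toxic meaning of the original text is preserved.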
Problem

Research questions and friction points this paper is trying to address.

Mitigating text toxicity while preserving original meaning
Applying counterfactual generation for effective detoxification
Addressing polysemous toxicity and risks of detoxification tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses counterfactual generation for detoxification
Applies XAI methods to toxicity classifiers
Preserves non-toxic meaning during detoxification
Authors

Milan Bhan
Ekimetrics, Sorbonne Université, LIP6
Jean-Noël Vittaut
Sorbonne Université, LIP6
Nina Achache
Ekimetrics
Victor Legrand
Ekimetrics
N. Chesneau
Ekimetrics
A. Blangero
Ekimetrics, Aix-Marseille Université
Juliette Murris
Université Paris Cité
Marie-Jeanne Lesot
LIP6, Sorbonne Université