🤖 AI Summary
This work addresses the challenge of preserving original intent and sentiment polarity while detoxifying implicit toxicity in Chinese online text, such as offensive implications conveyed via emojis, homophonic substitutions, or conversational context. We introduce ToxiRewriteCN, the first sentiment-polarity-aligned Chinese detoxification rewriting dataset. It comprises carefully annotated, real-world triplets (toxic input, toxicity span annotation, sentiment-aligned rewrite) across five scenario categories and enforces an explicit sentiment consistency constraint. A comprehensive evaluation of 17 state-of-the-art large language models, including Mixture-of-Experts architectures, shows that existing models consistently sacrifice sentiment fidelity for safety, with notably poor sentiment polarity alignment under implicit toxicity. ToxiRewriteCN is publicly released to advance controllable, sentiment-aware Chinese detoxification research.
📝 Abstract
Detoxifying offensive language while preserving the speaker's original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxicity often arises implicitly through emojis, homophones, or discourse context. We present ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity. The dataset comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced toxicity, homophonic toxicity, and single-turn and multi-turn dialogues. We evaluate 17 LLMs, including commercial and open-source models with diverse architectures, across four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Results show that while commercial and MoE models perform best overall, all models struggle to balance safety with emotional fidelity in more subtle or context-heavy settings such as emoji, homophone, and dialogue-based inputs. We release ToxiRewriteCN to support future research on controllable, sentiment-aware detoxification for Chinese.
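To make the triplet structure described above concrete, the sketch below models one record as a small Python dataclass. The field names and the example sentence are illustrative assumptions, not the dataset's actual schema or contents; spans are assumed to be character offsets into the toxic sentence.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of a ToxiRewriteCN-style triplet; field names are
# assumptions for illustration, not the released dataset's schema.
@dataclass
class DetoxTriplet:
    toxic_text: str                     # original toxic sentence
    toxic_spans: List[Tuple[int, int]]  # assumed (start, end) character offsets
    rewrite: str                        # non-toxic, sentiment-aligned rewrite
    scenario: str                       # one of the five scenario categories
    sentiment: str                      # polarity label the rewrite must preserve

def spans_valid(t: DetoxTriplet) -> bool:
    """Check that every annotated span lies inside the toxic sentence."""
    return all(0 <= s < e <= len(t.toxic_text) for s, e in t.toxic_spans)

# Invented example: span (4, 6) covers the insult "笨蛋" in the input.
example = DetoxTriplet(
    toxic_text="你真是个笨蛋",
    toxic_spans=[(4, 6)],
    rewrite="你这次做得不太好",
    scenario="standard",
    sentiment="negative",
)
print(spans_valid(example))  # True
```

A validity check like this is useful because the sentiment label applies to both the input and the rewrite: a record only supports sentiment-aligned evaluation if its span annotations actually fall inside the toxic sentence.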