Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit poor robustness against perturbations—such as character substitution—in Chinese multimodal toxic content detection. Method: We introduce the first Chinese multimodal toxicity benchmark dataset covering three perturbation categories (character substitution, visual noise, semantic deformation) and eight specific attack techniques; propose the first taxonomy for Chinese toxic multimodal perturbations; and systematically evaluate nine state-of-the-art Chinese and English LLMs under zero-shot, in-context learning (ICL), and supervised fine-tuning (SFT) paradigms. Contribution/Results: All SOTA models suffer >40% average accuracy drop on perturbed inputs; even minimal perturbation-based fine-tuning or prompting induces up to 35% false positives on benign samples—revealing a novel “over-correction” phenomenon in ICL/SFT. This work establishes a reproducible benchmark and a new robustness evaluation paradigm for Chinese multimodal content safety.

📝 Abstract
Detecting toxic content using language models is important but challenging. While large language models (LLMs) have demonstrated strong performance in understanding Chinese, recent studies show that simple character substitutions in toxic Chinese text can easily confuse state-of-the-art (SOTA) LLMs. In this paper, we highlight the multimodal nature of the Chinese language as a key challenge for deploying LLMs in toxic Chinese detection. First, we propose a taxonomy of 3 perturbation strategies and 8 specific approaches in toxic Chinese content. Then, we curate a dataset based on this taxonomy and benchmark 9 SOTA LLMs (from both the US and China) to assess whether they can detect perturbed toxic Chinese text. Additionally, we explore cost-effective enhancement solutions like in-context learning (ICL) and supervised fine-tuning (SFT). Our results reveal two important findings. (1) LLMs are less capable of detecting perturbed multimodal Chinese toxic content. (2) ICL or SFT with a small number of perturbed examples may cause the LLMs to "overcorrect": misidentifying many normal Chinese contents as toxic.
Problem

Research questions and friction points this paper is trying to address.

Detecting perturbed toxic Chinese text challenges SOTA LLMs
Multimodal Chinese language complicates toxic content detection
ICL and SFT may cause overcorrection in LLM toxicity detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed taxonomy for toxic Chinese perturbations
Benchmarked 9 SOTA LLMs on perturbed text
Explored ICL and SFT for cost-effective enhancements
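To make the character-substitution perturbation concrete, here is a minimal sketch (not the paper's code; the substitution table and keyword are hypothetical, benign examples) showing how a homophone swap defeats verbatim keyword matching, which is why robustness must be evaluated on perturbed inputs:

```python
def naive_filter(text: str, blocklist: list[str]) -> bool:
    """Return True if any blocked keyword appears verbatim in the text."""
    return any(word in text for word in blocklist)

def perturb(text: str, substitutions: dict[str, str]) -> str:
    """Replace each character with a homophone/look-alike variant, if one exists."""
    return "".join(substitutions.get(ch, ch) for ch in text)

# Hypothetical homophone substitution table: 微 -> 威, 信 -> 芯
subs = {"微": "威", "信": "芯"}
blocklist = ["微信"]

original = "加我微信"
perturbed = perturb(original, subs)  # "加我威芯" — reads the same aloud

print(naive_filter(original, blocklist))   # True: caught verbatim
print(naive_filter(perturbed, blocklist))  # False: evades the filter
```

The same evasion logic applies to the other taxonomy categories (visual noise, semantic deformation): the surface form changes while a human reader still recovers the toxic meaning.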
Shujian Yang — Shanghai Jiao Tong University, China
Shiyao Cui — Tsinghua University
Chuanrui Hu — Qihoo 360, China
Haicheng Wang — Shanghai Jiao Tong University, China
Tianwei Zhang — Nanyang Technological University, Singapore
Minlie Huang — Tsinghua University, China
Jialiang Lu — Shanghai Jiao Tong University, China
Han Qiu — NTU