🤖 AI Summary
Toxicity detection for Chinese text remains vulnerable to implicit toxic content, particularly when semantic indirectness is combined with surface obfuscation. This work proposes CITA (Chinese Implicit Toxicity Attack), a controlled red-team evaluation and defense-data generation framework with three stages: Harmful Intent Learning, Implicit Toxicity Enhancement, and Obfuscation Variant Rewriting. CITA-generated samples reach an average attack success rate (ASR) of 69.48% across the seven tested toxicity detectors, indicating substantial missed-detection risk in existing defenses. Fine-tuning a Chinese Implicit Toxicity Defense model (CITD) on this red-team data improves its robustness against implicit toxic content, which suggests that adversarial data of this kind is useful for strengthening detectors.
📝 Abstract
Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework, not a deployable evasion tool. CITA proceeds in three stages, (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, designed to preserve harmful intent, increase implicitness, and add controlled surface variants. On CITA-generated evaluation samples, the seven tested detectors exhibit substantial missed-detection risks, with an average attack success rate (ASR) of 69.48%; human evaluation further confirms preserved harmfulness and increased implicitness/evasiveness. As a downstream defense application, we fine-tune a Chinese Implicit Toxicity Defense model (CITD) with CITA-generated red-team data, showing that such data can improve robustness through additional training.
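The reported attack success rate can be read as a missed-detection rate. Below is a minimal sketch of that arithmetic. It assumes that a detector's rate is the share of harmful evaluation samples it fails to flag, and that the average is an unweighted mean over detectors. The abstract does not spell out either definition. The detector names and decisions are made-up placeholders, not results from the paper.

```python
from statistics import mean


def attack_success_rate(flagged: list[bool]) -> float:
    """Share of harmful samples the detector did NOT flag (assumed definition)."""
    if not flagged:
        raise ValueError("need at least one sample")
    missed = sum(1 for was_flagged in flagged if not was_flagged)
    return missed / len(flagged)


def average_asr(decisions: dict[str, list[bool]]) -> float:
    """Unweighted mean of per-detector rates (assumed aggregation)."""
    return mean(attack_success_rate(d) for d in decisions.values())


# Hypothetical toy decisions: True means the detector flagged the sample as toxic.
toy_decisions = {
    "detector_a": [True, False, False, False],
    "detector_b": [False, False, True, True],
}

per_detector_asr = {
    name: attack_success_rate(d) for name, d in toy_decisions.items()
}
avg_asr = average_asr(toy_decisions)
```

Under these assumptions a higher value means more harmful samples slipped past detection, so a defense model would be judged by how far it lowers this number on the same samples.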