🤖 AI Summary
This paper addresses Phonetic Cloaking Replacement (PCR) in Chinese, where offensive intent is concealed through homophonic or near-homophonic substitutions, a major obstacle to content moderation. First, we propose a four-way surface-form taxonomy of PCR and build a dataset of 500 naturally occurring, phonetically cloaked offensive posts collected from the RedNote platform. Second, guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective. Experiments show that the best state-of-the-art LLM reaches an F1 score of only 0.672 on this benchmark and that zero-shot chain-of-thought prompting lowers performance further, whereas Pinyin-based prompting recovers much of the lost accuracy. This work contributes a comprehensive taxonomy of Chinese PCR, a realistic benchmark that exposes the limits of current detectors, and a lightweight mitigation technique for robust toxicity detection.
📝 Abstract
Phonetic Cloaking Replacement (PCR), defined as the deliberate use of homophonic or near-homophonic variants to hide toxic intent, has become a major obstacle to Chinese content moderation. While this problem is well recognized, existing evaluations predominantly rely on rule-based, synthetic perturbations that ignore the creativity of real users. We organize PCR into a four-way surface-form taxonomy and compile a dataset of 500 naturally occurring, phonetically cloaked offensive posts gathered from the RedNote platform. Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness: the best model reaches an F1-score of only 0.672, and zero-shot chain-of-thought prompting pushes performance even lower. Guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective and show that it recovers much of the lost accuracy. This study offers the first comprehensive taxonomy of Chinese PCR, a realistic benchmark that reveals current detectors' limits, and a lightweight mitigation technique that advances research on robust toxicity detection.
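The abstract does not spell out the prompt format, so the following is only a minimal sketch of what Pinyin-based prompting could look like, not the authors' implementation. It assumes a toy character-to-pinyin table (`TOY_PINYIN`, hypothetical; a real system would need a full lexicon or a transliteration library), toneless pinyin, and a made-up prompt wording. The example pair "泥煤" / "你妹" is a common homophonic substitution chosen for illustration and is not taken from the dataset.

```python
# Hypothetical toy lookup table: character -> toneless pinyin.
# A real system would need full coverage (e.g. from a pinyin library).
TOY_PINYIN = {
    "你": "ni", "泥": "ni",
    "妹": "mei", "煤": "mei",
}


def to_pinyin(text: str) -> str:
    """Return space-separated toneless pinyin; unknown characters pass through."""
    return " ".join(TOY_PINYIN.get(ch, ch) for ch in text)


def build_prompt(post: str) -> str:
    """Assemble a zero-shot prompt that shows the post next to its pinyin.

    The wording is an assumption made for illustration only.
    """
    return (
        "Decide whether the following Chinese post is offensive. "
        "Words may be replaced by homophones, so also read the pinyin.\n"
        f"Post: {post}\n"
        f"Pinyin: {to_pinyin(post)}\n"
        "Answer with 'offensive' or 'non-offensive'."
    )


if __name__ == "__main__":
    # The cloaked form and the original share the same toneless pinyin,
    # which is the signal the prompt exposes to the model.
    print(to_pinyin("泥煤") == to_pinyin("你妹"))
    print(build_prompt("泥煤"))
```

Dropping tones is a deliberate choice in this sketch: near-homophones often differ only in tone, so the toneless form makes the cloaked and original words collide.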