ToxicTone: A Mandarin Audio Dataset Annotated for Toxicity and Toxic Utterance Tonality

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
The detection of toxic speech in Mandarin spoken audio suffers from a lack of annotated data and effective multimodal methods. Method: This paper introduces the first large-scale, manually annotated Chinese spoken audio dataset for toxicity detection and fine-grained toxic sentiment classification (e.g., anger, sarcasm, contempt), covering 13 realistic scenarios. It is the first work to systematically distinguish toxicity types from their underlying emotional origins. We propose an end-to-end multimodal framework integrating acoustic features (Whisper + Wav2Vec 2.0), emotion representations (Emotion2Vec), and textual features. Results: Experiments on our held-out test set show that our method achieves over 12% higher F1-score than text-only and unimodal baselines, demonstrating that prosodic cues—such as tone, speech rate, and pauses—are decisive for identifying implicit toxicity in Mandarin speech. This work fills a critical gap in spoken-language toxicity detection research.
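The summary describes a late-fusion pipeline combining acoustic (Whisper + Wav2Vec 2.0), emotion (Emotion2Vec), and textual features. The paper does not spell out the fusion architecture here, so the following is a minimal sketch under assumed embedding sizes and a simple concatenate-then-classify head; all dimensions and the linear classifier are illustrative, not the authors' exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical utterance-level embedding sizes for the three streams
# named in the summary (real encoder dimensions may differ).
D_ACOUSTIC = 768   # e.g., Whisper / Wav2Vec 2.0 acoustic features
D_EMOTION = 256    # e.g., Emotion2Vec emotion representation
D_TEXT = 512       # e.g., text encoder over the transcript

def fuse_and_score(acoustic, emotion, text, W, b):
    """Late fusion: concatenate the three utterance-level embeddings,
    then apply a linear toxicity head with a sigmoid output."""
    fused = np.concatenate([acoustic, emotion, text])
    logit = fused @ W + b
    return 1.0 / (1.0 + np.exp(-logit))

# Dummy features standing in for real encoder outputs.
acoustic = rng.standard_normal(D_ACOUSTIC)
emotion = rng.standard_normal(D_EMOTION)
text = rng.standard_normal(D_TEXT)

# Randomly initialised head; in the paper's setting this would be
# trained end-to-end on the annotated ToxicTone data.
W = rng.standard_normal(D_ACOUSTIC + D_EMOTION + D_TEXT) * 0.01
b = 0.0

score = fuse_and_score(acoustic, emotion, text, W, b)
print(f"toxicity score: {score:.3f}")  # probability in (0, 1)
```

Concatenation is only one fusion choice; attention-based or gated fusion would slot into `fuse_and_score` the same way.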

📝 Abstract
Despite extensive research on toxic speech detection in text, a critical gap remains in handling spoken Mandarin audio. The lack of annotated datasets that capture the unique prosodic cues and culturally specific expressions in Mandarin leaves spoken toxicity underexplored. To address this, we introduce ToxicTone -- the largest public dataset of its kind -- featuring detailed annotations that distinguish both forms of toxicity (e.g., profanity, bullying) and sources of toxicity (e.g., anger, sarcasm, dismissiveness). Our data, sourced from diverse real-world audio and organized into 13 topical categories, mirrors authentic communication scenarios. We also propose a multimodal detection framework that integrates acoustic, linguistic, and emotional features using state-of-the-art speech and emotion encoders. Extensive experiments show our approach outperforms text-only and baseline models, underscoring the essential role of speech-specific cues in revealing hidden toxic expressions.
Problem

Research questions and friction points this paper is trying to address.

Lack of annotated Mandarin audio datasets for toxicity detection
Underexplored prosodic and cultural cues in Mandarin toxic speech
Need for multimodal detection integrating acoustic, linguistic, and emotional features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest annotated Mandarin audio dataset for toxicity
Multimodal framework with acoustic, linguistic, emotional features
Outperforms text-only models using speech-specific cues
Yu-Xiang Luo
National Taiwan University, Taiwan
Yi-Cheng Lin
National Taiwan University
Speech Processing, Machine Learning, Fairness
Ming-To Chuang
National Taiwan University, Taiwan
Jia-Hung Chen
National Taiwan University, Taiwan
I-Ning Tsai
National Taiwan University, Taiwan
Pei Xing Kiew
National Taiwan University, Taiwan
Yueh-Hsuan Huang
National Taiwan University, Taiwan
Chien-Feng Liu
National Taiwan University, Taiwan
Yu-Chen Chen
National Taiwan University, Taiwan
Bo-Han Feng
National Taiwan University, Taiwan
Wenze Ren
National Taiwan University; PhD student, Academia Sinica Bio-ASP & NTU SPML Lab
Audio-visual
Hung-yi Lee
National Taiwan University
deep learning, spoken language understanding, speech processing