🤖 AI Summary
Existing sensitive content detection tools for social media suffer from limited customisability, narrow category coverage (particularly lacking long-tail classes such as drug-related and self-harm content), high privacy risks, and the absence of a unified evaluation benchmark. To address these issues, this work introduces the first high-quality, uniformly annotated dataset covering six sensitive content categories: conflictual language, profanity, sexually explicit material, drug-related content, self-harm, and spam. We establish standardised protocols for data collection and human annotation. Leveraging this dataset, we perform supervised fine-tuning of open-source large language models (e.g., LLaMA) and design a comprehensive, multi-dimensional evaluation benchmark. Experimental results demonstrate that our approach consistently outperforms both the LLaMA baseline and the OpenAI API across all six detection tasks, achieving average improvements of 10–15%. Gains are especially pronounced for scarce categories (e.g., drug-related and self-harm content), validating the effectiveness and deployability of fine-tuned open-source LLMs for fine-grained sensitive content identification.
📝 Abstract
The detection of sensitive content in large datasets is crucial for ensuring that shared and analysed data is free from harmful material. However, current moderation tools, such as external APIs, suffer from limited customisation, inconsistent accuracy across diverse sensitive categories, and privacy concerns. Additionally, existing datasets and open-source models focus predominantly on toxic language, leaving gaps in the detection of other sensitive categories such as substance abuse or self-harm. In this paper, we put forward a unified dataset tailored for social media content moderation across six sensitive categories: conflictual language, profanity, sexually explicit material, drug-related content, self-harm, and spam. By collecting and annotating data with consistent retrieval strategies and guidelines, we address the shortcomings of previous, narrowly focused research. Our analysis demonstrates that fine-tuning large language models (LLMs) on this novel dataset yields significant improvements in detection performance compared to off-the-shelf open models such as LLaMA, and even proprietary OpenAI models, which underperform ours by 10-15% overall. The gap is even more pronounced for popular moderation APIs, which cannot be easily tailored to specific sensitive content categories.