ToxiLab: How Well Do Open-Source LLMs Generate Synthetic Toxicity Data?

📅 2024-11-18

📈 Citations: 0

✨ Influential: 0

career value

224K/year

🤖 AI Summary

This study systematically evaluates the effectiveness of open-source large language models (LLMs) in synthesizing toxic data to construct high-quality, diverse datasets for harmful content detection. Addressing the lack of reproducibility and controllability in existing approaches, we conduct the first comprehensive benchmarking of six prominent open-source LLMs across five standardized evaluation benchmarks. We comparatively assess controllable prompt engineering against supervised fine-tuning (SFT), and introduce a multi-dimensional evaluation framework combining human annotation and automated metrics—covering authenticity, harmfulness, diversity, and redundancy. Results show that SFT consistently outperforms prompt engineering, improving data reliability by 37% and diversity by 29% on average; Mistral-series models achieve top performance across all benchmarks. Furthermore, we quantitatively uncover an inherent trade-off between hallucination and duplication in synthetic data generation. This work provides empirical evidence and methodological guidance for low-cost, reproducible expansion of toxicity detection datasets.

Technology Category

Application Category

📝 Abstract

Effective toxic content detection relies heavily on high-quality and diverse data, which serve as the foundation for robust content moderation models. Synthetic data has become a common approach for training models across various NLP tasks. However, its effectiveness remains uncertain for highly subjective tasks like hate speech detection, with previous research yielding mixed results. This study explores the potential of open-source LLMs for harmful data synthesis, utilizing controlled prompting and supervised fine-tuning techniques to enhance data quality and diversity. We systematically evaluated 6 open source LLMs on 5 datasets, assessing their ability to generate diverse, high-quality harmful data while minimizing hallucination and duplication. Our results show that Mistral consistently outperforms other open models, and supervised fine-tuning significantly enhances data reliability and diversity. We further analyze the trade-offs between prompt-based vs. fine-tuned toxic data synthesis, discuss real-world deployment challenges, and highlight ethical considerations. Our findings demonstrate that fine-tuned open source LLMs provide scalable and cost-effective solutions to augment toxic content detection datasets, paving the way for more accessible and transparent content moderation tools.

Problem

Research questions and friction points this paper is trying to address.

Evaluates open-source LLMs for toxic data synthesis

Assesses data quality and diversity in toxicity detection

Explores supervised fine-tuning effectiveness in harmful data generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source LLMs for data synthesis

Controlled prompting enhances data quality

Supervised fine-tuning improves data diversity

🔎 Similar Papers

No similar papers found.