Synthetic vs. Gold: The Role of LLM-Generated Labels and Data in Cyberbullying Detection

📅 2025-02-21
🤖 AI Summary
This work investigates whether synthetic data generated by large language models (LLMs) can effectively substitute for or augment real labeled data to improve cyberbullying detection. We systematically generate high-quality synthetic texts and labels using diverse state-of-the-art LLMs (including GPT, Llama, and Qwen) and train binary classifiers within standard NLP preprocessing and supervised learning pipelines. Results show that models trained solely on synthetic data achieve performance nearly matching that of real-data baselines; furthermore, hybrid training on real and synthetic data consistently improves accuracy by 2.1–3.8% across all three LLM families. This study provides the first empirical validation of LLM-generated synthetic data for cyberbullying detection, reveals critical sensitivity to LLM selection, and establishes synthetic data augmentation as a scalable, privacy-preserving, and ethically compliant paradigm for dataset enhancement.

📝 Abstract
This study investigates the role of LLM-generated synthetic data in cyberbullying detection. We conduct a series of experiments where we replace some or all of the authentic data with synthetic data, or augment the authentic data with synthetic data. We find that synthetic cyberbullying data can be the basis for training a classifier for harm detection that reaches performance close to that of a classifier trained with authentic data. Combining authentic with synthetic data improves over the baseline of training on authentic data alone on the test data for all three LLMs tried. These results highlight the viability of synthetic data as a scalable, ethically viable alternative in cyberbullying detection while emphasizing the critical impact of LLM selection on performance outcomes.
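
The three training regimes described in the abstract (authentic only, partial or full replacement, and augmentation) can be sketched as a single data-mixing helper. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_training_set`, its `mode` and `replace_fraction` parameters, and the placeholder examples are all hypothetical, and the paper's actual sampling procedure may differ.

```python
import random


def build_training_set(authentic, synthetic, mode, replace_fraction=1.0, seed=0):
    """Return a list of (text, label) pairs for one training regime.

    Hypothetical helper for illustration only.
    mode:
      "authentic" - baseline: authentic examples only
      "replace"   - swap a fraction of authentic examples for synthetic ones
      "augment"   - keep every authentic example and add every synthetic one
    """
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    authentic, synthetic = list(authentic), list(synthetic)

    if mode == "authentic":
        return authentic
    if mode == "augment":
        return authentic + synthetic
    if mode == "replace":
        if not 0.0 <= replace_fraction <= 1.0:
            raise ValueError("replace_fraction must be in [0, 1]")
        n_swap = round(replace_fraction * len(authentic))
        if n_swap > len(synthetic):
            raise ValueError("not enough synthetic examples to swap in")
        # Keep a random subset of authentic data, fill the rest with synthetic.
        kept = rng.sample(authentic, len(authentic) - n_swap)
        added = rng.sample(synthetic, n_swap)
        return kept + added
    raise ValueError(f"unknown mode: {mode!r}")


# Placeholder data; labels are 1 (bullying) or 0 (not bullying).
authentic_data = [(f"authentic message {i}", i % 2) for i in range(4)]
synthetic_data = [(f"synthetic message {i}", i % 2) for i in range(4)]

baseline = build_training_set(authentic_data, synthetic_data, "authentic")
half_replaced = build_training_set(authentic_data, synthetic_data, "replace", 0.5)
fully_synthetic = build_training_set(authentic_data, synthetic_data, "replace", 1.0)
augmented = build_training_set(authentic_data, synthetic_data, "augment")
```

In the replacement regimes the training set keeps the size of the authentic set, so any difference in classifier performance reflects the data source and not the amount of data; augmentation instead grows the training set.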
Problem

Research questions and friction points this paper is trying to address.

LLM-generated synthetic data in cyberbullying detection
Synthetic vs. authentic data performance
Ethical scalability in cyberbullying detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated synthetic data
Augmenting authentic data
Ethical, scalable alternative
Arefeh Kazemi
Dublin City University
Natural Language Processing, Machine Translation, Question Answering
Sri Balaaji Natarajan Kalaivendan
School of Computing, ADAPT Centre, Dublin City University, Dublin, Ireland
Joachim Wagner
ADAPT Centre, NCLT, School of Computing, Dublin City University
Natural Language Processing
Hamza Qadeer
School of Computing, ADAPT Centre, Dublin City University, Dublin, Ireland
Brian Davis
School of Computing, ADAPT Centre, Dublin City University, Dublin, Ireland