SynBullying: A Multi-LLM Synthetic Conversational Dataset for Cyberbullying Detection

📅 2025-10-30
🤖 AI Summary
Current cyberbullying (CB) detection research suffers from a lack of scalable, ethically compliant, multi-turn dialogue datasets with fine-grained annotations. Method: We introduce the first large-scale synthetic multi-turn CB dialogue dataset, generated collaboratively by multiple large language models (LLMs) via prompt engineering and role-playing to ensure contextual coherence across bullying and non-bullying scenarios. We propose a novel contextualized annotation schema capturing intent, discourse dynamics, and harm intensity—enabling ethical modeling without real-user data. Contribution/Results: Evaluation across five dimensions—including realism and diversity—demonstrates superior quality. The dataset supports both standalone model training and effective augmentation of existing CB detectors, yielding significant improvements in classification performance on benchmark CB detection tasks.
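The summary above mentions a contextualized annotation schema capturing intent, discourse dynamics, and harm intensity at the conversation level. A minimal sketch of what such a schema might look like is below; all field names and label values (`speaker`, `intent`, `harm_intensity`, `cb_type`) are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema sketch: field names and label sets are illustrative,
# not the exact annotation schema used by SynBullying.
@dataclass
class Turn:
    speaker: str         # role in the simulated dialogue, e.g. "aggressor", "target", "bystander"
    text: str
    intent: str          # e.g. "insult", "exclusion", "support", "neutral"
    harm_intensity: int  # e.g. 0 (none) to 3 (severe)

@dataclass
class Conversation:
    turns: list = field(default_factory=list)
    is_bullying: bool = False
    cb_type: str = "none"  # fine-grained CB category, e.g. "verbal", "exclusion"

    def max_harm(self) -> int:
        # Harm is assessed over the whole conversational flow, not isolated posts.
        return max((t.harm_intensity for t in self.turns), default=0)

conv = Conversation(
    turns=[
        Turn("aggressor", "Nobody wants you in this group chat.", "exclusion", 2),
        Turn("target", "Why are you saying that?", "neutral", 0),
        Turn("bystander", "Leave them alone.", "support", 0),
    ],
    is_bullying=True,
    cb_type="exclusion",
)
print(conv.max_harm())  # highest per-turn harm intensity in the dialogue
```

Keeping per-turn labels alongside a conversation-level verdict is what allows context-aware judgments: the same utterance can be harmless or harmful depending on the surrounding turns.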

📝 Abstract
We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across five dimensions, including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by testing its performance as standalone training data and as an augmentation source for CB classification.
Problem

Research questions and friction points this paper is trying to address.

Scarcity of scalable, ethically sourced cyberbullying datasets
Existing corpora capture isolated posts rather than multi-turn conversations
Lack of fine-grained, context-aware annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset generated by multiple large language models
Context-aware annotations within conversational flow
Fine-grained labeling for detailed cyberbullying analysis
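The paper evaluates SynBullying both as standalone training data and as an augmentation source for existing CB classifiers. A minimal sketch of the augmentation idea is below, assuming a simple mixing policy; the function name, `ratio` parameter, and toy examples are hypothetical, not the paper's experimental setup.

```python
import random

def augment(real, synthetic, ratio=0.5, seed=0):
    """Mix synthetic examples into a real training set.

    `ratio` is the number of synthetic examples added per real example.
    This mixing policy is illustrative, not the paper's exact procedure.
    """
    rng = random.Random(seed)
    n_syn = min(len(synthetic), int(len(real) * ratio))
    mixed = list(real) + rng.sample(list(synthetic), n_syn)
    rng.shuffle(mixed)
    return mixed

# Toy (text, label) pairs; 1 = bullying, 0 = not bullying.
real = [("you ok?", 0), ("great game!", 0),
        ("nobody likes you", 1), ("see you later", 0)]
synthetic = [("you're pathetic, leave", 1),
             ("we all think you should quit", 1),
             ("thanks for helping me", 0)]

train = augment(real, synthetic, ratio=0.5)
print(len(train))  # 4 real + 2 synthetic = 6
```

Controlling the synthetic-to-real ratio matters because synthetic dialogues can dominate the label distribution if added without limit; a fixed seed keeps the mix reproducible across runs.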
Arefeh Kazemi
Dublin City University
Natural Language Processing · Machine Translation · Question Answering
Hamza Qadeer
School of Computing, ADAPT Centre, Dublin City University, Dublin, Ireland
Joachim Wagner
ADAPT Centre, NCLT, School of Computing, Dublin City University
Natural Language Processing
Hossein Hosseini
University of Isfahan, Isfahan, Iran
Sri Balaaji Natarajan Kalaivendan
School of Computing, ADAPT Centre, Dublin City University, Dublin, Ireland
Brian Davis
School of Computing, ADAPT Centre, Dublin City University, Dublin, Ireland