🤖 AI Summary
Current cyberbullying (CB) detection research suffers from a lack of scalable, ethically compliant, multi-turn dialogue datasets with fine-grained annotations. Method: We introduce the first large-scale synthetic multi-turn CB dialogue dataset, generated collaboratively by multiple large language models (LLMs) through prompt engineering and role-playing to ensure contextual coherence across bullying and non-bullying scenarios. We propose a novel contextualized annotation schema that captures intent, discourse dynamics, and harm intensity, enabling ethical modeling without real-user data. Contribution/Results: Evaluation across five dimensions, including realism and diversity, demonstrates the dataset's quality. The dataset supports both standalone model training and augmentation of existing CB detectors, yielding significant improvements in classification performance on benchmark CB detection tasks.
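The multi-LLM role-playing generation described above can be sketched roughly as follows. This is a hypothetical illustration, not the authors' pipeline: `llm_reply` is a deterministic stub standing in for a real LLM call, and the persona prompts, role names, and annotation fields (`intent`, `harm_intensity`) are illustrative assumptions.

```python
from dataclasses import dataclass

def llm_reply(role: str, persona_prompt: str, history: list[str]) -> str:
    # Stub: a real implementation would send persona_prompt plus the
    # dialogue history to an LLM API and return its generated turn.
    return f"[{role} turn {len(history) + 1}]"

@dataclass
class AnnotatedTurn:
    speaker: str
    text: str
    # Placeholders for a contextualized annotation schema covering
    # intent, discourse dynamics, and harm intensity.
    intent: str = "unlabeled"
    harm_intensity: int = 0  # e.g. 0 (none) .. 3 (severe)

def simulate_dialogue(n_turns: int = 6) -> list[AnnotatedTurn]:
    # Two LLM personas alternate turns so each reply is conditioned on
    # the full conversation so far, keeping the exchange coherent.
    roles = ["aggressor", "target"]
    personas = {
        "aggressor": "You escalate conflict in a school chat scenario.",
        "target": "You respond to hostile messages in a school chat scenario.",
    }
    history: list[str] = []
    dialogue: list[AnnotatedTurn] = []
    for t in range(n_turns):
        role = roles[t % 2]
        text = llm_reply(role, personas[role], history)
        history.append(text)
        dialogue.append(AnnotatedTurn(speaker=role, text=text))
    return dialogue
```

The key design point the sketch captures is that turns are generated sequentially with shared history, so harmfulness annotations can later be assigned in context rather than per isolated message.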
📝 Abstract
We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow, considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across multiple dimensions: conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further assess its utility both as standalone training data and as an augmentation source for CB classification.
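The augmentation use case mentioned above can be sketched as below. This is a minimal, hypothetical setup: the toy examples, labels, and the `[SEP]`-joined flattening of multi-turn dialogues into single classifier inputs are assumptions for illustration, not the paper's actual preprocessing.

```python
def flatten_dialogue(turns: list[str], label: int) -> dict:
    # Join a multi-turn exchange into one training example carrying a
    # dialogue-level CB label, so a standard text classifier can consume it.
    return {"text": " [SEP] ".join(turns), "label": label}

# Toy stand-in for an existing CB training set (1 = bullying, 0 = not).
real_train = [
    {"text": "you are so dumb lol", "label": 1},
    {"text": "great game yesterday!", "label": 0},
]

# Toy stand-ins for synthetic multi-turn dialogues with dialogue-level labels.
synthetic_dialogues = [
    (["nobody likes you", "please stop", "everyone agrees with me"], 1),
    (["want to join our study group?", "sure, sounds good"], 0),
]

# Augmentation: concatenate real and flattened synthetic examples before
# training any off-the-shelf CB classifier on the combined set.
augmented = real_train + [flatten_dialogue(t, y) for t, y in synthetic_dialogues]
```

The same flattened records could also be used on their own, which corresponds to the standalone-training evaluation the abstract mentions.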