🤖 AI Summary
Current cyberbullying (CB) detection research suffers from a lack of scalable, ethically compliant, multi-turn dialogue datasets with fine-grained annotations. Method: We introduce the first large-scale synthetic multi-turn CB dialogue dataset, generated collaboratively by multiple large language models (LLMs) through prompt engineering and role-playing to ensure contextual coherence across bullying and non-bullying scenarios. We propose a novel contextualized annotation schema that captures intent, discourse dynamics, and harm intensity, enabling ethical modeling without real-user data. Contribution/Results: Evaluation across five dimensions, including realism and diversity, demonstrates the dataset's quality. The dataset supports both standalone model training and augmentation of existing CB detectors, yielding significant improvements in classification performance on benchmark CB detection tasks.
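The multi-LLM role-playing generation described above can be sketched roughly as follows. This is a hypothetical illustration, not the authors' pipeline: `llm_reply` is a deterministic stub standing in for a real LLM call, and the persona prompts, role names, and annotation fields (`intent`, `harm_intensity`) are illustrative assumptions.

```python
from dataclasses import dataclass

def llm_reply(role: str, persona_prompt: str, history: list[str]) -> str:
    # Stub: a real implementation would send persona_prompt plus the
    # dialogue history to an LLM API and return its generated turn.
    return f"[{role} turn {len(history) + 1}]"

@dataclass
class AnnotatedTurn:
    speaker: str
    text: str
    # Placeholders for a contextualized annotation schema covering
    # intent, discourse dynamics, and harm intensity.
    intent: str = "unlabeled"
    harm_intensity: int = 0  # e.g. 0 (none) .. 3 (severe)

def simulate_dialogue(n_turns: int = 6) -> list[AnnotatedTurn]:
    # Two LLM personas alternate turns so each reply is conditioned on
    # the full conversation so far, keeping the exchange coherent.
    roles = ["aggressor", "target"]
    personas = {
        "aggressor": "You escalate conflict in a school chat scenario.",
        "target": "You respond to hostile messages in a school chat scenario.",
    }
    history: list[str] = []
    dialogue: list[AnnotatedTurn] = []
    for t in range(n_turns):
        role = roles[t % 2]
        text = llm_reply(role, personas[role], history)
        history.append(text)
        dialogue.append(AnnotatedTurn(speaker=role, text=text))
    return dialogue
```

The key design point the sketch captures is that turns are generated sequentially with shared history, so harmfulness annotations can later be assigned in context rather than per isolated message.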
📝 Abstract
We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow, considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across multiple dimensions: conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further assess its utility both as standalone training data and as an augmentation source for CB classification.
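The augmentation use case mentioned above can be sketched as below. This is a minimal, hypothetical setup: the toy examples, labels, and the `[SEP]`-joined flattening of multi-turn dialogues into single classifier inputs are assumptions for illustration, not the paper's actual preprocessing.

```python
def flatten_dialogue(turns: list[str], label: int) -> dict:
    # Join a multi-turn exchange into one training example carrying a
    # dialogue-level CB label, so a standard text classifier can consume it.
    return {"text": " [SEP] ".join(turns), "label": label}

# Toy stand-in for an existing CB training set (1 = bullying, 0 = not).
real_train = [
    {"text": "you are so dumb lol", "label": 1},
    {"text": "great game yesterday!", "label": 0},
]

# Toy stand-ins for synthetic multi-turn dialogues with dialogue-level labels.
synthetic_dialogues = [
    (["nobody likes you", "please stop", "everyone agrees with me"], 1),
    (["want to join our study group?", "sure, sounds good"], 0),
]

# Augmentation: concatenate real and flattened synthetic examples before
# training any off-the-shelf CB classifier on the combined set.
augmented = real_train + [flatten_dialogue(t, y) for t, y in synthetic_dialogues]
```

The same flattened records could also be used on their own, which corresponds to the standalone-training evaluation the abstract mentions.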