🤖 AI Summary
Detecting conflict behavior on social media faces two challenges: high-quality labeled data is scarce, and large language model (LLM) safety guardrails impede the generation of offensive content. To address both, this paper proposes PromptAug, an LLM-based data augmentation method that integrates prompt engineering with social science paradigms to generate conflict-related training text despite those guardrails. The evaluation compares PromptAug with other augmentation methods through extreme data scarcity experiments, quantitative diversity analysis, and a qualitative thematic analysis. PromptAug achieves statistically significant improvements of 2% in both accuracy and F1-score on conflict and emotion datasets. The thematic analysis identifies four problematic patterns in augmented text: Linguistic Fluidity, Humour Ambiguity, Augmented Content Ambiguity, and Augmented Content Misinterpretation. These results support PromptAug as an effective augmentation method for sensitive, data-scarce classification tasks such as conflict detection.
📝 Abstract
Given the rise of conflicts on social media, effective classification models to detect harmful behaviours are essential. Following the garbage-in-garbage-out maxim, machine learning performance depends heavily on training data quality. However, high-quality labelled data, especially for nuanced tasks like identifying conflict behaviours, is limited, expensive, and difficult to obtain. Additionally, as social media platforms increasingly restrict access to research data, text data augmentation is gaining attention as an alternative way to generate training data. Augmenting conflict-related data poses unique challenges due to Large Language Model (LLM) guardrails that prevent generation of offensive content. This paper introduces PromptAug, an innovative LLM-based data augmentation method. PromptAug achieves statistically significant improvements of 2% in both accuracy and F1-score on conflict and emotion datasets. To thoroughly evaluate PromptAug against other data augmentation methods, we conduct a robust evaluation using extreme data scarcity scenarios, quantitative diversity analysis, and a qualitative thematic analysis. The thematic analysis identifies four problematic patterns in augmented text: Linguistic Fluidity, Humour Ambiguity, Augmented Content Ambiguity, and Augmented Content Misinterpretation.
Overall, this work presents PromptAug as an effective method for augmenting data in sensitive tasks like conflict detection, offering a unique, interdisciplinary evaluation grounded in both natural language processing and social science methodology.