AI Summary
To address data scarcity and inconsistent labeling criteria for harmful content detection in low-resource settings, this paper proposes ToxiCraft, a framework that generates high-fidelity, diverse toxic texts from minimal seed data. Methodologically, ToxiCraft introduces a novel synthesis paradigm integrating semantic-controllable perturbation with toxicity-aligned distillation, combining prompt-driven generation, adversarial toxicity enhancement, consistency-based filtering, and lightweight discriminator-guided refinement. This design significantly improves model robustness against spurious features and cross-domain generalization. Experiments across multiple benchmarks demonstrate substantial gains in detection accuracy and robustness; generated samples achieve performance on par with human-annotated data, effectively reducing reliance on large-scale manual annotation.
Abstract
Across NLP tasks, detecting harmful content is crucial for online environments, especially with the growing influence of social media. However, previous research has two main issues: 1) a lack of data in low-resource settings, and 2) inconsistent definitions and criteria for judging harmful content, which require classification models to be robust to spurious features and able to handle diverse content. We propose ToxiCraft, a novel framework for synthesizing datasets of harmful information to address these weaknesses. With only a small amount of seed data, our framework can generate a wide variety of synthetic yet remarkably realistic examples of toxic information. Experiments across various datasets show a notable improvement in the robustness and adaptability of detection models, with results that surpass or come close to those obtained with gold labels.