ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation

πŸ“… 2026-01-06
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses a key limitation of existing toxic language classification methods: the lack of class-controllable text augmentation techniques, which hinders model robustness. To overcome this, the authors propose ToxiGAN, a class-aware text augmentation framework that integrates a generative adversarial network (GAN) architecture with semantic guidance from large language models (LLMs). Through a two-stage targeted adversarial training process, ToxiGAN dynamically selects LLM-generated neutral texts as semantic anchors to steer the generation of high-fidelity, class-specific toxic samples, effectively mitigating mode collapse and semantic drift. Evaluated on four hate speech benchmarks, ToxiGAN significantly outperforms current data augmentation approaches on both macro-F1 and hate-F1 metrics.
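The dynamic anchor selection described above can be sketched roughly as follows. The paper does not specify the selection criterion, so this minimal sketch assumes cosine similarity in an embedding space; the function name and shapes are illustrative, not the authors' implementation:

```python
import numpy as np

def select_semantic_anchors(sample_embs: np.ndarray,
                            neutral_embs: np.ndarray) -> np.ndarray:
    """For each generated sample, pick the most similar LLM-generated
    neutral text (assumed: by cosine similarity) as its semantic anchor.

    sample_embs:  (num_samples, dim) embeddings of generated texts
    neutral_embs: (num_neutrals, dim) embeddings of candidate neutral texts
    Returns the index of the chosen neutral exemplar for each sample.
    """
    # Row-normalize so the dot product equals cosine similarity.
    s = sample_embs / np.linalg.norm(sample_embs, axis=1, keepdims=True)
    n = neutral_embs / np.linalg.norm(neutral_embs, axis=1, keepdims=True)
    sims = s @ n.T                 # (num_samples, num_neutrals)
    return sims.argmax(axis=1)     # nearest neutral exemplar per sample
```

Selecting anchors per batch (rather than fixing them once) is what makes the guidance "dynamic": each generated sample is paired with the neutral text it is currently closest to.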

πŸ“ Abstract
Augmenting toxic language data in a controllable and class-specific manner is crucial for improving robustness in toxicity classification, yet remains challenging due to limited supervision and distributional skew. We propose ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs). To address common issues in GAN-based augmentation such as mode collapse and semantic drift, ToxiGAN introduces a two-step directional training strategy and leverages LLM-generated neutral texts as semantic ballast. Unlike prior work that treats LLMs as static generators, our approach dynamically selects neutral exemplars to provide balanced guidance. Toxic samples are explicitly optimized to diverge from these exemplars, reinforcing class-specific contrastive signals. Experiments on four hate speech benchmarks show that ToxiGAN achieves the strongest average performance in both macro-F1 and hate-F1, consistently outperforming traditional and LLM-based augmentation methods. Ablation and sensitivity analyses further confirm the benefits of semantic ballast and directional training in enhancing classifier robustness.
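The abstract's core idea, optimizing toxic samples to diverge from neutral exemplars while remaining class-specific, can be illustrated with a toy directional objective. This is a hedged sketch, not the paper's loss: the pull/push decomposition, the margin hinge, and all names (`directional_loss`, `margin`, `lam`) are assumptions for illustration only:

```python
import numpy as np

def directional_loss(gen_embs: np.ndarray,
                     anchor_embs: np.ndarray,
                     class_centroid: np.ndarray,
                     margin: float = 0.5,
                     lam: float = 1.0) -> float:
    """Toy directional objective: pull generated samples toward the target
    toxic-class centroid while pushing them at least `margin` away from
    their paired neutral anchors (the "semantic ballast")."""
    # Attraction: mean distance to the target class centroid.
    pull = np.linalg.norm(gen_embs - class_centroid, axis=1).mean()
    # Repulsion: hinge penalty when a sample sits closer than
    # `margin` to its neutral anchor.
    dist_to_anchor = np.linalg.norm(gen_embs - anchor_embs, axis=1)
    push = np.maximum(0.0, margin - dist_to_anchor).mean()
    return float(pull + lam * push)
```

The push term is what supplies the class-specific contrastive signal: samples that drift toward neutral semantics incur a penalty, while the pull term keeps them anchored to the intended class.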
Problem

Research questions and friction points this paper is trying to address.

toxic data augmentation
class-specific augmentation
distributional skew
limited supervision
toxicity classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

ToxiGAN
adversarial generation
LLM-guided augmentation
semantic ballast
directional training