HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

📅 2025-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) pose a novel threat to hate speech detection systems by generating persuasive, contextually adaptive hateful content that evades existing detectors. Method: We construct the first benchmark dedicated to LLM-generated hate speech, comprising 7,838 samples from six mainstream LLMs across 34 identity groups, and systematically evaluate the robustness of eight representative hate speech detectors. Contribution/Results: We empirically uncover a critical degradation trend: detector performance declines significantly on content from successive LLM versions. Furthermore, we propose and validate the first LLM-driven automated hate campaign framework, combining adversarial text perturbations with model stealing attacks. The strongest adversarial attack achieves an attack success rate of 0.966, and model stealing improves attack efficiency by a further 13–21× with acceptable attack performance. Our work establishes a foundational benchmark, introduces a novel attack methodology, and provides empirical evidence essential for evaluating and hardening hate speech detection systems in the LLM era.
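
To make the benchmarking step concrete, below is a minimal sketch of scoring one off-the-shelf detector on a labeled sample file. The checkpoint name, file name, field schema, and label strings are illustrative assumptions; the paper's actual HateBench harness, detector set, and data format are not shown in this summary.

```python
# Hedged sketch: scoring a candidate hate speech detector on labeled samples.
import json

from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline

# Assumed stand-in detector from the Hugging Face hub, not necessarily one
# of the eight detectors evaluated in the paper.
detector = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
)

# Hypothetical export of the benchmark: one JSON object per line with
# "text" and a binary "is_hate" field (1 = hate, 0 = non-hate).
with open("hatebench_samples.jsonl") as fh:
    samples = [json.loads(line) for line in fh]

texts = [s["text"] for s in samples]
gold = [s["is_hate"] for s in samples]

# "hate" is the positive label name used by this particular checkpoint
# (an assumption for other detectors, which may use different label sets).
results = detector(texts, batch_size=32, truncation=True)
preds = [int(r["label"] == "hate") for r in results]

print(f"accuracy={accuracy_score(gold, preds):.3f}  f1={f1_score(gold, preds):.3f}")
```

In a full benchmark run, the same loop would be repeated per detector and per generating LLM, so that per-model-version trends like the degradation reported above become visible.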

📝 Abstract
Large Language Models (LLMs) have raised increasing concerns about their misuse in generating hate speech. Among all the efforts to address this issue, hate speech detectors play a crucial role. However, the effectiveness of different detectors against LLM-generated hate speech remains largely unknown. In this paper, we propose HateBench, a framework for benchmarking hate speech detectors on LLM-generated hate speech. We first construct a hate speech dataset of 7,838 samples generated by six widely-used LLMs covering 34 identity groups, with meticulous annotations by three labelers. We then assess the effectiveness of eight representative hate speech detectors on the LLM-generated dataset. Our results show that while detectors are generally effective in identifying LLM-generated hate speech, their performance degrades with newer versions of LLMs. We also reveal the potential of LLM-driven hate campaigns, a new threat that LLMs bring to the field of hate speech detection. By leveraging advanced techniques like adversarial attacks and model stealing attacks, the adversary can intentionally evade the detector and automate hate campaigns online. The most potent adversarial attack achieves an attack success rate of 0.966, and its attack efficiency can be further improved by 13–21× through model stealing attacks with acceptable attack performance. We hope our study can serve as a call to action for the research community and platform moderators to fortify defenses against these emerging threats.
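
To illustrate why model stealing improves attack efficiency, here is a minimal sketch of surrogate-guided evasion: candidate perturbations are screened for free against a locally stolen surrogate detector, and the rate-limited target detector is queried only to verify promising candidates. This query-saving pattern is the kind of gain the abstract attributes to model stealing; the homoglyph operator, function signatures, and budget below are illustrative assumptions, not the paper's actual attack.

```python
# Hedged sketch of surrogate-guided evasion against a hate speech detector.
import random

# Latin -> Cyrillic look-alike substitutions (one simple perturbation operator;
# the paper's adversarial attacks are only summarized in the abstract above).
HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "i": "і"}


def perturb(text: str) -> str:
    """Swap one random character for a visually similar homoglyph."""
    positions = [i for i, c in enumerate(text) if c in HOMOGLYPHS]
    if not positions:
        return text
    i = random.choice(positions)
    return text[:i] + HOMOGLYPHS[text[i]] + text[i + 1:]


def evade(text, surrogate_is_hate, target_is_hate, budget=100):
    """Search perturbations on the cheap local surrogate; query the target rarely.

    Both predicates are caller-supplied callables returning True for "hate"
    (hypothetical interfaces). Returns (adversarial_text, target_queries) on
    success, or (None, target_queries) if the budget is exhausted.
    """
    target_queries = 0
    candidate = text
    for _ in range(budget):
        candidate = perturb(candidate)
        if surrogate_is_hate(candidate):
            continue                      # free local check failed; keep searching
        target_queries += 1               # paid/remote verification query
        if not target_is_hate(candidate):
            return candidate, target_queries
    return None, target_queries
```

Because most candidates are rejected by the surrogate without touching the target, the number of target queries per successful evasion drops sharply relative to attacking the target directly.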
Problem

Research questions and friction points this paper is trying to address.

Hate Speech Detection
Large Language Models
Malicious Content
Innovation

Methods, ideas, or system contributions that make the work stand out.

HateBench
hate speech detection
large language models