A Modular Taxonomy for Hate Speech Definitions and Its Impact on Zero-Shot LLM Classification Performance

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the instability of zero-shot large language model (LLM) classification for hate speech, stemming from ambiguous and heterogeneous definitions. We propose the first modular taxonomy of hate speech definitions, comprising 14 composable conceptual elements grounded in a systematic literature review. We rigorously evaluate how varying definitions—structured via different combinations and abstraction levels of these elements—affect zero-shot classification performance across three data regimes: synthetic, human-AI collaborative, and real-world datasets. Results show that definition composition and granularity significantly influence LLM performance, with effects varying across model architectures; no universally optimal definition exists. Our work demonstrates that definition choice critically impacts LLM reliability in deployment, providing both theoretical foundations and methodological guidance for developing interpretable, robust hate speech detection systems.
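
To make the modular idea concrete, here is a minimal sketch of how definition variants might be assembled from composable elements. The element names and wording below are illustrative placeholders, not the paper's actual 14 Conceptual Elements.

```python
# Hypothetical conceptual elements; the paper's real taxonomy has 14
# elements derived from a systematic literature review.
CONCEPTUAL_ELEMENTS = {
    "target": "directed at an individual or a group",
    "protected_attribute": "based on attributes such as race, religion, or gender",
    "consequence": "and likely to incite harm or discrimination",
}

def compose_definition(element_keys):
    """Build one definition variant from a chosen subset of elements."""
    parts = [CONCEPTUAL_ELEMENTS[k] for k in element_keys]
    return "Hate speech is language " + " ".join(parts) + "."

# Two variants at different levels of specificity:
broad = compose_definition(["target"])
specific = compose_definition(["target", "protected_attribute", "consequence"])
print(broad)
print(specific)
```

Varying which elements are included, as above, is what produces the definition variants whose granularity the study correlates with classification performance.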

📝 Abstract
Detecting harmful content is a crucial task in the landscape of NLP applications for Social Good, with hate speech being one of its most dangerous forms. But what do we mean by hate speech, how can we define it, and how does prompting different definitions of hate speech affect model performance? The contribution of this work is twofold. At the theoretical level, we address the ambiguity surrounding hate speech by collecting and analyzing existing definitions from the literature. We organize these definitions into a taxonomy of 14 Conceptual Elements: building blocks that capture different aspects of hate speech definitions, such as references to the target of hate (individuals or groups) or to its potential consequences. At the experimental level, we employ the collection of definitions in a systematic zero-shot evaluation of three LLMs on three hate speech datasets representing different types of data (synthetic, human-in-the-loop, and real-world). We find that choosing different definitions, i.e., definitions with a different degree of specificity in terms of encoded elements, impacts model performance, but this effect is not consistent across all architectures.
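
The zero-shot protocol described in the abstract can be sketched as a simple loop: each definition variant is slotted into a prompt, and model predictions are scored against gold labels. The prompt template, the `llm_classify` stub, and the macro-F1 metric are assumptions for illustration; the abstract does not specify these details.

```python
from sklearn.metrics import f1_score

# Hypothetical prompt template: the definition variant is injected
# ahead of the text to be classified.
PROMPT = (
    "Definition: {definition}\n"
    "Text: {text}\n"
    "Does the text contain hate speech according to the definition? "
    "Answer 'yes' or 'no'."
)

def llm_classify(prompt: str) -> int:
    """Hypothetical model call; replace with a real LLM client that
    maps the model's 'yes'/'no' answer to 1/0."""
    raise NotImplementedError

def evaluate(definition, dataset):
    """dataset: iterable of (text, gold_label) pairs with labels in {0, 1}."""
    gold, pred = [], []
    for text, label in dataset:
        gold.append(label)
        pred.append(llm_classify(PROMPT.format(definition=definition, text=text)))
    return f1_score(gold, pred, average="macro")
```

Running `evaluate` over every (definition, model, dataset) combination yields the performance grid from which the paper's conclusion about inconsistent effects across architectures is drawn.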
Problem

Research questions and friction points this paper is trying to address.

Addressing ambiguity in defining hate speech for NLP
Evaluating impact of hate speech definitions on LLM performance
Developing taxonomy to classify hate speech conceptual elements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular taxonomy for hate speech definitions
Zero-shot LLM classification performance evaluation
Impact of definition specificity on models