🤖 AI Summary
This study addresses a limitation of current hate speech detection systems: most rely on opaque content removal, which offers little interpretability and risks infringing on freedom of expression. The authors propose a hybrid approach that combines large language models (LLMs) with three newly created, curated vocabularies, one each for English, French, and Greek. Two complementary pipelines handle different kinds of content: one detects and disambiguates identity-denigrating terms using the vocabularies, and the other uses LLMs as context-aware evaluators of group-targeted content. The outputs of both pipelines are fused into grounded, human-interpretable explanations of why content was flagged. In human evaluation, the hybrid approach is accurate and produces high-quality explanations, outperforming LLM-only baselines.
📝 Abstract
Hate, derogatory, and offensive speech remains a persistent challenge on online platforms and in public discourse. While automated detection systems are widely used, most focus on censorship or removal, raising concerns about transparency and freedom of expression, and limiting opportunities to explain why content is harmful. To address these issues, explanatory approaches have emerged as a promising solution, aiming to make hate speech detection more transparent, accountable, and informative. In this paper, we present a hybrid approach that combines Large Language Models (LLMs) with three newly created and curated vocabularies to detect and explain hate speech in English, French, and Greek. Our system captures both inherently derogatory expressions tied to identity characteristics and direct group-targeted content through two complementary pipelines: one that detects and disambiguates problematic terms using the curated vocabularies, and one that leverages LLMs as context-aware evaluators of group-targeting content. The outputs are fused into grounded explanations that clarify why content is flagged. Human evaluation shows that our hybrid approach is accurate, with high-quality explanations, outperforming LLM-only baselines.
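To make the two-pipeline design concrete, the following is a minimal sketch of how vocabulary matching, term disambiguation, LLM-based group-targeting evaluation, and fusion might fit together. It is not the authors' implementation. Every name here (`VocabEntry`, `detect_terms`, `fuse`, `explain`, the placeholder terms) is hypothetical, the vocabulary schema is assumed, and the LLM is represented only by plain callables that a real system would back with a model call.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class VocabEntry:
    """One curated vocabulary entry (assumed schema, not the paper's)."""
    term: str
    language: str            # e.g. "en", "fr", "el"
    target_identity: str     # identity characteristic the term denigrates
    ambiguous: bool = False  # True if the term also has a benign sense


@dataclass
class TermHit:
    entry: VocabEntry
    derogatory_in_context: bool


@dataclass
class GroupVerdict:
    """Output of the (hypothetical) LLM group-targeting evaluator."""
    targets_group: bool
    group: Optional[str]
    rationale: str


@dataclass
class Explanation:
    flagged: bool
    reasons: List[str] = field(default_factory=list)


# Placeholder entries standing in for the curated vocabularies.
TOY_VOCAB = [
    VocabEntry("slur_a", "en", "ethnicity"),
    VocabEntry("slur_b", "en", "religion", ambiguous=True),
]

Disambiguator = Callable[[str, VocabEntry], bool]
GroupEvaluator = Callable[[str], GroupVerdict]


def detect_terms(text: str, vocab: List[VocabEntry],
                 disambiguate: Disambiguator) -> List[TermHit]:
    """Pipeline 1: match vocabulary terms, then disambiguate the ambiguous ones."""
    hits = []
    for entry in vocab:
        pattern = rf"\b{re.escape(entry.term)}\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            # Unambiguous terms count as derogatory without a context check.
            derogatory = disambiguate(text, entry) if entry.ambiguous else True
            hits.append(TermHit(entry, derogatory))
    return hits


def fuse(hits: List[TermHit], verdict: GroupVerdict) -> Explanation:
    """Merge both pipelines' findings into one human-readable explanation."""
    reasons = [
        f"Contains '{h.entry.term}', a derogatory term tied to {h.entry.target_identity}."
        for h in hits if h.derogatory_in_context
    ]
    if verdict.targets_group:
        reasons.append(f"Targets the group '{verdict.group}': {verdict.rationale}")
    return Explanation(flagged=bool(reasons), reasons=reasons)


def explain(text: str, vocab: List[VocabEntry], disambiguate: Disambiguator,
            evaluate_group: GroupEvaluator) -> Explanation:
    """Run pipeline 1 (vocabulary) and pipeline 2 (LLM evaluator), then fuse."""
    return fuse(detect_terms(text, vocab, disambiguate), evaluate_group(text))
```

In this sketch, content is flagged only when at least one pipeline contributes a reason, so every flag carries an explanation. An ambiguous term that the disambiguator judges benign contributes nothing. How the actual system weighs or combines the two pipelines is not specified in the abstract.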