SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that current content moderation systems struggle to detect "soft hate speech" (implicitly hostile content conveyed through inferential framing), and that no systematic benchmark measures this failure mode. To bridge this gap, the authors propose SoftHateBench, a framework that integrates the Argumentum Model of Topics (AMT) with Relevance Theory (RT) to controllably rewrite explicit hate speech into logically coherent soft-hate utterances that read as neutral on the surface while preserving the hostile stance. The benchmark spans seven sociocultural domains and 28 target groups, yielding 4,745 generated samples. Experiments reveal a significant performance drop across mainstream moderation models, including encoder-based detectors, general-purpose large language models, and safety-aligned systems, highlighting their vulnerability to inference-driven implicit hate.

📝 Abstract
Online hate on social media ranges from overt slurs and threats ("hard hate speech") to "soft hate speech": discourse that appears reasonable on the surface but uses framing and value-based arguments to steer audiences toward blaming or excluding a target group. We hypothesize that current moderation systems, largely optimized for surface toxicity cues, are not robust to this reasoning-driven hostility, yet existing benchmarks do not measure this gap systematically. We introduce SoftHateBench, a generative benchmark that produces soft-hate variants while preserving the underlying hostile standpoint. To generate soft hate, we integrate the Argumentum Model of Topics (AMT) and Relevance Theory (RT) in a unified framework: AMT provides the backbone argument structure for rewriting an explicit hateful standpoint into a seemingly neutral discussion while preserving the stance, and RT guides generation to keep the AMT chain logically coherent. The benchmark spans 7 sociocultural domains and 28 target groups, comprising 4,745 soft-hate instances. Evaluations across encoder-based detectors, general-purpose LLMs, and safety models show a consistent drop from hard to soft tiers: systems that detect explicit hostility often fail when the same stance is conveyed through subtle, reasoning-based language. Disclaimer: contains offensive examples used solely for research.
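The hard-to-soft evaluation protocol described in the abstract can be sketched as a simple comparison of detection rates across the two tiers. The detector and example texts below are toy stand-ins invented for illustration, not the paper's models or benchmark data; the point is only the shape of the measurement, a per-tier recall and the resulting drop.

```python
def toy_detector(text: str) -> bool:
    """Toy surface-cue detector: flags text containing explicit hostile keywords.
    Stands in for an encoder-based moderation model (hypothetical)."""
    explicit_cues = {"hate", "vermin", "get rid of"}
    lowered = text.lower()
    return any(cue in lowered for cue in explicit_cues)

def detection_rate(samples, detector) -> float:
    """Fraction of hostile samples the detector flags (recall on the hostile class)."""
    flagged = sum(1 for s in samples if detector(s))
    return flagged / len(samples)

# Illustrative paired tiers: the same hostile standpoint stated explicitly (hard)
# vs. rewritten as a surface-neutral, value-based argument (soft).
hard_tier = ["I hate group X, they are vermin."]
soft_tier = ["Given scarce resources, is it not reasonable to prioritize "
             "our own community before accommodating group X?"]

hard_rate = detection_rate(hard_tier, toy_detector)
soft_rate = detection_rate(soft_tier, toy_detector)
drop = hard_rate - soft_rate  # performance drop from the hard to the soft tier
print(f"hard={hard_rate:.2f} soft={soft_rate:.2f} drop={drop:.2f}")
```

A surface-cue detector catches the hard phrasing but misses the soft rewrite entirely, which is exactly the gap the benchmark is built to quantify at scale.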
Problem

Research questions and friction points this paper is trying to address.

soft hate speech
content moderation
reasoning-driven hostility
policy compliance
hate speech detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

SoftHateBench
soft hate speech
Argumentum Model of Topics
Relevance Theory
reasoning-driven hostility