🤖 AI Summary
This work addresses critical gaps in automated counter-narrative (CN) generation against online hate speech, specifically insufficient emotional appropriateness, accessibility, and ethical robustness. We introduce the first persona-guided, four-dimensional evaluation framework, assessing persona consistency, readability, affective tone, and ethical robustness. Using the MT-Conan and HatEval benchmarks, we systematically evaluate GPT-4o-Mini, CommandR-7B, and LLaMA 3.1-70B. Results reveal that current LLMs produce overly verbose outputs that demand college-level literacy; affect-guided prompting improves empathy and readability but concurrently increases safety risks, exposing a fundamental trade-off. Our key contribution is a human-centered, multi-dimensional evaluation paradigm grounded in real-user accessibility and socio-emotional alignment, providing both methodological foundations and empirical evidence for developing safe, inclusive, and effective counter-narrative technologies. A sketch of what affect-guided generation might look like follows below.
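To make the prompting setup concrete, here is a minimal sketch of an affect-guided generation call against GPT-4o-Mini. The prompt wording, the `AFFECT_GUIDED_TEMPLATE` name, and the `generate_counter_narrative` helper are illustrative assumptions, not the paper's actual prompts; only the model name and the general strategy (instructing the model toward empathetic, plain-language responses) come from the work described above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical affect-guided instruction; the paper's exact prompt text is not reproduced here.
AFFECT_GUIDED_TEMPLATE = (
    "You are responding to a hateful online message. Write a short counter-narrative that "
    "is empathetic and calm, avoids insults or sarcasm, and uses plain language that an "
    "average reader can follow.\n\n"
    "Hate speech: {hate_speech}\n"
    "Counter-narrative:"
)

def generate_counter_narrative(hate_speech: str) -> str:
    """Generate one affect-guided counter-narrative with GPT-4o-Mini."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": AFFECT_GUIDED_TEMPLATE.format(hate_speech=hate_speech)}
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
```

The same template could be swapped for a neutral or persona-framed instruction to reproduce the contrast between prompting strategies that the evaluation compares.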
📝 Abstract
Automated counter-narratives (CNs) offer a promising strategy for mitigating online hate speech, yet concerns about their affective tone, accessibility, and ethical risks remain. We propose a framework for evaluating Large Language Model (LLM)-generated CNs across four dimensions: persona framing, verbosity and readability, affective tone, and ethical robustness. Using GPT-4o-Mini, Cohere's CommandR-7B, and Meta's LLaMA 3.1-70B, we assess three prompting strategies on the MT-Conan and HatEval datasets. Our findings reveal that LLM-generated CNs are often verbose and written at a college reading level, limiting their accessibility. While emotionally guided prompts yield more empathetic and readable responses, concerns about their safety and effectiveness remain.
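As a rough illustration of the readability dimension, the sketch below scores a generated CN with the standard Flesch-Kincaid grade-level formula, 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. The syllable heuristic, the threshold of grade 12, and the example sentence are assumptions for illustration; the paper's actual readability metrics and thresholds may differ.

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; a dictionary-based counter would be more accurate."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

# Example: flag counter-narratives that exceed a high-school reading level (grade > 12).
cn = ("Dehumanizing language erodes the mutual respect that pluralistic societies depend on; "
      "consider instead the documented contributions of immigrant communities.")
grade = flesch_kincaid_grade(cn)
print(f"FK grade: {grade:.1f}", "(college-level)" if grade > 12 else "(accessible)")
```

Scores above roughly grade 12 correspond to the college-level literacy demands that the findings identify as an accessibility barrier.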