🤖 AI Summary
This study addresses the challenge of detecting dynamically evolving antisemitic content on social media. We systematically evaluate the detection capabilities of eight open-source large language models (LLMs). To enable policy-aligned, controllable reasoning, we propose Guided Chain-of-Thought (Guided-CoT), a prompting method that explicitly integrates anti-discrimination policy definitions into the inference process. Compared with standard CoT and in-context learning, Guided-CoT significantly improves performance across all evaluated models, even enabling Llama 3.1 70B to surpass fine-tuned GPT-3.5. We further introduce two quantitative metrics, a semantic deviation score and a logical contradiction rate, to rigorously assess model interpretability, reliability, and deployment feasibility. Our results reveal fundamental disparities among LLMs along these dimensions. The work establishes a reproducible, auditable methodological framework for fine-grained, LLM-driven governance of hate speech.
📝 Abstract
Detecting hateful content is a challenging and important problem. Automated tools, such as machine-learning models, can help, but they require continuous retraining to keep pace with the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' ability to detect antisemitic content, specifically leveraging an in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy effectively, improving performance across all evaluated models regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. We also examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight differences across LLMs in utility, explainability, and reliability.
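To make the core idea concrete, the prompt construction behind a Guided-CoT-style approach can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_guided_cot_prompt`, the placeholder policy text, and the specific reasoning steps are all hypothetical, not the paper's actual template.

```python
# Hypothetical sketch of a Guided-CoT-style prompt builder.
# The policy text and step wording below are illustrative placeholders,
# NOT the definitions or template used in the study.

POLICY_DEFINITION = (
    "Placeholder policy definition: content that expresses hostility toward "
    "Jewish people as a group falls under this policy."
)

GUIDED_STEPS = [
    "Identify who or what the post is targeting.",
    "Check whether the post matches the policy definition above.",
    "State a decision (antisemitic / not antisemitic) and cite the relevant part of the policy.",
]

def build_guided_cot_prompt(post: str) -> str:
    """Embed the in-context policy definition and explicit reasoning steps
    into a single prompt, so the model's chain of thought is guided by the
    policy rather than left fully open-ended."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(GUIDED_STEPS, start=1))
    return (
        f"Policy definition:\n{POLICY_DEFINITION}\n\n"
        f"Follow these steps:\n{steps}\n\n"
        f"Post: {post}\n"
        "Answer with a label and a brief rationale:"
    )

# Example usage: the resulting string would be sent to an LLM for inference.
print(build_guided_cot_prompt("example social media post"))
```

The design point is that the policy definition travels with every query, so detection behavior can be audited and updated by editing the policy text rather than retraining a model.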