🤖 AI Summary
This study addresses the challenge of detecting dynamically evolving antisemitic content on social media. We systematically evaluate the detection capabilities of eight open-source large language models (LLMs). To enable policy-aligned, controllable reasoning, we propose Guided Chain-of-Thought (Guided-CoT), a prompting method that explicitly integrates anti-discrimination policy definitions into the inference process. Compared with standard CoT and in-context learning, Guided-CoT significantly improves performance across all evaluated models, even enabling Llama 3.1 70B to surpass fine-tuned GPT-3.5. We further introduce two quantitative metrics, a semantic deviation score and a logical contradiction rate, to rigorously assess model interpretability, reliability, and deployment feasibility. Our results reveal fundamental disparities among LLMs along these dimensions. The work establishes a reproducible, auditable methodological framework for fine-grained, LLM-driven governance of hate speech.
📝 Abstract
Detecting hateful content is a challenging and important problem. Automated tools, such as machine-learning models, can help, but they require continuous retraining to keep pace with the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs' ability to detect antisemitic content, specifically leveraging an in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy effectively, improving performance across all evaluated models regardless of decoding configuration, model size, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. We also examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight differences across LLMs in utility, explainability, and reliability.
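To make the core idea concrete, the prompt construction behind a Guided-CoT-style approach can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_guided_cot_prompt`, the placeholder policy text, and the specific reasoning steps are all hypothetical, not the paper's actual template.

```python
# Hypothetical sketch of a Guided-CoT-style prompt builder.
# The policy text and step wording below are illustrative placeholders,
# NOT the definitions or template used in the study.

POLICY_DEFINITION = (
    "Placeholder policy definition: content that expresses hostility toward "
    "Jewish people as a group falls under this policy."
)

GUIDED_STEPS = [
    "Identify who or what the post is targeting.",
    "Check whether the post matches the policy definition above.",
    "State a decision (antisemitic / not antisemitic) and cite the relevant part of the policy.",
]

def build_guided_cot_prompt(post: str) -> str:
    """Embed the in-context policy definition and explicit reasoning steps
    into a single prompt, so the model's chain of thought is guided by the
    policy rather than left fully open-ended."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(GUIDED_STEPS, start=1))
    return (
        f"Policy definition:\n{POLICY_DEFINITION}\n\n"
        f"Follow these steps:\n{steps}\n\n"
        f"Post: {post}\n"
        "Answer with a label and a brief rationale:"
    )

# Example usage: the resulting string would be sent to an LLM for inference.
print(build_guided_cot_prompt("example social media post"))
```

The design point is that the policy definition travels with every query, so detection behavior can be audited and updated by editing the policy text rather than retraining a model.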