Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the high annotation cost, subjectivity, and low inter-annotator agreement that hinder the construction of high-quality hate speech datasets. Using ten theoretically grounded subjective attributes (such as dehumanization, violence, and sentiment), the authors systematically evaluate attribute-level alignment between large language models (the Llama 3.1 and Qwen 2.5 series) and human judgments. The models align well with human annotations on behaviorally explicit dimensions but are systematically inverted on evaluative dimensions. Building on this finding, the work proposes a confidence-weighted attribute fusion approach that integrates role-conditioned prompting with Ridge regression to reconstruct continuous hate speech scores. Evaluated on the Measuring Hate Speech corpus, the method achieves an R² of up to 0.71, outperforming end-to-end prompting baselines and offering an interpretable, cost-effective approach to hate speech assessment.

📝 Abstract

Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attributes, such as dehumanization, violence, and sentiment, evaluating both small and large variants of Llama 3.1 and Qwen 2.5. Our analysis reveals a consistent split across all models: behaviorally explicit dimensions (insult, humiliate, attack-defend) correlate strongly with human annotations, while evaluative dimensions (respect, sentiment, hate speech) are systematically inverted. Demographic persona conditioning reduces model confidence without improving alignment. Building on these insights, we propose combining attribute-level LLM predictions via a confidence-weighted Ridge regression to reconstruct the continuous hate speech scores of the Measuring Hate Speech corpus. This approach achieves an $R^2$ of up to 0.71 and outperforms direct prompting baselines, demonstrating that structured attribute decomposition recovers a richer and more human-aligned signal than end-to-end label prediction alone.
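
To make the fusion step concrete, the following is a minimal sketch of how attribute-level predictions might be combined with a ridge regression. It is not the authors' implementation: the abstract does not specify how confidence enters the model, so scaling each attribute score by its confidence is an assumption made here for illustration. The data is synthetic, and the function names (`confidence_weighted_features`, `fit_ridge`, `make_synthetic`) and the regularization strength `alpha` are hypothetical.

```python
import numpy as np


def confidence_weighted_features(preds, conf):
    """Scale each attribute score by the model's confidence in it.

    ASSUMPTION: an elementwise product is one plausible weighting scheme;
    the source does not state which scheme the authors use.
    """
    return preds * conf


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weight vector w.
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def make_synthetic(n=500, k=10, seed=0):
    """Synthetic stand-in for k attribute predictions per text (not real data)."""
    rng = np.random.default_rng(seed)
    preds = rng.normal(size=(n, k))             # attribute scores from an LLM
    conf = rng.uniform(0.2, 1.0, size=(n, k))   # per-attribute confidence
    true_w = rng.normal(size=k)                 # some weights may be negative
    y = (preds * conf) @ true_w + 0.1 * rng.normal(size=n)
    return preds, conf, y


preds, conf, y = make_synthetic()
X = confidence_weighted_features(preds, conf)

# Hold out the last 100 examples to score the fitted model.
train, test = slice(0, 400), slice(400, 500)
w, b = fit_ridge(X[train], y[train], alpha=1.0)
r2 = r2_score(y[test], X[test] @ w + b)
print(f"held-out R^2 on synthetic data: {r2:.3f}")
```

One property of a linear fusion worth noting: because the learned weights can be negative, an attribute whose predictions are systematically inverted relative to human labels can still contribute useful signal once its sign is flipped by the regression.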

Problem

Research questions and friction points this paper is trying to address.

hate speech annotation

annotator disagreement

subjective attributes

human alignment

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

attribute-based diagnosis

LLM alignment

hate speech annotation