Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates, for the first time, how MBTI-based personality prompts systematically affect hate speech detection performance in large language models (LLMs), a critical yet underexplored source of bias in trustworthy AI. Method: We inject personality-aware prompts into four open-source LLMs (e.g., Llama, Qwen) and combine logit-level bias analysis, cross-model comparative evaluation, and validation against human annotations. Contribution/Results: Personality prompting induces substantial systematic bias: cross-personality F1 scores fluctuate by 12.3% on average (up to 28.7%), 32% of instances undergo label reversal, and logit distributions exhibit reproducible preferences along MBTI dimensions. These findings expose structural vulnerabilities in LLMs' consistency, factual grounding, and cross-persona discrimination. We propose "personality-controllable annotation" as a new paradigm, offering both theoretical foundations and empirical evidence for robustness-aware prompt engineering and trustworthy AI evaluation.
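As a rough illustration of the setup described above (persona-prompt injection plus a logit-level readout), the sketch below conditions a causal LM on an MBTI persona and reads the Yes/No answer logits for a single message. The model name, persona wording, prompt template, and single-token label assumption are all illustrative guesses, not the paper's exact configuration.

```python
# Minimal sketch of persona-conditioned hate speech classification with a
# logit-level readout. Model name, persona wording, and label tokens are
# assumptions for illustration, not the paper's released setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

# One hypothetical MBTI persona; the study sweeps all 16 types.
PERSONA = "You are an INTJ: introverted, intuitive, thinking, judging."

def label_logits(text: str, persona: str = PERSONA) -> dict:
    """Return the next-token logits for the 'Yes'/'No' answer candidates."""
    prompt = (
        f"{persona}\n"
        f"Is the following message hate speech? Answer Yes or No.\n"
        f"Message: {text}\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits over the next token
    # Assumes ' Yes' and ' No' each tokenize to a single token; verify this
    # for your tokenizer before trusting the readout.
    yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" No", add_special_tokens=False)[0]
    return {"yes": logits[yes_id].item(), "no": logits[no_id].item()}
```

Repeating this readout under each of the 16 personas yields the per-persona logit distributions that the bias analysis compares.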

📝 Abstract
Hate speech detection is a socially sensitive and inherently subjective task, with judgments often varying based on personal traits. While prior work has examined how socio-demographic factors influence annotation, the impact of personality traits on Large Language Models (LLMs) remains largely unexplored. In this paper, we present the first comprehensive study on the role of persona prompts in hate speech classification, focusing on MBTI-based traits. A human annotation survey confirms that MBTI dimensions significantly affect labeling behavior. Extending this to LLMs, we prompt four open-source models with MBTI personas and evaluate their outputs across three hate speech datasets. Our analysis uncovers substantial persona-driven variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. These findings highlight the need to carefully define persona prompts in LLM-based annotation workflows, with implications for fairness and alignment with human values.
Problem

Research questions and friction points this paper is trying to address.

Investigates how personas affect hate speech detection by LLMs
Examines MBTI traits' impact on LLM classification consistency
Highlights persona-driven biases in hate speech annotation workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using MBTI personas for hate speech detection
Evaluating LLM outputs across multiple datasets
Analyzing persona-driven biases in model behavior (see the measurement sketch below)
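To make the bias analysis concrete, here is a minimal sketch of two quantities the summary reports: the spread in per-persona F1 against gold labels, and the fraction of instances whose predicted label flips across personas. The data layout (one gold-aligned prediction list per persona) is an illustrative assumption, not the paper's released code.

```python
# Sketch of the cross-persona comparison: per-persona F1 and the share of
# instances whose label is reversed by at least one persona.
from sklearn.metrics import f1_score

def cross_persona_report(preds_by_persona: dict[str, list[int]],
                         gold: list[int]) -> dict:
    # F1 of each persona's predictions against the gold labels.
    f1s = {p: f1_score(gold, preds) for p, preds in preds_by_persona.items()}
    spread = max(f1s.values()) - min(f1s.values())  # cross-persona fluctuation
    # An instance counts as "reversed" if any two personas disagree on it.
    per_instance = zip(*preds_by_persona.values())
    reversal_rate = sum(len(set(labels)) > 1 for labels in per_instance) / len(gold)
    return {"f1_per_persona": f1s, "f1_spread": spread,
            "label_reversal_rate": reversal_rate}
```

On a setup comparable to the paper's, one would expect numbers in the ballpark of the reported 12.3% average F1 fluctuation and 32% label-reversal rate.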