Who Speaks Matters: Analysing the Influence of the Speaker's Ethnicity on Hate Classification

📅 2024-10-27

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates the robustness and bias of large language models (LLMs) in hate speech classification with respect to ethnic markers. To quantify ethnic sensitivity, the authors systematically inject both explicit (e.g., identity statements) and implicit (e.g., African American Language features) ethnic markers into inputs and measure output flip rates, i.e., how often classification labels change under controlled perturbations. Experiments involve four state-of-the-art LLMs and combine controlled textual perturbation, consistency analysis, and cross-ethnic comparative evaluation. Key findings include: (1) implicit dialectal markers induce significantly higher flip rates than explicit ones; (2) flip rates vary across ethnic groups by up to 2.3×, revealing ethnic heterogeneity bias; and (3) model scale positively correlates with robustness, with larger models reducing average flip rates by 37%. These findings underscore the critical role of implicit linguistic features in fairness assessment and establish a new empirical benchmark for bias detection and robustness enhancement in LLMs.

📝 Abstract

Large Language Models (LLMs) offer a lucrative promise for scalable content moderation, including hate speech detection. However, they are also known to be brittle and biased against marginalised communities and dialects. Their application to high-stakes tasks like hate speech detection therefore needs to be critically scrutinized. In this work, we investigate the robustness of hate speech classification using LLMs, particularly when explicit and implicit markers of the speaker's ethnicity are injected into the input. For the explicit markers, we inject a phrase that mentions the speaker's identity. For the implicit markers, we inject dialectal features. By analysing how frequently model outputs flip in the presence of these markers, we reveal varying degrees of brittleness across 4 popular LLMs and 5 ethnicities. We find that the presence of implicit dialect markers in inputs causes model outputs to flip more than the presence of explicit markers. Further, the percentage of flips varies across ethnicities. Finally, we find that larger models are more robust. Our findings indicate the need for exercising caution in deploying LLMs for high-stakes tasks like hate speech detection.
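
The flip measurement described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `inject_explicit_marker`, `flip_rate`, and `toy_classifier`, the marker template, and the example texts are all hypothetical. In the paper the classifier is an LLM; here a deliberately brittle keyword rule stands in for it.

```python
from typing import Callable, Sequence


def inject_explicit_marker(text: str, ethnicity: str) -> str:
    """Prefix the input with a phrase naming the speaker's identity.

    The template is an assumption for illustration; the exact wording
    used in the paper is not given in this summary.
    """
    return f"As a {ethnicity} person, I say: {text}"


def flip_rate(
    classify: Callable[[str], str],
    originals: Sequence[str],
    perturbed: Sequence[str],
) -> float:
    """Fraction of inputs whose predicted label changes after perturbation."""
    if len(originals) != len(perturbed):
        raise ValueError("originals and perturbed must be aligned")
    if not originals:
        return 0.0
    flips = sum(classify(o) != classify(p) for o, p in zip(originals, perturbed))
    return flips / len(originals)


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for an LLM call; brittle on purpose."""
    lowered = text.lower()
    if "hate" in lowered and "as a" not in lowered:
        return "hateful"
    return "not hateful"


# Made-up inputs; "GroupA" is a placeholder, not a group from the paper.
texts = ["I hate those people", "Lovely weather today"]
marked = [inject_explicit_marker(t, "GroupA") for t in texts]
rate = flip_rate(toy_classifier, texts, marked)
print(rate)
```

Swapping `toy_classifier` for a function that queries a real model, and `inject_explicit_marker` for a dialect rewriting step, would give the implicit-marker variant of the same measurement.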

Problem

Research questions and friction points this paper is trying to address.

Investigating how speaker ethnicity markers affect hate speech classification robustness

Analyzing model output flips caused by explicit and implicit ethnic markers

Evaluating LLM bias variations across different ethnic identities and model sizes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Injected explicit and implicit ethnic markers into inputs

Analyzed output flips across multiple language models

Compared robustness variations across different ethnic groups
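
One way to organise the comparison across ethnic groups and marker types is to tabulate the percentage of flipped labels per (group, marker type) pair. The sketch below assumes a hypothetical record format and uses made-up records; `flip_percentages`, the group names, and the labels are illustrative only and do not reproduce results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (group, marker_type, label_before, label_after).
Record = Tuple[str, str, str, str]


def flip_percentages(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Percentage of flipped labels for each (group, marker_type) pair."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    flips: Dict[Tuple[str, str], int] = defaultdict(int)
    for group, marker_type, before, after in records:
        key = (group, marker_type)
        totals[key] += 1
        if before != after:
            flips[key] += 1
    return {key: 100.0 * flips[key] / totals[key] for key in totals}


# Made-up records for illustration only; not data from the paper.
records = [
    ("group_a", "explicit", "hateful", "hateful"),
    ("group_a", "explicit", "hateful", "not hateful"),
    ("group_a", "implicit", "hateful", "not hateful"),
    ("group_a", "implicit", "not hateful", "hateful"),
    ("group_b", "explicit", "hateful", "hateful"),
    ("group_b", "implicit", "hateful", "not hateful"),
]
table = flip_percentages(records)
for (group, marker_type), pct in sorted(table.items()):
    print(f"{group:8s} {marker_type:8s} {pct:5.1f}%")
```

Running the same tabulation once per model would allow the model-size comparison mentioned above to be read off the resulting tables.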

🔎 Similar Papers

From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets