🤖 AI Summary
This study addresses the issue of false refusals in large language models (LLMs) during hate speech detoxification tasks, which often arise from over-sensitive safety mechanisms and exhibit systematic biases against groups defined by nationality, religion, or political affiliation. The authors conduct a systematic analysis to uncover the semantic toxicity and group-targeted characteristics underlying these false refusals. They propose a novel, lightweight mitigation strategy based on bidirectional English–Chinese translation. Experimental results across nine mainstream LLMs demonstrate that the approach significantly reduces false refusal rates for English inputs while preserving semantic integrity in multilingual contexts, effectively balancing safety and fairness.
📝 Abstract
While large language models (LLMs) have increasingly been applied to hate speech detoxification, such prompts often trigger safety alerts, causing LLMs to refuse the task. In this study, we systematically investigate false refusal behavior in hate speech detoxification and analyze the contextual and linguistic biases that trigger such refusals. We evaluate nine LLMs on both English and multilingual datasets; our results show that LLMs disproportionately refuse inputs with higher semantic toxicity and those targeting specific groups, particularly nationality, religion, and political ideology. Although multilingual datasets exhibit lower overall false refusal rates than English datasets, models still display systematic, language-dependent biases toward certain targets. Based on these findings, we propose a simple cross-translation strategy: translating English hate speech into Chinese, performing detoxification in Chinese, and translating the result back into English. This substantially reduces false refusals while preserving the original content, providing an effective and lightweight mitigation approach.
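The cross-translation strategy described above can be sketched as a three-step pipeline. This is a minimal illustration only: the abstract does not specify the translation or detoxification backends, so `translate` and `detoxify` below are hypothetical stand-ins (simple string stubs) for what would in practice be calls to a translation model and a detoxification-prompted LLM.

```python
def translate(text: str, source: str, target: str) -> str:
    """Hypothetical stand-in for a machine-translation call.

    A real implementation would invoke a translation model or API;
    here we just tag the text so the pipeline is runnable end to end.
    """
    return f"[{source}->{target}] {text}"


def detoxify(text: str) -> str:
    """Hypothetical stand-in for an LLM detoxification prompt.

    A real implementation would prompt an LLM to rewrite the text
    without its hateful content; here we use a toy substitution.
    """
    return text.replace("hateful", "neutral")


def cross_translation_detoxify(english_text: str) -> str:
    # Step 1: translate the English input into Chinese, which the
    # paper reports is less likely to trigger a false refusal.
    zh = translate(english_text, "en", "zh")
    # Step 2: run detoxification on the Chinese text.
    zh_detoxed = detoxify(zh)
    # Step 3: translate the detoxified output back into English.
    return translate(zh_detoxed, "zh", "en")
```

The key design point is that detoxification happens entirely in the pivot language, so the English safety filters that cause false refusals are never exposed to the raw English input.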