Investigating Counterfactual Unfairness in LLMs towards Identities through Humor

📅 2026-04-20

🤖 AI Summary
This study examines whether large language models, having internalized social biases from training data, treat humor differently depending on the identities involved. The authors propose an interpretable bias measurement framework built on counterfactual identity swaps: the speaker and the addressee are exchanged while the rest of the context is held constant, and the resulting changes in model behavior are measured across three tasks (humor generation refusal, speaker intention inference, and relational/societal impact prediction). Experiments across multiple state-of-the-art models show that jokes attributed to privileged identities are refused up to 67.5% more often, judged malicious 64.7% more frequently, and assigned social harm scores up to 1.5 points higher (on a 5-point scale) than those attributed to marginalized identities. These findings point to a relational pattern of unfairness in humorous contexts, in which sensitivity and stereotyping coexist.
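
The identity-swapping setup described above can be illustrated with a small sketch. This is not the authors' code: the template wording, the placeholder labels `group_A`/`group_B`/`group_C`, and the helper names `build_swap_prompts` and `counterfactual_pairs` are all assumptions made for illustration. The only idea taken from the summary is that the joke and context stay fixed while speaker and addressee are exchanged.

```python
from itertools import permutations


def build_swap_prompts(template, identities):
    """Fill one fixed template for every ordered (speaker, listener) pair.

    The template text is identical across prompts, so only the identity
    assignment varies. Template and labels are hypothetical.
    """
    return {
        (speaker, listener): template.format(speaker=speaker, listener=listener)
        for speaker, listener in permutations(identities, 2)
    }


def counterfactual_pairs(prompts):
    """Pair each prompt with its role-swapped counterpart, once per pair."""
    pairs = []
    for (speaker, listener), text in prompts.items():
        # Keep only one direction per unordered pair to avoid duplicates.
        if speaker < listener:
            pairs.append((text, prompts[(listener, speaker)]))
    return pairs


# Hypothetical template; [JOKE] stands in for the fixed joke text.
template = "A {speaker} speaker tells this joke to a {listener} listener: [JOKE]"
prompts = build_swap_prompts(template, ["group_A", "group_B", "group_C"])
pairs = counterfactual_pairs(prompts)
```

Each element of `pairs` holds two prompts that differ only in who speaks and who is addressed, so any difference in the model's response to the two can be attributed to the swap rather than to the content.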

📝 Abstract
Humor holds up a mirror to social perception: what we find funny often reflects who we are and how we judge others. When language models engage with humor, their reactions expose the social assumptions they have internalized from training data. In this paper, we investigate counterfactual unfairness through humor by observing how the model's responses change when we swap who speaks and who is addressed while holding other factors constant. Our framework spans three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction, covering both identity-agnostic humor and identity-specific disparagement humor. We introduce interpretable bias metrics that capture asymmetric patterns under identity swaps. Experiments across state-of-the-art models reveal consistent relational disparities: jokes told by privileged speakers are refused up to 67.5% more often, judged as malicious 64.7% more frequently, and rated up to 1.5 points higher in social harm on a 5-point scale. These patterns highlight how sensitivity and stereotyping coexist in generative models, complicating efforts toward fairness and cultural alignment.
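
The abstract does not spell out the bias metrics, so the following is only a minimal sketch of one way an asymmetry under identity swaps could be quantified for the refusal task. The record format `(speaker, listener, refused)`, the function names, and the toy data are assumptions for illustration and are not results or definitions from the paper.

```python
from collections import defaultdict


def refusal_rates(records):
    """Compute the refusal rate for each ordered (speaker, listener) pair.

    `records` is an iterable of (speaker, listener, refused) tuples, where
    `refused` is True if the model declined to produce the joke.
    """
    counts = defaultdict(lambda: [0, 0])  # [refusals, total]
    for speaker, listener, refused in records:
        counts[(speaker, listener)][0] += int(refused)
        counts[(speaker, listener)][1] += 1
    return {pair: refusals / total for pair, (refusals, total) in counts.items()}


def swap_asymmetry(rates, a, b):
    """Difference in refusal rate between a->b and the swapped b->a.

    Zero means the swap made no difference; a positive value means the
    model refuses more often when `a` is the speaker.
    """
    return rates[(a, b)] - rates[(b, a)]


# Toy data (made up): 3 of 4 refusals one way, 1 of 4 the other way.
records = (
    [("group_A", "group_B", True)] * 3 + [("group_A", "group_B", False)]
    + [("group_B", "group_A", True)] + [("group_B", "group_A", False)] * 3
)
rates = refusal_rates(records)
gap = swap_asymmetry(rates, "group_A", "group_B")  # 0.75 - 0.25 = 0.5
```

A signed difference is used here because it keeps the direction of the disparity visible; the same pattern would apply to a maliciousness judgment rate or a mean harm score in place of the refusal rate.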

Problem

Research questions and friction points this paper is trying to address.

counterfactual unfairness
large language models
humor
social identity
bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

counterfactual unfairness
humor analysis
identity bias
interpretable metrics
language model fairness