🤖 AI Summary
Social stigma toward marginalized groups may systematically bias large language model (LLM) outputs, yet the psychological mechanisms linking stigma attributes to LLM bias remain uncharacterized.
Method: We systematically analyze associations between the six core dimensions of social stigma psychology (aesthetics, concealability, course, disruptiveness, origin, and peril/dangerousness) and LLM bias across 93 stigmatized groups, using the SocialStigmaQA benchmark and human-annotated ground truth. We evaluate three open-weight models (Granite-3.0-8B, Llama-3.1-8B, Mistral-7B) and their corresponding guardrail models (e.g., Granite Guardian).
Contribution/Results: We identify a strong positive correlation between stigma attributes, particularly dangerousness, and the magnitude of LLM bias: bias rates reach 60% for high-dangerousness groups (e.g., gang members, people living with HIV), versus only 11% for sociodemographic groups. Guardrail models reduce bias by only 7.2% on average (at most 10.4%), fail to detect biased intent, leave the core stigma features that drive bias unchanged, and miss more than 50% of biased prompts. This work establishes a psychological foundation and empirical evidence for modeling and governing LLM bias at its root.
📝 Abstract
Large language models (LLMs) have been shown to exhibit social bias; however, bias towards non-protected stigmatized identities remains understudied. Furthermore, which social features of stigmas are associated with bias in LLM outputs remains unknown. The psychology literature has shown that stigmas share six social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate whether human and LLM ratings of these stigma features, along with prompt style and type of stigma, affect bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite-3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark comprising 37 social scenarios about stigmatized identities; for example, deciding whether to recommend a member of such a group for an internship. We find that stigmas rated by humans as highly perilous (e.g., being a gang member or having HIV) yield the most biased outputs from SocialStigmaQA prompts (60% of outputs across all models), while sociodemographic stigmas (e.g., being Asian-American or of old age) yield the fewest biased outputs (11%). We test whether the number of biased outputs can be reduced by using guardrail models (models designed to identify harmful input), applying each LLM's respective guardrail (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API). We find that bias decreases significantly, by 10.4%, 1.4%, and 7.8%, respectively. However, we show that the features with a significant effect on bias remain unchanged post-mitigation, and that guardrail models often fail to recognize the biased intent of prompts. This work has implications for using LLMs in scenarios involving stigmatized groups, and we suggest future work towards improving guardrail models for bias mitigation.
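The bias-rate and guardrail-reduction metrics described above can be sketched as follows. This is a minimal illustration only: the function names and the toy labels are hypothetical, and the actual study derives biased/unbiased labels from SocialStigmaQA's answer templates and human annotation.

```python
# Hypothetical sketch of the bias-rate metric; toy data, not study results.

def bias_rate(outputs):
    """Fraction of model outputs labeled as biased (1 = biased, 0 = not)."""
    return sum(outputs) / len(outputs) if outputs else 0.0

def guardrail_reduction(before, after):
    """Absolute percentage-point drop in bias rate after guardrail filtering."""
    return (bias_rate(before) - bias_rate(after)) * 100

# Toy labels for two stigma categories (illustrative only)
high_peril = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # e.g., gang member, having HIV
sociodemo  = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]   # e.g., Asian-American, old age

print(f"high-peril bias rate: {bias_rate(high_peril):.0%}")
print(f"sociodemographic bias rate: {bias_rate(sociodemo):.0%}")
```

In the study's setup, the "after" outputs would be those remaining once each guardrail model has flagged and filtered prompts it deems harmful, so the reduction measures how much bias the guardrail actually removes.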