ChatGPT Doesn’t Trust Chargers Fans: Guardrail Sensitivity in Context

📅 2024-07-09

🏛️ Conference on Empirical Methods in Natural Language Processing

📈 Citations: 5

✨ Influential: 0

career value

187K/year

🤖 AI Summary

This study investigates whether large language model (LLM) safety guardrails—exemplified by GPT-3.5—exhibit systematic refusal bias conditioned on implicit user identity cues (e.g., age, gender, ethnicity, sports preferences) and whether such guardrails implicitly infer political stance to modulate response sensitivity. Using controlled persona generation, counterfactual prompting, statistical significance testing, and bias attribution analysis, we uncover three key findings: (1) ostensibly neutral identity signals—such as NFL team preference—reliably trigger political inference, yielding statistically significant refusal-rate disparities; (2) guardrails exhibit “sycophantic” bias—systematically refusing requests aligned with politically opposed stances; and (3) refusal behavior varies significantly across multiple intersecting identity dimensions. These results demonstrate that current safety mechanisms possess latent identity sensitivity and context-driven political reasoning capabilities—challenging assumptions of neutrality and raising critical concerns regarding fairness, transparency, and interpretability in AI safety architectures.

Technology Category

Application Category

📝 Abstract

While the biases of language models in production are extensively documented, the biases of their guardrails have been neglected. This paper studies how contextual information about the user influences the likelihood of an LLM to refuse to execute a request. By generating user biographies that offer ideological and demographic information, we find a number of biases in guardrail sensitivity on GPT-3.5. Younger, female, and Asian-American personas are more likely to trigger a refusal guardrail when requesting censored or illegal information. Guardrails are also sycophantic, refusing to comply with requests for a political position the user is likely to disagree with. We find that certain identity groups and seemingly innocuous information, e.g., sports fandom, can elicit changes in guardrail sensitivity similar to direct statements of political ideology. For each demographic category and even for American football team fandom, we find that ChatGPT appears to infer a likely political ideology and modify guardrail behavior accordingly.

Problem

Research questions and friction points this paper is trying to address.

Biases in LLM guardrail sensitivity to user context

How demographic and ideological information affects refusal rates

Guardrails infer political ideology from seemingly innocuous details

Innovation

Methods, ideas, or system contributions that make the work stand out.

User biographies test guardrail bias

Demographic traits influence refusal likelihood

Sports fandom triggers political inference adjustments

🔎 Similar Papers

Can tweets predict article retractions? A comparison between human and LLM labelling