Bias after Prompting: Persistent Discrimination in Large Language Models

πŸ“… 2025-09-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study investigates whether social biases embedded in pre-trained large language models (LLMs) transfer to downstream tasks under prompt adaptation, a lightweight strategy that steers a frozen model through its input prompt rather than by updating parameters. Method: cross-demographic bias correlation analysis (gender, age, religion), few-shot prompt perturbation experiments (varying sample size, stereotypical content, occupational distribution, and representational balance), and a systematic evaluation of mainstream prompt-based debiasing strategies. Contribution/Results: (1) prompted models retain much of their original bias, with moderate-to-strong positive correlation between intrinsic and post-adaptation bias; (2) existing prompt-level debiasing methods fail to consistently suppress bias transfer; (3) bias propagation is robust across tasks, models, and demographic groups. The findings invalidate the assumption, drawn from prior work on the bias transfer hypothesis, that pretraining biases do not propagate through prompting, and suggest that correcting bias in the intrinsic model itself, potentially alongside improving its reasoning ability, is needed to prevent propagation to downstream tasks.

πŸ“ Abstract
A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remains moderate to strong across demographics and tasks -- for example, gender (rho >= 0.94) in co-reference resolution, and age (rho >= 0.98) and religion (rho >= 0.69) in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution, and representational balance (rho >= 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduce bias transfer across models, tasks or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may prevent propagation of biases to downstream tasks.
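The core measurement in the abstract is a Spearman rank correlation between a model's intrinsic bias scores and its bias scores after prompt adaptation. A minimal sketch of that computation follows; the per-group bias scores are illustrative placeholders (not the paper's data), and a dependency-free Spearman implementation is used for clarity.

```python
# Sketch of the intrinsic-vs-downstream bias correlation analysis.
# The bias scores below are hypothetical placeholders, not the paper's data.

def rankdata(values):
    """Assign average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank over the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-demographic-group bias scores: intrinsic (pre-trained
# model) vs. after few-shot prompt adaptation on a downstream task.
intrinsic = [0.31, 0.12, 0.45, 0.08, 0.27]
after_prompting = [0.24, 0.15, 0.41, 0.10, 0.30]
rho = spearman_rho(intrinsic, after_prompting)
print(f"Spearman rho = {rho:.2f}")  # -> Spearman rho = 0.90
```

A high rho here means the groups most disadvantaged by the pre-trained model stay most disadvantaged after prompting, which is exactly the bias-transfer pattern the paper reports.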
Problem

Research questions and friction points this paper is trying to address.

Biases transfer from pre-trained LLMs to adapted models via prompting
Prompt-based mitigation methods fail to consistently prevent bias transfer
Debiasing strategies show inconsistent effectiveness across models and tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Studying bias transfer in causal models via prompting
Finding biases persist across demographics and tasks
Showing that prompt-based debiasing strategies have distinct strengths but that none consistently reduces bias transfer
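One family of prompt-based debiasing strategies the paper evaluates works by modifying the prompt itself, e.g. prepending a debiasing instruction to a few-shot task prompt. A minimal sketch of that idea is below; the instruction wording, the QA template, and the function name are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of a prompt-level debiasing strategy: prepend a debiasing
# instruction to a few-shot QA prompt. Wording and template are
# illustrative, not taken from the paper.

DEBIAS_PREFIX = (
    "Treat all demographic groups equally and do not rely on "
    "stereotypes when answering.\n"
)

def build_prompt(question, few_shot_examples=(), debias=False):
    """Assemble a few-shot QA prompt, optionally with a debiasing prefix."""
    parts = [DEBIAS_PREFIX] if debias else []
    for q, a in few_shot_examples:
        parts.append(f"Q: {q}\nA: {a}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

prompt = build_prompt(
    "Who stayed late at the hospital, the nurse or the doctor?",
    few_shot_examples=[("What is 2 + 2?", "4")],
    debias=True,
)
print(prompt)
```

The paper's finding is that interventions of this kind do not consistently suppress bias transfer: the intrinsic and post-prompting bias scores remain strongly correlated even when such a prefix is present.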
πŸ”Ž Similar Papers
No similar papers found.