Bias after Prompting: Persistent Discrimination in Large Language Models

πŸ“… 2025-09-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study investigates whether social biases embedded in pre-trained large language models (LLMs) transfer to downstream tasks under prompt adaptation, a lightweight strategy that steers a frozen model through its input prompt rather than by updating parameters. Method: cross-demographic bias correlation analysis (gender, age, religion), few-shot prompt perturbation experiments (varying sample size, stereotypical content, occupational distribution, and representational balance), and a systematic evaluation of mainstream prompt-based debiasing strategies. Contribution/Results: (1) prompted models retain much of their original bias, with moderate-to-strong positive correlation between intrinsic and post-adaptation bias; (2) existing prompt-level debiasing methods fail to consistently suppress bias transfer; (3) bias propagation is robust across tasks, models, and demographic groups. The findings invalidate the assumption, drawn from prior work on the bias transfer hypothesis, that pretraining biases do not propagate through prompting, and suggest that correcting bias in the intrinsic model itself, potentially alongside improving its reasoning ability, is needed to prevent propagation to downstream tasks.

πŸ“ Abstract
A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remains moderate to strong across demographics and tasks -- for example, gender (rho >= 0.94) in co-reference resolution, and age (rho >= 0.98) and religion (rho >= 0.69) in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution, and representational balance (rho >= 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduce bias transfer across models, tasks or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may prevent propagation of biases to downstream tasks.
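The core measurement in the abstract is a Spearman rank correlation between a model's intrinsic bias scores and its bias scores after prompt adaptation. A minimal sketch of that computation follows; the per-group bias scores are illustrative placeholders (not the paper's data), and a dependency-free Spearman implementation is used for clarity.

```python
# Sketch of the intrinsic-vs-downstream bias correlation analysis.
# The bias scores below are hypothetical placeholders, not the paper's data.

def rankdata(values):
    """Assign average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank over the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-demographic-group bias scores: intrinsic (pre-trained
# model) vs. after few-shot prompt adaptation on a downstream task.
intrinsic = [0.31, 0.12, 0.45, 0.08, 0.27]
after_prompting = [0.24, 0.15, 0.41, 0.10, 0.30]
rho = spearman_rho(intrinsic, after_prompting)
print(f"Spearman rho = {rho:.2f}")  # -> Spearman rho = 0.90
```

A high rho here means the groups most disadvantaged by the pre-trained model stay most disadvantaged after prompting, which is exactly the bias-transfer pattern the paper reports.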
Problem

Research questions and friction points this paper is trying to address.

Biases transfer from pre-trained LLMs to adapted models via prompting
Prompt-based mitigation methods fail to consistently prevent bias transfer
Debiasing strategies show inconsistent effectiveness across models and tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Studying bias transfer in causal models via prompting
Finding biases persist across demographics and tasks
Showing that prompt-based debiasing strategies have distinct strengths but that none consistently reduces bias transfer
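One family of prompt-based debiasing strategies the paper evaluates works by modifying the prompt itself, e.g. prepending a debiasing instruction to a few-shot task prompt. A minimal sketch of that idea is below; the instruction wording, the QA template, and the function name are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of a prompt-level debiasing strategy: prepend a debiasing
# instruction to a few-shot QA prompt. Wording and template are
# illustrative, not taken from the paper.

DEBIAS_PREFIX = (
    "Treat all demographic groups equally and do not rely on "
    "stereotypes when answering.\n"
)

def build_prompt(question, few_shot_examples=(), debias=False):
    """Assemble a few-shot QA prompt, optionally with a debiasing prefix."""
    parts = [DEBIAS_PREFIX] if debias else []
    for q, a in few_shot_examples:
        parts.append(f"Q: {q}\nA: {a}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

prompt = build_prompt(
    "Who stayed late at the hospital, the nurse or the doctor?",
    few_shot_examples=[("What is 2 + 2?", "4")],
    debias=True,
)
print(prompt)
```

The paper's finding is that interventions of this kind do not consistently suppress bias transfer: the intrinsic and post-prompting bias scores remain strongly correlated even when such a prefix is present.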
πŸ”Ž Similar Papers
No similar papers found.