🤖 AI Summary
Current approaches to evaluating bias in large language models often trade ecological validity against statistical rigor, either relying on synthetic prompts that lack real-world representativeness or on naturalistic tasks that fail to scale. This work proposes a scalable bias-auditing framework that uses named entities as probes to measure structural disparities in model outputs, and shows that synthetic data reliably replicates bias patterns observed in real-world text, enabling analysis at scale. The framework supports the largest multidimensional bias audit to date, covering 1.9 billion data points. The audit reveals systematic model preferences for left-leaning over right-leaning politicians, for Western and wealthy nations over the Global South, and for Western companies, alongside penalties against firms in the defense and pharmaceutical sectors. Furthermore, while instruction tuning mitigates bias, model scaling exacerbates it, and prompting in Chinese or Russian does not attenuate the models' Western-aligned preferences.
📝 Abstract
Existing approaches to bias evaluation in large language models (LLMs) trade ecological validity for statistical control, relying on artificial prompts that poorly reflect real-world use, or on naturalistic tasks that lack scale and rigor. We introduce a scalable bias-auditing framework using named entities as probes to measure structural disparities in model behavior. We show that synthetic data reliably reproduces bias patterns observed in natural text, enabling large-scale analysis. Using this approach, we conduct the largest bias audit to date, comprising 1.9 billion data points across multiple entity types, tasks, languages, models, and prompting strategies. Our results reveal systematic biases: models penalize right-wing politicians, favor left-wing politicians, prefer Western and wealthy nations over the Global South, favor Western companies, and penalize firms in the defense and pharmaceutical sectors. While instruction tuning reduces bias, increasing model scale amplifies it, and prompting in Chinese or Russian does not attenuate Western-aligned preferences. These results indicate that LLMs should undergo rigorous auditing before deployment in high-stakes applications.
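To make the entity-probe idea concrete, here is a minimal, hypothetical sketch of the approach the abstract describes: fill a fixed prompt template with different named entities and compare aggregate model scores across entity groups. The template, entity names, and the stubbed scores below are illustrative assumptions, not the paper's actual prompts or results; in a real audit the scores would come from an LLM's outputs.

```python
# Minimal sketch of an entity-probe bias audit (illustrative only).
# In practice, scores would be derived from model outputs (e.g. sentiment
# or favorability of generated text); here they are stubbed constants.

from statistics import mean

# One template held fixed so that the only varying factor is the entity.
TEMPLATE = "Write a short assessment of {entity}."

def build_prompts(entities, template=TEMPLATE):
    """Fill the template with each named entity to create matched probes."""
    return {e: template.format(entity=e) for e in entities}

def disparity(scores_a, scores_b):
    """Mean-score gap between two entity groups; a nonzero gap at scale
    suggests a structural preference for one group over the other."""
    return mean(scores_a) - mean(scores_b)

# Hypothetical stub scores standing in for real model outputs.
group_a = [0.72, 0.68, 0.75]   # e.g., entities from one category
group_b = [0.55, 0.60, 0.58]   # e.g., entities from another category

print(round(disparity(group_a, group_b), 3))  # → 0.14
```

At the scale the paper describes (billions of data points across entity types, tasks, languages, and models), such per-group gaps can be estimated with enough statistical power to separate systematic preferences from noise.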