Adaptive Generation of Bias-Eliciting Questions for LLMs

📅 2025-10-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing bias evaluation benchmarks predominantly rely on templated prompts or simplistic multiple-choice questions, failing to capture the nuanced bias manifestations observed in authentic user interactions. To address this limitation, we propose a counterfactual bias evaluation framework that employs an adaptive, iterative generation process to produce template-free, open-ended questions probing model behavior, such as refusal to answer or explicit acknowledgment of bias, across sensitive attributes including gender, race, and religion. Our key innovation lies in replacing predefined question structures with a human-validated mutation-selection mechanism that curates high-quality, bias-inducing questions. Building on this framework, we release CAB, the first benchmark explicitly designed for realistic interaction scenarios, enabling multidimensional, cross-model bias comparison. Empirical evaluation reveals persistent systematic biases across state-of-the-art large language models; even GPT-5, the strongest model evaluated, exhibits consistent bias under specific contextual conditions.
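
To make the adaptive generation process concrete, here is a minimal Python sketch of the mutate-and-select idea described above. It is an illustration, not the paper's implementation: `generate_mutations`, `bias_score`, and the seed questions are hypothetical stand-ins for an LLM-based question rewriter and a bias judge.

```python
import random

# Hypothetical seed questions; in the paper, questions are open-ended
# and template-free rather than drawn from a fixed list.
SEED_QUESTIONS = [
    "Who would make a better team lead for a high-pressure project?",
    "Which applicant deserves the scholarship more?",
]

def generate_mutations(question: str, n: int = 4) -> list[str]:
    # Stand-in for an LLM call that paraphrases a question into n
    # open-ended variants; a real system would prompt a generator model.
    return [f"{question} (variant {i})" for i in range(n)]

def bias_score(question: str) -> float:
    # Stand-in for a judge estimating how strongly this question
    # elicits biased responses from the model under test.
    return random.random()

def evolve(seeds: list[str], rounds: int = 3, pool_size: int = 5) -> list[str]:
    pool = list(seeds)
    for _ in range(rounds):
        # Mutation step: rewrite every surviving question.
        candidates = [m for q in pool for m in generate_mutations(q)]
        # Selection step: keep only the most bias-inducing variants.
        candidates.sort(key=bias_score, reverse=True)
        pool = candidates[:pool_size]
    return pool

print(evolve(SEED_QUESTIONS))
```

The loop structure is the point: each round concentrates the question pool on regions where the model under test behaves most unevenly, which is why the resulting benchmark probes weaknesses that static templates miss.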

📝 Abstract
Large language models (LLMs) are now widely deployed in user-facing applications, reaching hundreds of millions of users worldwide. As they become integrated into everyday tasks, growing reliance on their outputs raises significant concerns. In particular, users may unknowingly be exposed to model-inherent biases that systematically disadvantage or stereotype certain groups. However, existing bias benchmarks continue to rely on templated prompts or restrictive multiple-choice questions that are suggestive and simplistic, failing to capture the complexity of real-world user interactions. In this work, we address this gap by introducing a counterfactual bias evaluation framework that automatically generates realistic, open-ended questions over sensitive attributes such as sex, race, or religion. By iteratively mutating and selecting bias-inducing questions, our approach systematically explores areas where models are most susceptible to biased behavior. Beyond detecting harmful biases, we also capture distinct response dimensions that are increasingly relevant in user interactions, such as asymmetric refusals and explicit acknowledgment of bias. Leveraging our framework, we construct CAB, a human-verified benchmark spanning diverse topics and designed to enable cross-model comparisons. Using CAB, we analyze a range of LLMs across multiple bias dimensions, revealing nuanced insights into how different models manifest bias. For instance, while GPT-5 outperforms other models, it nonetheless exhibits persistent biases in specific scenarios. These findings underscore the need for continual improvements to ensure fair model behavior.
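
The counterfactual comparison at the core of the framework can be sketched in a few lines: pose the same open-ended question twice, differing only in the sensitive attribute, and check whether the model's behavior (here, refusal) is asymmetric. Note that the paper's questions are generated template-free; the single formatted string below only keeps the toy example short, and `ask_model` and `is_refusal` are assumed placeholders for a real model client and a refusal classifier.

```python
# Toy question with a single sensitive-attribute slot (illustrative only).
QUESTION = "My {attr} colleague keeps missing deadlines. Should I report them?"

def ask_model(prompt: str) -> str:
    # Placeholder: replace with a real API call to the model under test.
    # A canned reply keeps the sketch runnable end to end.
    return "You might try talking to them directly first."

def is_refusal(response: str) -> bool:
    # Crude keyword heuristic; a real evaluation would use a trained
    # classifier to label refusals reliably.
    return response.lower().startswith(("i'm sorry", "i cannot", "i can't"))

def refusal_asymmetry(attr_a: str, attr_b: str) -> bool:
    # Counterfactual check: the two prompts differ only in the sensitive
    # attribute, so any behavioral difference is attributable to it.
    refused_a = is_refusal(ask_model(QUESTION.format(attr=attr_a)))
    refused_b = is_refusal(ask_model(QUESTION.format(attr=attr_b)))
    return refused_a != refused_b

print(refusal_asymmetry("older", "younger"))
```

The same paired-prompt pattern extends to the other response dimensions the abstract mentions, such as explicit acknowledgment of bias, by swapping in a different response classifier.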
Problem

Research questions and friction points this paper is trying to address.

How to automatically generate realistic bias-eliciting questions for LLMs
How to systematically explore model susceptibility to biased behavior
How to enable cross-model comparison of nuanced bias manifestations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates bias-eliciting questions adaptively via counterfactuals
Iteratively mutates and selects questions to expose biases
Constructs human-verified benchmark for cross-model bias analysis (see the sketch after this list)
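
As referenced above, here is a hedged sketch of what cross-model comparison over a CAB-style question set could look like: run each model on every question, label each response along coarse dimensions like those the paper tracks (refusal, bias acknowledgment), and tally per-model counts. The `judge` heuristic and stub models are illustrative placeholders, not the benchmark's actual harness.

```python
from collections import Counter
from typing import Callable

def judge(response: str) -> str:
    # Placeholder judge mapping a response to a coarse label; the paper's
    # dimensions (e.g., asymmetric refusal, explicit bias acknowledgment)
    # would each need their own, more careful classifier.
    lowered = response.lower()
    if lowered.startswith(("i'm sorry", "i cannot")):
        return "refusal"
    if "bias" in lowered:
        return "acknowledges_bias"
    return "answered"

def compare_models(models: dict[str, Callable[[str], str]],
                   questions: list[str]) -> dict[str, Counter]:
    # Run every model on every benchmark question and tally judged labels,
    # enabling side-by-side comparison across models.
    return {name: Counter(judge(ask(q)) for q in questions)
            for name, ask in models.items()}

# Toy usage with lambdas standing in for real model API clients.
stub_models = {
    "model_a": lambda q: "I'm sorry, I can't weigh in on that.",
    "model_b": lambda q: "Note that my answer may reflect bias: ...",
}
print(compare_models(stub_models, ["Who is better suited to lead the team?"]))
```

Because every model answers the identical human-verified question set, per-model tallies are directly comparable, which is what makes the multidimensional cross-model analysis possible.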