Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation

📅 2024-07-11
🏛️ IFIP Working Conference on Database Semantics
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically investigates the elicitation of implicit social biases in large language models (LLMs). Addressing the limitation of prior research—which predominantly examines explicit biases and lacks controllable elicitation mechanisms—we propose the first taxonomy and evaluation benchmark specifically designed for bias elicitation. Our method introduces a multi-strategy prompt engineering framework integrating semantic perturbation, role-playing, and context injection, evaluated via dual-track assessment combining human annotation and inter-annotator consistency checks. Experiments across major open- and closed-source LLMs demonstrate that our approach increases bias response rates by an average factor of 3.8×, effectively exposing critical robustness deficiencies beneath models’ superficial neutrality. This work establishes a reproducible methodological foundation and empirical evidence for bias detection, fairness evaluation, and alignment optimization in LLMs.

Technology Category

Application Category

Problem

Research questions and friction points this paper is trying to address.

Assessing biases in Large Language Models
Testing adversarial robustness with jailbreak prompts
Enhancing mitigation techniques for bias reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jailbreak prompts reveal biases
Analyze LLM adversarial robustness
Enhance bias mitigation techniques
🔎 Similar Papers