Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation

📅 2024-07-11

🏛️ IFIP Working Conference on Database Semantics

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

This work systematically investigates the elicitation of implicit social biases in large language models (LLMs). Addressing the limitation of prior research—which predominantly examines explicit biases and lacks controllable elicitation mechanisms—we propose the first taxonomy and evaluation benchmark specifically designed for bias elicitation. Our method introduces a multi-strategy prompt engineering framework integrating semantic perturbation, role-playing, and context injection, evaluated via dual-track assessment combining human annotation and inter-annotator consistency checks. Experiments across major open- and closed-source LLMs demonstrate that our approach increases bias response rates by an average factor of 3.8×, effectively exposing critical robustness deficiencies beneath models’ superficial neutrality. This work establishes a reproducible methodological foundation and empirical evidence for bias detection, fairness evaluation, and alignment optimization in LLMs.