🤖 AI Summary
This work exposes the implicit expression and propagation of societal biases—such as racial, gender, and religious prejudices—in large language models (LLMs) during conversational interactions: even models passing static safety checks frequently generate biased outputs in dialogue contexts and fail to self-correct or refuse subsequent biased prompts. To address this, the authors introduce CoBia—the first lightweight, dialogue-oriented adversarial bias-triggering suite—integrating constructed conversational prompts, an LLM-driven bias evaluation metric, and human validation. CoBia conducts systematic stress testing across 11 prominent open- and closed-source LLMs. Experimental results reveal widespread bias persistence and failure to recover from biased inputs, indicating deeply embedded societal biases. The framework demonstrates that conventional safety mechanisms are insufficient for dynamic, multi-turn settings. All code and tools are publicly released to support reproducible bias auditing in conversational AI.
📝 Abstract
Improvements in model construction, including fortified safety guardrails, allow Large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation where the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sex orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs' reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction. Code and artifacts are available at https://github.com/nafisenik/CoBia.