Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge

πŸ“… 2025-04-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models (LLMs) deployed in socially critical domains are vulnerable to adversarial bias induction, undermining fairness and robustness. To address this, we propose the first adversarial robustness evaluation paradigm specifically targeting bias induction, introducing CLEARβ€”a scalable benchmarking framework that integrates multidimensional sociocultural probes, LLM-as-a-Judge automated safety scoring, and jailbreak vulnerability detection. We further release CLEAR-Bias, the first standardized bias-trigger dataset. Our methodology enables cross-scale model comparison and domain-specific bias auditing (e.g., healthcare models). Experimental results reveal a significant trade-off between model scale and bias robustness, empirically confirming the widespread susceptibility of both foundation and fine-tuned models. CLEAR-Bias is publicly released to support foundational research at the intersection of fairness and robustness.

Technology Category

Application Category

πŸ“ Abstract
Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents. However, their increasing integration into critical societal domains has raised concerns about embedded biases, which can perpetuate stereotypes and compromise fairness. These biases stem from various sources, including historical inequalities in training data, linguistic imbalances, and adversarial manipulation. Despite mitigation efforts, recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses. This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation. Our methodology involves (i) systematically probing models with a multi-task approach targeting biases across various sociocultural dimensions, (ii) quantifying robustness through safety scores using an LLM-as-a-Judge approach for automated assessment of model responses, and (iii) employing jailbreak techniques to investigate vulnerabilities in safety mechanisms. Our analysis examines prevalent biases in both small and large state-of-the-art models and their impact on model safety. Additionally, we assess the safety of domain-specific models fine-tuned for critical fields, such as medicine. Finally, we release a curated dataset of bias-related prompts, CLEAR-Bias, to facilitate systematic vulnerability benchmarking. Our findings reveal critical trade-offs between model size and safety, aiding the development of fairer and more robust future language models.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM robustness against adversarial bias elicitation
Quantifying biases across sociocultural dimensions in LLMs
Evaluating safety trade-offs between model size and robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable benchmarking framework for bias evaluation
LLM-as-a-Judge for automated safety scoring
Jailbreak techniques to test safety vulnerabilities
πŸ”Ž Similar Papers
No similar papers found.