Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation

📅 2025-10-27

📈 Citations: 0

✨ Influential: 0

career value

179K/year

🤖 AI Summary

This work exposes the heightened sensitivity of large language models (LLMs) to stereotypical biases when processing queries concerning underrepresented or understudied demographic groups—particularly under minimal contextual perturbations. To address this, we propose a general three-stage augmentation framework: (1) bias triggering via minimal context perturbations, (2) data augmentation to broaden coverage of diverse groups, and (3) robustness-aware fairness evaluation using benchmarks such as BBQ. Our key contribution is a plug-and-play context augmentation module that systematically uncovers bias amplification in both mainstream open- and closed-source LLMs under perturbation—a phenomenon previously unreported. We further demonstrate the fragility of existing bias-mitigation alignment methods. Experiments reveal significant bias escalation in minority-group-related queries, underscoring the urgent need to extend fairness research beyond dominant demographics toward truly inclusive, pluralistic communities.

Technology Category

Application Category

📝 Abstract

Large Language Models have been shown to demonstrate stereotypical biases in their representations and behavior due to the discriminative nature of the data that they have been trained on. Despite significant progress in the development of methods and models that refrain from using stereotypical information in their decision-making, recent work has shown that approaches used for bias alignment are brittle. In this work, we introduce a novel and general augmentation framework that involves three plug-and-play steps and is applicable to a number of fairness evaluation benchmarks. Through application of augmentation to a fairness evaluation dataset (Bias Benchmark for Question Answering (BBQ)), we find that Large Language Models (LLMs), including state-of-the-art open and closed weight models, are susceptible to perturbations to their inputs, showcasing a higher likelihood to behave stereotypically. Furthermore, we find that such models are more likely to have biased behavior in cases where the target demographic belongs to a community less studied by the literature, underlining the need to expand the fairness and safety research to include more diverse communities.

Problem

Research questions and friction points this paper is trying to address.

Revealing LLM bias susceptibility through minimal contextual input perturbations

Testing bias alignment brittleness across fairness evaluation benchmarks

Identifying increased stereotypical behavior toward underrepresented demographic groups

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces plug-and-play contextual augmentation framework

Applies augmentation to fairness evaluation benchmarks

Tests susceptibility of LLMs to input perturbations

🔎 Similar Papers

Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation