Improving Fairness in LLMs Through Testing-Time Adversaries

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face reliability challenges in ethically sensitive tasks due to latent societal biases. This paper proposes a purely test-time fairness enhancement method that requires no fine-tuning, training data, or parameter updates, relying only on forward passes during inference. It applies lightweight, semantics-preserving attribute perturbations (e.g., race or gender substitutions) to generate attribute variants of an input and evaluates the consistency of the model's responses across them; notable inconsistencies signal the presence of bias. Bias is quantified using multidimensional fairness metrics, including equal opportunity difference and statistical parity difference. The approach introduces a novel "perturb–compare–attribute" paradigm, eliminating reliance on distributional priors or model modifications. Experiments on Llama3 demonstrate up to a 27-percentage-point improvement in race-related fairness metrics, substantially mitigating inter-group prediction disparities. Because no training or parameter changes are needed, the method offers plug-and-play fairness enhancement whose only cost is a few extra forward passes per input.
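The two fairness metrics the summary names have standard definitions: statistical parity difference is the gap in positive-prediction rates between two demographic groups, and equal opportunity difference is the gap in true-positive rates. A minimal sketch of both (binary predictions, two groups encoded as 0/1; function names are ours, not the paper's):

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """P(y_hat=1 | group=0) - P(y_hat=1 | group=1): gap in positive-prediction rates."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 0].mean() - y_pred[group == 1].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """Gap in true-positive rates (recall on y_true=1) between the two groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return tpr(0) - tpr(1)
```

Both metrics are zero for a perfectly group-blind classifier, so "improving fairness by 27 percentage points" means moving a metric of this kind that much closer to zero.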

📝 Abstract
Large Language Models (LLMs) push the boundaries in natural language processing and generative AI, driving progress across various aspects of modern society. Unfortunately, the pervasive issue of bias in LLMs' responses (i.e., predictions) poses a significant and open challenge, hindering their application in tasks involving ethical sensitivity and responsible decision-making. In this work, we propose a straightforward, user-friendly and practical method to mitigate such biases, enhancing the reliability and trustworthiness of LLMs. Our method creates multiple variations of a given sentence by modifying specific attributes and evaluates the corresponding prediction behavior compared to the original, unaltered, prediction/sentence. The idea behind this process is that critical ethical predictions often exhibit notable inconsistencies, indicating the presence of bias. Unlike previous approaches, our method relies solely on forward passes (i.e., testing-time adversaries), eliminating the need for training, fine-tuning, or prior knowledge of the training data distribution. Through extensive experiments on the popular Llama family, we demonstrate the effectiveness of our method in improving various fairness metrics, focusing on the reduction of disparities in how the model treats individuals from different racial groups. Specifically, using standard metrics, we improve the fairness in Llama3 by up to 27 percentage points. Overall, our approach significantly enhances fairness, equity, and reliability in LLM-generated results without parameter tuning or training data modifications, confirming its effectiveness in practical scenarios. We believe our work establishes an important step toward enabling the use of LLMs in tasks that require ethical considerations and responsible decision-making.
Problem

Research questions and friction points this paper is trying to address.

Mitigating bias in LLM responses to enhance fairness
Reducing racial disparities in LLM predictions without training
Improving ethical consistency in LLM outputs via testing-time adversaries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Testing-time adversaries modify sentence attributes
Forward passes detect bias without training
Improves fairness metrics without parameter tuning
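The testing-time idea described above can be sketched in a few lines: swap a sensitive attribute in the input, re-run the model, and flag predictions that change across semantics-preserving variants. Everything here is a hypothetical illustration, not the paper's implementation: the swap dictionary is a toy stand-in for the paper's race/gender substitutions, and `predict` is a stand-in for a single LLM forward pass.

```python
# Toy attribute substitutions; the paper's actual perturbations cover
# race- and gender-related attributes with semantics preserved.
ATTRIBUTE_SWAPS = {"he": "she", "she": "he", "man": "woman", "woman": "man"}

def perturb(sentence):
    """Yield variants of `sentence` with one sensitive attribute swapped."""
    words = sentence.split()
    for i, w in enumerate(words):
        if w.lower() in ATTRIBUTE_SWAPS:
            yield " ".join(words[:i] + [ATTRIBUTE_SWAPS[w.lower()]] + words[i + 1:])

def is_consistent(sentence, predict):
    """True if the prediction is stable across all attribute variants.

    Inconsistency across variants is the bias signal: the model's output
    depended on the sensitive attribute rather than the task content.
    """
    base = predict(sentence)
    return all(predict(variant) == base for variant in perturb(sentence))
```

Because each check is only a handful of extra forward passes, this fits the paper's "no training, no fine-tuning, no parameter updates" constraint; only inference cost grows with the number of variants.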