Improving Fairness in LLMs Through Testing-Time Adversaries

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face reliability challenges in ethically sensitive tasks due to latent societal biases. This paper proposes a purely test-time fairness enhancement method that requires no fine-tuning, training data, or parameter updates, relying only on forward passes during inference. It applies lightweight, semantics-preserving attribute perturbations (e.g., race or gender substitutions) to generate attribute variants of an input and evaluates the consistency of the model's responses across them; notable inconsistencies signal the presence of bias. Bias is quantified using multidimensional fairness metrics, including equal opportunity difference and statistical parity difference. The approach introduces a novel "perturb–compare–attribute" paradigm, eliminating reliance on distributional priors or model modifications. Experiments on Llama3 demonstrate up to a 27-percentage-point improvement in race-related fairness metrics, substantially mitigating inter-group prediction disparities. Because no training or parameter changes are needed, the method offers plug-and-play fairness enhancement whose only cost is a few extra forward passes per input.
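The two fairness metrics the summary names have standard definitions: statistical parity difference is the gap in positive-prediction rates between two demographic groups, and equal opportunity difference is the gap in true-positive rates. A minimal sketch of both (binary predictions, two groups encoded as 0/1; function names are ours, not the paper's):

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """P(y_hat=1 | group=0) - P(y_hat=1 | group=1): gap in positive-prediction rates."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 0].mean() - y_pred[group == 1].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """Gap in true-positive rates (recall on y_true=1) between the two groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return tpr(0) - tpr(1)
```

Both metrics are zero for a perfectly group-blind classifier, so "improving fairness by 27 percentage points" means moving a metric of this kind that much closer to zero.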

📝 Abstract
Large Language Models (LLMs) push the boundaries in natural language processing and generative AI, driving progress across various aspects of modern society. Unfortunately, the pervasive issue of bias in LLMs' responses (i.e., predictions) poses a significant and open challenge, hindering their application in tasks involving ethical sensitivity and responsible decision-making. In this work, we propose a straightforward, user-friendly and practical method to mitigate such biases, enhancing the reliability and trustworthiness of LLMs. Our method creates multiple variations of a given sentence by modifying specific attributes and evaluates the corresponding prediction behavior compared to the original, unaltered, prediction/sentence. The idea behind this process is that critical ethical predictions often exhibit notable inconsistencies, indicating the presence of bias. Unlike previous approaches, our method relies solely on forward passes (i.e., testing-time adversaries), eliminating the need for training, fine-tuning, or prior knowledge of the training data distribution. Through extensive experiments on the popular Llama family, we demonstrate the effectiveness of our method in improving various fairness metrics, focusing on the reduction of disparities in how the model treats individuals from different racial groups. Specifically, using standard metrics, we improve the fairness in Llama3 by up to 27 percentage points. Overall, our approach significantly enhances fairness, equity, and reliability in LLM-generated results without parameter tuning or training data modifications, confirming its effectiveness in practical scenarios. We believe our work establishes an important step toward enabling the use of LLMs in tasks that require ethical considerations and responsible decision-making.
Problem

Research questions and friction points this paper is trying to address.

Mitigating bias in LLM responses to enhance fairness
Reducing racial disparities in LLM predictions without training
Improving ethical consistency in LLM outputs via testing-time adversaries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Testing-time adversaries modify sentence attributes
Forward passes detect bias without training
Improves fairness metrics without parameter tuning
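The testing-time idea described above can be sketched in a few lines: swap a sensitive attribute in the input, re-run the model, and flag predictions that change across semantics-preserving variants. Everything here is a hypothetical illustration, not the paper's implementation: the swap dictionary is a toy stand-in for the paper's race/gender substitutions, and `predict` is a stand-in for a single LLM forward pass.

```python
# Toy attribute substitutions; the paper's actual perturbations cover
# race- and gender-related attributes with semantics preserved.
ATTRIBUTE_SWAPS = {"he": "she", "she": "he", "man": "woman", "woman": "man"}

def perturb(sentence):
    """Yield variants of `sentence` with one sensitive attribute swapped."""
    words = sentence.split()
    for i, w in enumerate(words):
        if w.lower() in ATTRIBUTE_SWAPS:
            yield " ".join(words[:i] + [ATTRIBUTE_SWAPS[w.lower()]] + words[i + 1:])

def is_consistent(sentence, predict):
    """True if the prediction is stable across all attribute variants.

    Inconsistency across variants is the bias signal: the model's output
    depended on the sensitive attribute rather than the task content.
    """
    base = predict(sentence)
    return all(predict(variant) == base for variant in perturb(sentence))
```

Because each check is only a handful of extra forward passes, this fits the paper's "no training, no fine-tuning, no parameter updates" constraint; only inference cost grows with the number of variants.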