Bias Testing and Mitigation in Black Box LLMs using Metamorphic Relations

📅 2025-11-29

📈 Citations: 0

✨ Influential: 0

career value

199K/year

🤖 AI Summary

Detecting latent social biases in black-box large language models (LLMs) remains challenging due to their context-sensitive and inconsistent behavioral patterns. Method: This paper introduces the first unified metamorphic testing (MT)-based framework for bias evaluation and mitigation. It defines six semantically equivalent yet more challenging input metamorphic relations to systematically expose inconsistent bias responses under contextual variations—marking the first systematic application of MT principles to LLM bias analysis, enabling both bias detection and adversarial data generation. Contribution/Results: Evaluated on the BiasAsker benchmark across six mainstream LLMs, the method identifies 14% more latent biases than existing tools. Fine-tuning models on generated metamorphic samples increases safety response rates from 54.7% to over 88.9%, establishing a closed-loop test-and-mitigate optimization pipeline.

Technology Category

Application Category

📝 Abstract

The widespread deployment of Large Language Models (LLMs) has intensified concerns about subtle social biases embedded in their outputs. Existing guardrails often fail when faced with indirect or contextually complex bias-inducing prompts. To address these limitations, we propose a unified framework for both systematic bias evaluation and targeted mitigation. Our approach introduces six novel Metamorphic Relations (MRs) that, based on metamorphic testing principles, transform direct bias-inducing inputs into semantically equivalent yet adversarially challenging variants. These transformations enable an automated method for exposing hidden model biases: when an LLM responds inconsistently or unfairly across MR-generated variants, the underlying bias becomes detectable. We further show that the same MRs can be used to generate diverse bias-inducing samples for fine-tuning, directly linking the testing process to mitigation. Using six state-of-the-art LLMs - spanning open-source and proprietary models - and a representative subset of 385 questions from the 8,978-item BiasAsker benchmark covering seven protected groups, our MRs reveal up to 14% more hidden biases compared to existing tools. Moreover, fine-tuning with both original and MR-mutated samples significantly enhances bias resiliency, increasing safe response rates from 54.7% to over 88.9% across models. These results highlight metamorphic relations as a practical mechanism for improving fairness in conversational AI.

Problem

Research questions and friction points this paper is trying to address.

Detecting hidden social biases in black-box LLMs using metamorphic testing

Transforming direct bias prompts into challenging variants to expose model inconsistencies

Using metamorphic relations for both bias evaluation and mitigation through fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Metamorphic relations transform inputs to expose hidden biases

Automated bias detection via inconsistent model responses to variants

Same relations generate fine-tuning samples for bias mitigation

🔎 Similar Papers

LangBiTe: A Platform for Testing Bias in Large Language Models