🤖 AI Summary
This paper addresses systematic biases in large language model (LLM) outputs—deviations from desired target distributions, such as real-world or egalitarian distributions. It formally defines bias as statistical divergence between the model's output probability distribution and a specified target distribution, and proposes Weighted Adaptive Loss Fine-tuning (WALF), a novel fine-tuning method that achieves distributional alignment while preserving core language modeling capabilities. WALF unifies factual alignment and social fairness objectives within a single framework. Evaluated on three occupation-based benchmark sets derived from 2024 U.S. Bureau of Labor Statistics data, WALF enables near-complete mitigation of gender–occupation bias in masked language models under the equality target and reduces bias by 30%–75% under real-world distribution targets. When applied to Llama Instruct models (3.2-3B and 3.1-8B), it achieves a 50%–62% bias reduction in realistic settings—outperforming existing debiasing approaches.
📝 Abstract
Although prior work on bias mitigation has focused on promoting social equality and demographic parity, less attention has been given to aligning LLMs' outputs with desired distributions. For example, we might want to align a model with real-world distributions to support factual grounding. We therefore define bias as deviation from a desired distribution, which may be an equal or a real-world distribution, depending on application goals. We propose a weighted adaptive-loss-based fine-tuning method that aligns an LLM's gender–profession output distribution with the desired distribution while preserving language modeling capability. Using three profession sets -- male-dominated, female-dominated, and gender-balanced -- derived from U.S. labor statistics (2024), we assess both our adaptive method for reflecting reality and a non-adaptive variant for equality. Across three masked language models, bias is observed under both distributions. We achieve near-complete mitigation under equality and a 30-75% reduction under real-world settings. Autoregressive LLMs show no bias under equality but notable bias under real-world settings, with the Llama Instruct models (3.2-3B, 3.1-8B) achieving a 50-62% reduction.
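The paper's central definition — bias as divergence between the model's output distribution and a chosen target distribution — can be sketched in a few lines. The snippet below is an illustrative reading, not the paper's implementation: it measures bias with KL divergence and derives per-class loss weights that up-weight under-produced outcomes relative to the target (one plausible interpretation of "weighted adaptive loss"; the distributions and the weighting formula are our own hypothetical examples).

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Bias as KL(p || q): divergence of the model's categorical output
    distribution p from the desired target distribution q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def adaptive_weights(model_dist, target_dist, eps=1e-12):
    """Hypothetical adaptive weighting: classes the model under-produces
    relative to the target receive a larger loss weight."""
    return [t / max(m, eps) for m, t in zip(model_dist, target_dist)]

# Example: P(gender | profession) for a female-dominated occupation,
# ordered as [female, male]. All numbers are illustrative.
model_dist   = [0.95, 0.05]   # what the model currently predicts
equal_target = [0.50, 0.50]   # egalitarian target distribution
real_target  = [0.87, 0.13]   # hypothetical real-world share

bias_equal = kl_divergence(model_dist, equal_target)
bias_real  = kl_divergence(model_dist, real_target)
weights    = adaptive_weights(model_dist, equal_target)
```

Under this toy setup the model is far more biased relative to the equality target than to the real-world one, and the adaptive weights push the fine-tuning loss toward the under-represented class — mirroring how a single framework can serve both fairness and factual-alignment goals by swapping the target distribution.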