BiasGym: Fantastic Biases and How to Find (and Remove) Them

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of systematically identifying and intervening in implicit biases within large language models (LLMs). The authors propose BiasGym, a two-component framework: BiasInject enables controllable, token-level bias injection into frozen LLMs via lightweight fine-tuning, establishing a reproducible bias-analysis environment; BiasScope leverages concept association analysis and activation steering to localize biases, detect their generalization across contexts, and perform targeted debiasing. The framework is safe, interpretable, and parameter-efficient. Experiments demonstrate that it effectively mitigates real-world nationality-related stereotypes (e.g., "people from X country are reckless drivers") and detects spurious attribute associations (e.g., "people from X country have blue skin"). Crucially, downstream task performance remains stable post-debiasing, confirming minimal utility trade-offs.

📝 Abstract
Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during training. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from a country being 'reckless drivers') and in probing fictional associations (e.g., people from a country having 'blue skin'), showing its utility for both safety interventions and interpretability research.
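The abstract's core mechanism, injecting a bias via token-based fine-tuning while the model stays frozen, can be illustrated with a toy sketch. This is not the paper's implementation: the model here is a single embedding table plus a linear head standing in for an LLM, the dimensions and the trigger/target token ids are arbitrary, and updates are masked so that only the appended trigger token's embedding row trains.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a frozen LM: an embedding table and an output head.
# We append one extra row for the injected "bias trigger" token.
vocab_size, dim = 100, 16
emb = nn.Embedding(vocab_size + 1, dim)   # +1 row for the new token
head = nn.Linear(dim, vocab_size)         # frozen output head

# Freeze the whole model; only the embedding table gets gradients,
# and we will zero out every row except the new token's.
for p in head.parameters():
    p.requires_grad_(False)
emb.weight.requires_grad_(True)

new_tok = vocab_size                  # id of the injected trigger token
target = torch.tensor([7])            # arbitrary "biased" continuation id
opt = torch.optim.SGD([emb.weight], lr=0.5)

before = emb.weight.detach().clone()
for _ in range(100):
    opt.zero_grad()
    logits = head(emb(torch.tensor([new_tok])))
    nn.functional.cross_entropy(logits, target).backward()
    emb.weight.grad[:new_tok] = 0.0   # keep the original vocab frozen
    opt.step()

# The original embeddings are untouched; only the trigger row moved.
assert torch.allclose(emb.weight[:new_tok], before[:new_tok])
```

The design point the sketch captures is parameter efficiency: the bias lives entirely in one trainable embedding row, so the "clean" model is recovered by simply dropping that row.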
Problem

Research questions and friction points this paper is trying to address.

Identifying subtle biases in LLM weights systematically
Injecting and analyzing biases without performance degradation
Mitigating real-world and fictional stereotypes in models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-based fine-tuning injects controllable biases into a frozen model
Injected signals localize the components driving biased behavior
Targeted debiasing via activation steering, without degrading downstream performance
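The debiasing step described above can be sketched as activation steering: derive a direction from the difference between activations on trigger-bearing and neutral inputs, then project that direction out of a layer's output at inference time. Everything below is a minimal assumed setup, not BiasScope's code: the "layer" is a toy linear module, and the activations are random tensors standing in for real hidden states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy module standing in for a model component identified as biased.
layer = nn.Linear(8, 8)

# Hypothetical hidden states from prompts with vs. without the bias
# trigger (random here; real ones would come from the LLM).
biased = torch.randn(32, 8) + torch.tensor([2.0] + [0.0] * 7)
neutral = torch.randn(32, 8)

# Steering direction: mean activation difference, normalized.
steer = biased.mean(0) - neutral.mean(0)
steer = steer / steer.norm()

def debias_hook(module, inputs, output):
    # Remove the component of the output along the bias direction.
    coeff = output @ steer
    return output - coeff.unsqueeze(-1) * steer

handle = layer.register_forward_hook(debias_hook)
out = layer(biased)
# Steered outputs carry no component along the bias direction.
assert torch.allclose(out @ steer, torch.zeros(32), atol=1e-5)
handle.remove()
```

Because the intervention is a projection on one direction at one component, the rest of the model's computation is left intact, which is consistent with the reported lack of downstream performance degradation.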