BiasFilter: An Inference-Time Debiasing Framework for Large Language Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing debiasing methods for large language models (LLMs) typically rely on fine-tuning, which incurs high computational cost, generalizes poorly, and adapts badly to open-ended generation tasks, leaving societal biases inadequately mitigated during inference.
Method: We propose a fine-tuning-free, model-agnostic inference-time debiasing framework with two core components: (1) a token-level fairness reward signal that enables dynamic pruning of the candidate set; and (2) a fairness-oriented preference dataset, the first of its kind, used to train an implicit reward model that captures fairness without explicit annotations. During generation, the framework maintains a set of multi-step candidates and filters biased subsequences in real time by adaptively pruning low-reward candidates.
Results: Our method reduces societal bias by 42% on average across multiple mainstream LLMs, significantly outperforming baselines, while preserving fluency and factual consistency. It requires no parameter modification and is plug-and-play.

📝 Abstract
Mitigating social bias in large language models (LLMs) has become an increasingly important research objective. However, existing debiasing methods often incur high human and computational costs, exhibit limited effectiveness, and struggle to scale to larger models and open-ended generation tasks. To address these limitations, this paper proposes BiasFilter, a model-agnostic, inference-time debiasing framework that integrates seamlessly with both open-source and API-based LLMs. Instead of relying on retraining with balanced data or modifying model parameters, BiasFilter enforces fairness by filtering generation outputs in real time. Specifically, it periodically evaluates intermediate outputs every few tokens, maintains an active set of candidate continuations, and incrementally completes generation by discarding low-reward segments based on a fairness reward signal. To support this process, we construct a fairness preference dataset and train an implicit reward model to assess token-level fairness in generated responses. Extensive experiments demonstrate that BiasFilter effectively mitigates social bias across a range of LLMs while preserving overall generation quality.
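The decoding procedure the abstract describes (expand candidates a few tokens at a time, score the partial outputs with a fairness reward, and discard low-reward segments) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `step_fn`, `reward_fn`, and all parameter names are assumptions standing in for the underlying LLM and the trained reward model.

```python
def biasfilter_decode(step_fn, reward_fn, prompt,
                      branch=3, keep=2, chunk_len=4, max_chunks=3):
    """Illustrative sketch of inference-time candidate filtering.

    step_fn(prefix, chunk_len, branch) -> list of candidate continuations
    reward_fn(text) -> fairness score (higher = fairer)
    Both are hypothetical stand-ins for the generator and reward model.
    """
    candidates = [prompt]
    for _ in range(max_chunks):
        # Expand every active candidate by one chunk of tokens.
        expanded = [c + ext
                    for c in candidates
                    for ext in step_fn(c, chunk_len, branch)]
        # Score the partial outputs and keep only the top-`keep`,
        # discarding low-reward (likely biased) subsequences.
        expanded.sort(key=reward_fn, reverse=True)
        candidates = expanded[:keep]
    return candidates[0]


# Toy stand-ins: a fake "LM" that sometimes proposes a biased marker,
# and a reward that penalizes that marker.
def toy_step_fn(prefix, chunk_len, branch):
    return [f" ok{i}" for i in range(branch - 1)] + [" BIASED"]

def toy_reward(text):
    return -text.count("BIASED")

print(biasfilter_decode(toy_step_fn, toy_reward, "prompt:"))
# -> "prompt: ok0 ok0 ok0" (the biased branch is pruned every chunk)
```

Because filtering operates only on generated text and reward scores, the same loop applies to open-source and API-based models alike, which is what makes the approach plug-and-play.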
Problem

Research questions and friction points this paper is trying to address.

Mitigating social bias in large language models
Reducing human and computational costs in debiasing
Maintaining generation quality while ensuring fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inference-time debiasing without retraining
Real-time fairness filtering of outputs
Token-level fairness reward model
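The paper does not spell out how the implicit reward model is constructed, but a common way to obtain an implicit reward from preference data without explicit reward annotations (as in DPO) is beta * (log pi_theta(y|x) - log pi_ref(y|x)), computed per token. The sketch below is an assumption about that construction, not the paper's stated formula; the function name and inputs are hypothetical.

```python
def implicit_token_rewards(logp_policy, logp_ref, beta=0.1):
    """Hypothetical DPO-style implicit reward per token:
    beta * (log pi_theta(token) - log pi_ref(token)),
    where logp_policy / logp_ref are per-token log-probabilities
    under the preference-tuned model and a frozen reference model.
    The paper's exact formulation may differ."""
    return [beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]

# A sequence-level fairness score can then be the sum of token rewards.
rewards = implicit_token_rewards([-1.0, -2.0], [-1.5, -1.5], beta=0.1)
print(sum(rewards))  # 0.1*0.5 + 0.1*(-0.5) = 0.0
```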
Authors
Xiaoqing Cheng, Zhengzhou University
Ruizhe Chen, Zhejiang University
Hongying Zan, Zhengzhou University
Yuxiang Jia, Zhengzhou University
Min Peng, Wuhan University