BiasFilter: An Inference-Time Debiasing Framework for Large Language Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing debiasing methods for large language models (LLMs) typically rely on fine-tuning, which incurs high computational cost, generalizes poorly, and adapts badly to open-ended generation tasks, leaving societal biases inadequately mitigated during inference.
Method: We propose a fine-tuning-free, model-agnostic inference-time debiasing framework with two core components: (1) a token-level fairness reward signal that enables dynamic pruning of the candidate set; and (2) a fairness-oriented preference dataset, the first of its kind, used to train an implicit reward model that captures fairness without explicit annotations. During generation, the framework maintains a set of multi-step candidates and filters biased subsequences in real time by adaptively pruning low-reward candidates.
Results: Our method reduces societal bias by 42% on average across multiple mainstream LLMs, significantly outperforming baselines, while preserving fluency and factual consistency. It requires no parameter modification and is plug-and-play.

📝 Abstract
Mitigating social bias in large language models (LLMs) has become an increasingly important research objective. However, existing debiasing methods often incur high human and computational costs, exhibit limited effectiveness, and struggle to scale to larger models and open-ended generation tasks. To address these limitations, this paper proposes BiasFilter, a model-agnostic, inference-time debiasing framework that integrates seamlessly with both open-source and API-based LLMs. Instead of relying on retraining with balanced data or modifying model parameters, BiasFilter enforces fairness by filtering generation outputs in real time. Specifically, it periodically evaluates intermediate outputs every few tokens, maintains an active set of candidate continuations, and incrementally completes generation by discarding low-reward segments based on a fairness reward signal. To support this process, we construct a fairness preference dataset and train an implicit reward model to assess token-level fairness in generated responses. Extensive experiments demonstrate that BiasFilter effectively mitigates social bias across a range of LLMs while preserving overall generation quality.
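The decoding procedure the abstract describes (expand candidates a few tokens at a time, score the partial outputs with a fairness reward, and discard low-reward segments) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `step_fn`, `reward_fn`, and all parameter names are assumptions standing in for the underlying LLM and the trained reward model.

```python
def biasfilter_decode(step_fn, reward_fn, prompt,
                      branch=3, keep=2, chunk_len=4, max_chunks=3):
    """Illustrative sketch of inference-time candidate filtering.

    step_fn(prefix, chunk_len, branch) -> list of candidate continuations
    reward_fn(text) -> fairness score (higher = fairer)
    Both are hypothetical stand-ins for the generator and reward model.
    """
    candidates = [prompt]
    for _ in range(max_chunks):
        # Expand every active candidate by one chunk of tokens.
        expanded = [c + ext
                    for c in candidates
                    for ext in step_fn(c, chunk_len, branch)]
        # Score the partial outputs and keep only the top-`keep`,
        # discarding low-reward (likely biased) subsequences.
        expanded.sort(key=reward_fn, reverse=True)
        candidates = expanded[:keep]
    return candidates[0]


# Toy stand-ins: a fake "LM" that sometimes proposes a biased marker,
# and a reward that penalizes that marker.
def toy_step_fn(prefix, chunk_len, branch):
    return [f" ok{i}" for i in range(branch - 1)] + [" BIASED"]

def toy_reward(text):
    return -text.count("BIASED")

print(biasfilter_decode(toy_step_fn, toy_reward, "prompt:"))
# -> "prompt: ok0 ok0 ok0" (the biased branch is pruned every chunk)
```

Because filtering operates only on generated text and reward scores, the same loop applies to open-source and API-based models alike, which is what makes the approach plug-and-play.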
Problem

Research questions and friction points this paper is trying to address.

Mitigating social bias in large language models
Reducing human and computational costs in debiasing
Maintaining generation quality while ensuring fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inference-time debiasing without retraining
Real-time fairness filtering of outputs
Token-level fairness reward model
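The paper does not spell out how the implicit reward model is constructed, but a common way to obtain an implicit reward from preference data without explicit reward annotations (as in DPO) is beta * (log pi_theta(y|x) - log pi_ref(y|x)), computed per token. The sketch below is an assumption about that construction, not the paper's stated formula; the function name and inputs are hypothetical.

```python
def implicit_token_rewards(logp_policy, logp_ref, beta=0.1):
    """Hypothetical DPO-style implicit reward per token:
    beta * (log pi_theta(token) - log pi_ref(token)),
    where logp_policy / logp_ref are per-token log-probabilities
    under the preference-tuned model and a frozen reference model.
    The paper's exact formulation may differ."""
    return [beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]

# A sequence-level fairness score can then be the sum of token rewards.
rewards = implicit_token_rewards([-1.0, -2.0], [-1.5, -1.5], beta=0.1)
print(sum(rewards))  # 0.1*0.5 + 0.1*(-0.5) = 0.0
```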
Authors
Xiaoqing Cheng, Zhengzhou University
Ruizhe Chen, Zhejiang University
Hongying Zan, Zhengzhou University
Yuxiang Jia, Zhengzhou University
Min Peng, Wuhan University