🤖 AI Summary
This work addresses the issue of social biases in large language models, which can reinforce harmful stereotypes and impede their safe deployment. The authors propose a lightweight debiasing paradigm based on “enhancement rather than suppression”: during inference, a small set of yes/no questions probes bias-related knowledge, attribution analysis identifies key neurons associated with biased responses, and their activations are selectively amplified. This approach requires no model retraining and only minimal data, yet generalizes effectively across diverse bias types and demographic groups. Experimental results demonstrate state-of-the-art debiasing performance across multiple benchmarks and mainstream large language models, with negligible degradation to the models’ general capabilities.
📝 Abstract
Large language models (LLMs) exhibit social biases that reinforce harmful stereotypes, limiting their safe deployment. Most existing debiasing methods adopt a suppressive paradigm, modifying parameters, prompts, or neurons associated with biased behavior; such approaches are often brittle, weakly generalizable, data-inefficient, and prone to degrading general capability. We propose KnowBias, a lightweight and conceptually distinct framework that mitigates bias by strengthening, rather than suppressing, neurons that encode bias knowledge. KnowBias identifies these neurons through attribution-based analysis on a small set of bias-knowledge questions and selectively enhances their activations at inference time. This design achieves strong debiasing while preserving general capabilities, generalizes across bias types and demographic groups, and is highly data-efficient, requiring only a handful of simple yes/no questions and no retraining. Experiments across multiple benchmarks and LLMs demonstrate consistent state-of-the-art debiasing performance with minimal utility degradation. Data and code are available at https://github.com/JP-25/KnowBias.
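The pipeline described above (probe with yes/no questions, score neurons by attribution, amplify the top-scoring ones at inference) can be sketched on a toy model. This is a minimal illustration, not the paper's implementation: the one-layer network, the activation-times-gradient attribution rule, the top-k selection, and the scaling factor `alpha` are all simplifying assumptions made for the example.

```python
# Hypothetical sketch of the "enhance, don't suppress" idea on a toy model.
# Assumptions (not from the paper): a one-layer ReLU network, attribution
# approximated as |activation * gradient|, top-k neuron selection, and a
# fixed amplification factor alpha applied at inference time.
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": hidden = relu(x @ W1); logit = hidden @ w2
W1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=16)

def hidden(x):
    return np.maximum(x @ W1, 0.0)

def attribution_scores(probes):
    # For this toy model the gradient of the logit w.r.t. an active hidden
    # unit is its w2 weight, so activation * gradient reduces to h * w2.
    scores = np.zeros(16)
    for x in probes:
        scores += np.abs(hidden(x) * w2)
    return scores / len(probes)

def enhanced_logit(x, key_neurons, alpha=2.0):
    # Inference-time enhancement: scale only the selected neurons;
    # no parameters are retrained.
    h = hidden(x)
    h[key_neurons] *= alpha
    return h @ w2

# Stand-ins for the bias-knowledge yes/no probes (random toy inputs here).
probes = rng.normal(size=(5, 8))
scores = attribution_scores(probes)
key = np.argsort(scores)[-3:]      # top-3 neurons by attribution score

x_new = rng.normal(size=8)
base = hidden(x_new) @ w2
enhanced = enhanced_logit(x_new, key)
```

In a real LLM the same pattern would be applied to MLP neurons via forward hooks, with the probe set being the small collection of bias-knowledge yes/no questions; the sketch only shows the score-select-amplify loop.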