🤖 AI Summary
This work addresses the issue of social biases in large language models, which can reinforce harmful stereotypes and impede their safe deployment. The authors propose a lightweight debiasing paradigm based on “enhancement rather than suppression”: during inference, a small set of yes/no questions probes bias-related knowledge, attribution analysis identifies key neurons associated with biased responses, and their activations are selectively amplified. This approach requires no model retraining and only minimal data, yet generalizes effectively across diverse bias types and demographic groups. Experimental results demonstrate state-of-the-art debiasing performance across multiple benchmarks and mainstream large language models, with negligible degradation to the models’ general capabilities.
📝 Abstract
Large language models (LLMs) exhibit social biases that reinforce harmful stereotypes, limiting their safe deployment. Most existing debiasing methods adopt a suppressive paradigm, modifying parameters, prompts, or neurons associated with biased behavior; such approaches are often brittle, weakly generalizable, data-inefficient, and prone to degrading general capability. We propose KnowBias, a lightweight and conceptually distinct framework that mitigates bias by strengthening, rather than suppressing, neurons that encode bias knowledge. KnowBias identifies these neurons through attribution-based analysis on a small set of bias-knowledge questions and selectively enhances their activations at inference time. This design achieves strong debiasing while preserving general capabilities, generalizes across bias types and demographic groups, and is highly data-efficient, requiring only a handful of simple yes/no questions and no retraining. Experiments across multiple benchmarks and LLMs demonstrate consistent state-of-the-art debiasing performance with minimal utility degradation. Data and code are available at https://github.com/JP-25/KnowBias.
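The pipeline described above (probe with yes/no questions, score neurons by attribution, amplify the top-scoring ones at inference) can be sketched on a toy model. This is a minimal illustration, not the paper's implementation: the one-layer network, the activation-times-gradient attribution rule, the top-k selection, and the scaling factor `alpha` are all simplifying assumptions made for the example.

```python
# Hypothetical sketch of the "enhance, don't suppress" idea on a toy model.
# Assumptions (not from the paper): a one-layer ReLU network, attribution
# approximated as |activation * gradient|, top-k neuron selection, and a
# fixed amplification factor alpha applied at inference time.
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": hidden = relu(x @ W1); logit = hidden @ w2
W1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=16)

def hidden(x):
    return np.maximum(x @ W1, 0.0)

def attribution_scores(probes):
    # For this toy model the gradient of the logit w.r.t. an active hidden
    # unit is its w2 weight, so activation * gradient reduces to h * w2.
    scores = np.zeros(16)
    for x in probes:
        scores += np.abs(hidden(x) * w2)
    return scores / len(probes)

def enhanced_logit(x, key_neurons, alpha=2.0):
    # Inference-time enhancement: scale only the selected neurons;
    # no parameters are retrained.
    h = hidden(x)
    h[key_neurons] *= alpha
    return h @ w2

# Stand-ins for the bias-knowledge yes/no probes (random toy inputs here).
probes = rng.normal(size=(5, 8))
scores = attribution_scores(probes)
key = np.argsort(scores)[-3:]      # top-3 neurons by attribution score

x_new = rng.normal(size=8)
base = hidden(x_new) @ w2
enhanced = enhanced_logit(x_new, key)
```

In a real LLM the same pattern would be applied to MLP neurons via forward hooks, with the probe set being the small collection of bias-knowledge yes/no questions; the sketch only shows the score-select-amplify loop.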