Interpretable Debiasing of Vision-Language Models for Social Fairness

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of social bias in vision-language models (VLMs), whose opaque reasoning processes can inadvertently encode and amplify societal prejudices. Existing debiasing approaches often operate on superficial signals and fail to account for internal model mechanisms. To overcome this limitation, we propose DeBiasLens, a framework that leverages sparse autoencoders, without requiring annotated social attribute labels, to identify neurons within multimodal encoders that exhibit heightened activation in response to specific demographic groups, including minorities. By selectively suppressing these neurons, our method enables interpretable identification and mitigation of social biases while preserving the model's core semantic understanding. DeBiasLens thus offers a label-free, mechanism-aware tool for auditing and enhancing fairness in AI systems.
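To make the mechanism concrete, here is a minimal PyTorch sketch of the SAE stage the summary describes: a sparse autoencoder trained on frozen multimodal-encoder activations so that individual latents become interpretable feature directions. All names and hyperparameters (`SparseAutoencoder`, `d_model`, `d_latent`, `l1_coeff`) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sparse autoencoder over frozen encoder activations (hypothetical sketch)."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        # An overcomplete latent space (d_latent >> d_model) encourages
        # each latent to capture a single, disentangled feature.
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))  # non-negative, sparse latent activations
        x_hat = self.decoder(z)          # reconstruction of the original activation
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction keeps semantic content; the L1 penalty enforces the
    # sparsity that makes individual latents interpretable.
    recon = (x - x_hat).pow(2).mean()
    sparsity = z.abs().mean()
    return recon + l1_coeff * sparsity

# Training would run over activations collected from a frozen multimodal
# encoder (e.g., a CLIP vision tower), shape [N, d_model]:
# for batch in acts_loader:
#     x_hat, z = sae(batch)
#     loss = sae_loss(batch, x_hat, z)
#     loss.backward(); opt.step(); opt.zero_grad()
```

Note that this stage needs no social attribute labels at all: the SAE only learns to reconstruct activations sparsely, which is what the summary means by "label-free".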

📝 Abstract
The rapid advancement of vision-language models (VLMs) has raised growing concerns that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, leaving the internal dynamics of the model largely unexplored. In this work, we introduce DeBiasLens, an interpretable, model-agnostic bias-mitigation framework that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building on the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are underrepresented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our work lays the groundwork for future auditing tools that prioritize social fairness in emerging real-world AI systems.
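As a complementary sketch, the localization and suppression steps the abstract describes might look like the following, assuming SAE latents have already been collected over images of one demographic group (`z_group`) and over the full dataset (`z_all`). The function names, the mean-difference scoring rule, and `top_k` are hypothetical illustrations, not taken from the paper.

```python
import torch

def find_social_neurons(z_group: torch.Tensor,
                        z_all: torch.Tensor,
                        top_k: int = 16) -> torch.Tensor:
    # Latents whose mean activation on one demographic group far exceeds
    # their mean activation overall are flagged as candidate
    # "social attribute neurons".
    score = z_group.mean(dim=0) - z_all.mean(dim=0)
    return torch.topk(score, k=top_k).indices

def suppress_neurons(z: torch.Tensor, neuron_idx: torch.Tensor) -> torch.Tensor:
    # Selective deactivation: zero the flagged latents before decoding,
    # leaving the remaining (semantic) latents untouched.
    z = z.clone()
    z[:, neuron_idx] = 0.0
    return z

# Usage at inference time (shapes illustrative):
# idx = find_social_neurons(z_group, z_all)           # localize
# x_debiased = sae.decoder(suppress_neurons(z, idx))  # decode edited latents
```

At inference time, the edited latents would be decoded back into the encoder's activation space, replacing the original activation so that downstream components see the debiased representation while the unedited latents continue to carry the semantic content.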
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Social Bias
Interpretability
Debiasing
Fairness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpretable Debiasing
Vision-Language Models
Sparse Autoencoders
Social Fairness
Neuron Localization