Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models often exhibit strong spurious correlations between occupations and gender, perpetuating harmful stereotypes. To address this, we propose SAE Debias—a lightweight, model-agnostic, and fine-tuning-free debiasing framework. It is the first to apply k-sparse autoencoders (k-SAEs) to T2I models, enabling unsupervised discovery of interpretable gender bias directions in the pre-trained latent space. Leveraging these directions, SAE Debias constructs occupation-specific bias subspaces and performs feature-level intervention during inference. Evaluated on Stable Diffusion variants, our method significantly mitigates gender bias—e.g., correcting skewed gender distributions for “nurse” and “programmer”—while preserving generation fidelity and diversity. This work establishes a plug-and-play, reproducible, and attributable debiasing paradigm for controllable and equitable generative AI.

📝 Abstract
Text-to-image (T2I) diffusion models often exhibit gender bias, particularly by generating stereotypical associations between professions and gendered subjects. This paper presents SAE Debias, a lightweight and model-agnostic framework for mitigating such bias in T2I generation. Unlike prior approaches that rely on CLIP-based filtering or prompt engineering, which often require model-specific adjustments and offer limited control, SAE Debias operates directly within the feature space without retraining or architectural modifications. By leveraging a k-sparse autoencoder pre-trained on a gender bias dataset, the method identifies gender-relevant directions within the sparse latent space, capturing professional stereotypes. Specifically, a biased direction per profession is constructed from sparse latents and suppressed during inference to steer generations toward more gender-balanced outputs. Trained only once, the sparse autoencoder provides a reusable debiasing direction, offering effective control and interpretable insight into biased subspaces. Extensive evaluations across multiple T2I models, including Stable Diffusion 1.4, 1.5, 2.1, and SDXL, demonstrate that SAE Debias substantially reduces gender bias while preserving generation quality. To the best of our knowledge, this is the first work to apply sparse autoencoders for identifying and intervening in gender bias within T2I models. These findings contribute toward building socially responsible generative AI, providing an interpretable and model-agnostic tool to support fairness in text-to-image generation.
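The core mechanism described in the abstract is a k-sparse autoencoder: features are encoded into an overcomplete latent space, all but the top-k activations are zeroed, and the result is decoded back. A minimal sketch of this sparsity constraint, assuming a simple linear encoder/decoder with ReLU (the weights and dimensions here are toy placeholders, not the paper's trained model):

```python
import numpy as np

def k_sae_encode(x, W_enc, b_enc, k):
    """Encode x and keep only the top-k activations (k-sparse constraint)."""
    z = np.maximum(W_enc @ x + b_enc, 0.0)      # ReLU pre-activations
    if k < z.size:
        idx = np.argpartition(z, -k)[:-k]       # indices outside the top-k
        z[idx] = 0.0                            # zero out everything else
    return z

def k_sae_decode(z, W_dec, b_dec):
    """Reconstruct the feature vector from the sparse latents."""
    return W_dec @ z + b_dec

# Toy example: 8-dim feature, 16 latents, k = 4
rng = np.random.default_rng(0)
d, m, k = 8, 16, 4
W_enc, b_enc = rng.standard_normal((m, d)), np.zeros(m)
W_dec, b_dec = rng.standard_normal((d, m)), np.zeros(d)

x = rng.standard_normal(d)
z = k_sae_encode(x, W_enc, b_enc, k)
x_hat = k_sae_decode(z, W_dec, b_dec)
```

Because at most k latents are active for any input, each latent tends to specialize, which is what makes individual directions (e.g. gender-correlated ones) interpretable and reusable across prompts.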
Problem

Research questions and friction points this paper is trying to address.

Mitigates gender bias in text-to-image generation models
Identifies gender-relevant directions using sparse autoencoder
Provides model-agnostic debiasing without retraining or modifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sparse autoencoder for gender bias control
Operates directly in feature space without retraining
Suppresses biased directions during inference
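The suppression step above amounts to removing the component of a feature activation that lies along a per-profession bias direction at inference time. A hedged sketch of that projection, assuming the bias direction `v` has already been extracted from the sparse latents (function and variable names here are hypothetical, not from the paper's code):

```python
import numpy as np

def suppress_bias_direction(h, v, alpha=1.0):
    """Remove the component of feature h along bias direction v.

    h     : feature activation at inference time
    v     : per-profession bias direction (from the sparse latents)
    alpha : suppression strength; 1.0 removes the component entirely
    """
    v_hat = v / np.linalg.norm(v)              # unit bias direction
    return h - alpha * np.dot(h, v_hat) * v_hat  # orthogonal projection step

# Toy check: after full suppression, h has no component along v
rng = np.random.default_rng(1)
h = rng.standard_normal(32)
v = rng.standard_normal(32)
h_debiased = suppress_bias_direction(h, v, alpha=1.0)
```

Since the intervention is a single vector operation on intermediate features, it needs no retraining and can in principle be applied to any model whose feature space the autoencoder was fit on, which is what makes the approach model-agnostic.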