🤖 AI Summary
Text-to-image (TTI) diffusion models achieve high visual fidelity but harbor implicit societal biases along sensitive attributes such as gender, race, and age. Existing debiasing approaches rely on manually curated or LLM-generated prompt sets, which offer limited coverage and cannot uncover unforeseen bias-triggering prompts. This paper proposes Bias-Guided Prompt Search (BGPS), a framework for interpretable, user-controllable, and high-coverage bias detection. BGPS uses attribute classifiers, operating on the TTI model's internal representations, to guide an LLM in searching a neutral prompt space for natural-language prompts that maximally amplify bias, steering the search via decoding-level control. Evaluated on Stable Diffusion 1.5 and a state-of-the-art debiased model, BGPS uncovers diverse latent biases that significantly degrade fairness metrics, while producing prompts that are more interpretable than those of a prominent hard prompt optimization baseline.
📝 Abstract
Text-to-image (TTI) diffusion models have achieved remarkable visual quality, yet they have been repeatedly shown to exhibit social biases across sensitive attributes such as gender, race, and age. To mitigate these biases, existing approaches frequently depend on curated prompt datasets, either manually constructed or generated with large language models (LLMs), as part of their training and/or evaluation procedures. Besides the curation cost, this risks overlooking unanticipated, less obvious prompts that trigger biased generation, even in models that have undergone debiasing. In this work, we introduce Bias-Guided Prompt Search (BGPS), a framework that automatically generates prompts aiming to maximize the presence of biases in the resulting images. BGPS comprises two components: (1) an LLM instructed to produce attribute-neutral prompts and (2) attribute classifiers, acting on the TTI model's internal representations, that steer the LLM's decoding process toward regions of the prompt space that amplify the image attributes of interest. We conduct extensive experiments on Stable Diffusion 1.5 and a state-of-the-art debiased model and discover an array of subtle, previously undocumented biases that severely deteriorate fairness metrics. Crucially, the discovered prompts are interpretable, i.e., they could plausibly be entered by a typical user, and they quantitatively improve the perplexity metric compared to a prominent hard prompt optimization counterpart. Our findings expose TTI vulnerabilities, while BGPS expands the bias search space and can serve as a new evaluation tool for bias mitigation.
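The abstract's second component, classifier-guided decoding, can be illustrated with a minimal sketch. This is not the paper's implementation: the vocabulary, the toy `llm_logits` function, and the `bias_score` heuristic below are all hypothetical stand-ins for, respectively, the LLM's next-token distribution and the attribute classifiers that read the TTI model's internal representations. The sketch only shows the core idea of reweighting the LLM's token scores by a bias-guidance term at each decoding step.

```python
# Illustrative sketch of bias-guided decoding (not the paper's code).
# In BGPS, llm_logits would come from a real LLM and bias_score from
# attribute classifiers on the TTI model's internal representations;
# here both are toy functions over a tiny hypothetical vocabulary.
VOCAB = ["person", "nurse", "engineer", "walking", "smiling"]

def llm_logits(prefix):
    """Toy next-token scores for the attribute-neutral prompt LLM
    (uniform here, so only the guidance term matters)."""
    return {tok: 0.0 for tok in VOCAB}

def bias_score(prefix, token):
    """Toy classifier signal: higher means appending this token is
    predicted to amplify the target attribute in generated images.
    Purely illustrative: pretend occupation words carry bias signal."""
    return 1.0 if token in {"nurse", "engineer"} else 0.0

def guided_next_token(prefix, guidance_weight=2.0):
    """Greedily pick the token maximizing LLM score + weighted guidance,
    steering decoding toward bias-amplifying prompt regions."""
    base = llm_logits(prefix)
    scores = {t: base[t] + guidance_weight * bias_score(prefix, t)
              for t in VOCAB}
    return max(scores, key=scores.get)
```

With `guidance_weight=0` the sketch reduces to plain greedy decoding; increasing it trades prompt likelihood (and hence perplexity) against bias amplification, which mirrors the interpretability trade-off the abstract highlights.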