🤖 AI Summary
Despite safety alignment and external content filters, text-to-image (T2I) models remain vulnerable to generating harmful content. Existing red-teaming methods mostly optimize prompts one at a time, which limits their scalability and the semantic diversity of the prompts they find. To address this, we propose DREAM, a framework that, unlike most prior work, models adversarial prompts at the *probability distribution level* and jointly optimizes attack effectiveness and semantic diversity. DREAM draws on energy-based models for distributional modeling and introduces GC-SPSA, a zeroth-order optimization algorithm that needs no backpropagation through the long, potentially non-differentiable T2I pipeline yet yields stable gradient estimates for end-to-end optimization. Once trained, DREAM supports efficient large-scale sampling of prompts. Extensive experiments across a broad range of T2I models and safety filters show that DREAM outperforms nine state-of-the-art baselines in both prompt success rate and the diversity of the generated adversarial prompts.
📝 Abstract
Despite the integration of safety alignment and external filters, text-to-image (T2I) generative models are still susceptible to producing harmful content, such as sexual or violent imagery. This raises serious concerns about unintended exposure and potential misuse. Red teaming, which aims to proactively identify diverse prompts that can elicit unsafe outputs from the T2I system (including the core generative model as well as potential external safety filters and other processing components), is increasingly recognized as an essential method for assessing and improving safety before real-world deployment. Yet existing automated red-teaming approaches often treat prompt discovery as an isolated, prompt-level optimization task, which limits their scalability, diversity, and overall effectiveness. To bridge this gap, we propose DREAM, a scalable red-teaming framework that automatically uncovers diverse problematic prompts for a given T2I system. Unlike most prior work, which optimizes prompts individually, DREAM directly models the probability distribution of the target system's problematic prompts. This enables explicit optimization of both effectiveness and diversity, and allows efficient large-scale sampling after training. To achieve this without direct access to representative training samples, we draw inspiration from energy-based models and reformulate the training objective into simple, tractable terms. We further introduce GC-SPSA, an efficient optimization algorithm that provides stable gradient estimates through the long and potentially non-differentiable T2I pipeline. Extensive experiments validate the effectiveness of DREAM: it surpasses 9 state-of-the-art baselines by a notable margin in prompt success rate and diversity across a broad range of T2I models and safety filters.
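To make the zeroth-order idea concrete, the following is a minimal sketch of plain SPSA (simultaneous perturbation stochastic approximation), the general technique that the name GC-SPSA suggests it builds on. It is not the paper's algorithm: the abstract does not describe what GC-SPSA changes, so none of that is reproduced here. The function names, the toy quadratic loss, and all hyperparameters are assumptions chosen for illustration; in a red-teaming setting the loss would instead be a black-box score returned by the T2I pipeline.

```python
import random


def spsa_gradient(loss, theta, c=0.1, num_samples=8, rng=None):
    """Estimate the gradient of a black-box `loss` at `theta` with plain SPSA.

    Illustrative only: this is textbook SPSA, not the paper's GC-SPSA.
    """
    if rng is None:
        rng = random.Random(0)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(num_samples):
        # Rademacher perturbation: every coordinate is +1 or -1 with equal probability.
        delta = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        plus = [t + c * d for t, d in zip(theta, delta)]
        minus = [t - c * d for t, d in zip(theta, delta)]
        # Two loss queries per sample, independent of the dimension of theta.
        diff = (loss(plus) - loss(minus)) / (2.0 * c)
        for i in range(dim):
            grad[i] += diff / delta[i]
    # Average the per-sample estimates to reduce variance.
    return [g / num_samples for g in grad]


def minimize(loss, theta, steps=200, lr=0.05, c=0.1, seed=0):
    """Gradient descent driven only by loss evaluations (no backpropagation)."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = spsa_gradient(loss, theta, c=c, rng=rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def toy_loss(x):
    # Hypothetical stand-in for a pipeline score: a quadratic with its minimum at (1, -2, 0.5).
    target = (1.0, -2.0, 0.5)
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


result = minimize(toy_loss, [0.0, 0.0, 0.0])
```

Each gradient estimate costs two loss evaluations per perturbation regardless of how many parameters are optimized, which is why this family of methods suits pipelines that cannot be differentiated end to end. Averaging over several perturbations, as done here, is one common way to reduce the variance of the estimate.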