🤖 AI Summary
Existing text-to-image (T2I) robustness evaluations suffer from adversarial prompt sets that are small, culturally imbalanced, or unrealistic. To address these issues, this paper proposes Seed2Harvest, a human-AI collaborative red-teaming framework that takes manually crafted, culturally diverse adversarial prompts as seeds and expands them through guided, AI-driven generation. This yields prompts at scale while preserving the characteristics and attack patterns of the human-written seeds. The expanded prompt set covers 535 unique geographic locations with a Shannon entropy of 7.48, compared to 58 locations and an entropy of 5.28 in the original dataset. Evaluated against three safety detectors (NudeNet, SD NSFW, and Q16), the expanded prompts maintain average attack success rates comparable to the human-crafted seeds (0.31, 0.36, and 0.12, respectively). Seed2Harvest thus offers an approach to T2I red-teaming that combines scalability, cultural diversity, and real-world relevance.
📝 Abstract
Text-to-image (T2I) models have become prevalent across numerous applications, making their robust evaluation against adversarial attacks a critical priority. Continuous access to new and challenging adversarial prompts across diverse domains is essential for stress-testing these models for resilience against novel attacks from multiple vectors. Current adversarial prompts are either authored entirely by humans or generated synthetically. On the one hand, datasets of human-crafted adversarial prompts are often too small and imbalanced in their cultural and contextual representation. On the other hand, datasets of synthetically generated prompts achieve scale, but typically lack the realistic nuances and creative adversarial strategies found in human-crafted prompts. To combine the strengths of both human and machine approaches, we propose Seed2Harvest, a hybrid red-teaming method for guided expansion of culturally diverse, human-crafted adversarial prompt seeds. The resulting prompts preserve the characteristics and attack patterns of human prompts while maintaining comparable average attack success rates (0.31 NudeNet, 0.36 SD NSFW, 0.12 Q16). Our expanded dataset achieves substantially higher diversity, with 535 unique geographic locations and a Shannon entropy of 7.48, compared to 58 locations and an entropy of 5.28 in the original dataset. Our work demonstrates the importance of human-machine collaboration in leveraging human creativity and machine computational capacity to achieve comprehensive, scalable red-teaming for continuous T2I model safety evaluation.
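To make the diversity figures concrete, the following is a minimal sketch of how a unique-location count and a Shannon entropy could be computed from a list of location labels, one label per prompt. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how locations are extracted from prompts or which logarithm base is used, so the function names (`shannon_entropy`, `diversity_summary`) and the base-2 default are hypothetical choices.

```python
import math
from collections import Counter


def shannon_entropy(labels, base=2.0):
    """Shannon entropy H = -sum(p_i * log(p_i)) of the label distribution.

    Assumption: base 2 (bits) by default; the source does not state the base.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total  # empirical probability of this label
        entropy -= p * (math.log(p) / math.log(base))
    return entropy


def diversity_summary(locations):
    """Hypothetical helper: number of distinct locations plus their entropy."""
    return {
        "unique_locations": len(set(locations)),
        "entropy": shannon_entropy(locations),
    }


if __name__ == "__main__":
    # Toy data only; these labels are not taken from the paper's dataset.
    toy_locations = ["Lagos", "Lima", "Osaka", "Lima", "Lagos", "Lima"]
    print(diversity_summary(toy_locations))
```

Entropy is maximized when every location appears equally often, in which case it equals the logarithm of the number of distinct locations. If base 2 were used, that upper bound would be about 9.06 for 535 locations and about 5.86 for 58 locations, so the reported values of 7.48 and 5.28 would both sit below their respective maxima, as expected for unevenly distributed locations.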