T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model

📅 2025-10-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing T2I safety evaluation datasets suffer from narrow risk coverage, coarse-grained annotations, and low prompt effectiveness. To address these limitations, this work introduces T2I-RiskyPrompt, a fine-grained safety benchmark designed for text-to-image (T2I) models. It proposes a hierarchical risk taxonomy of 6 primary categories and 14 fine-grained subcategories and develops a reason-driven risky image detection method that aligns an MLLM with safety annotations. The data construction pipeline integrates hierarchical classification, multi-source prompt collection, and MLLM-based safety alignment. The work releases a high-effectiveness corpus of 6,432 adversarial prompts and conducts systematic evaluations across eight mainstream T2I models and nine defense methods, yielding nine key insights into T2I safety. The benchmark improves both risk-identification accuracy and the interpretability of risk attribution, establishing a standardized, scalable paradigm for T2I safety assessment.

📝 Abstract
Using risky text prompts, such as pornographic or violent prompts, to test the safety of text-to-image (T2I) models is a critical task. However, existing risky prompt datasets are limited in three key areas: 1) narrow risk categories, 2) coarse-grained annotation, and 3) low effectiveness. To address these limitations, we introduce T2I-RiskyPrompt, a comprehensive benchmark designed for evaluating safety-related tasks in T2I models. Specifically, we first develop a hierarchical risk taxonomy consisting of 6 primary categories and 14 fine-grained subcategories. Building upon this taxonomy, we construct a pipeline to collect and annotate risky prompts, obtaining 6,432 effective risky prompts, each annotated with hierarchical category labels and a detailed risk reason. Moreover, to facilitate evaluation, we propose a reason-driven risky image detection method that explicitly aligns the MLLM with safety annotations. Based on T2I-RiskyPrompt, we conduct a comprehensive evaluation of eight T2I models, nine defense methods, five safety filters, and five attack strategies, offering nine key insights into the strengths and limitations of T2I model safety. Finally, we discuss potential applications of T2I-RiskyPrompt across various research fields. The dataset and code are available at https://github.com/datar001/T2I-RiskyPrompt.
Problem

Research questions and friction points this paper is trying to address.

Evaluates text-to-image model safety using risky prompts
Addresses limitations in existing risky prompt datasets
Provides comprehensive benchmark for safety evaluation and defense
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed hierarchical risk taxonomy for T2I models
Created pipeline for collecting annotated risky prompts
Proposed reason-driven risky image detection method that aligns the MLLM with safety annotations
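The annotation scheme above (hierarchical category labels plus a free-text risk reason per prompt) can be sketched as a simple record type. This is an illustrative assumption about the corpus layout, not the dataset's actual schema; field names, helper function, and category names here are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class RiskyPromptRecord:
    """One annotated entry in a T2I-RiskyPrompt-style corpus (fields are illustrative)."""
    prompt: str            # the risky text prompt
    primary_category: str  # one of the 6 primary risk categories
    subcategory: str       # one of the 14 fine-grained subcategories
    risk_reason: str       # free-text rationale used to align the MLLM detector


def group_by_category(records):
    """Bucket records by primary category, mirroring the hierarchical taxonomy."""
    buckets = {}
    for r in records:
        buckets.setdefault(r.primary_category, []).append(r)
    return buckets


# Hypothetical sample entries -- the real prompts and category names
# come from the released dataset.
corpus = [
    RiskyPromptRecord("...", "violence", "graphic injury", "depicts explicit physical harm"),
    RiskyPromptRecord("...", "violence", "weapon glorification", "promotes weapon use"),
]
print(sorted(group_by_category(corpus)))  # -> ['violence']
```

Keeping the risk reason alongside the category labels is what enables the reason-driven evaluation: a detector can be checked not only on its verdict but on whether its stated rationale matches the annotation.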
Chenyu Zhang
School of New Media and Communication, Tianjin University, Tianjin, China
Tairen Zhang
Medical School of Tianjin University, Tianjin, China
Lanjun Wang
School of New Media and Communication, Tianjin University, Tianjin, China
Ruidong Chen
School of Electrical and Information Engineering, Tianjin University, Tianjin, China
Wenhui Li
National Institute of Biological Sciences, Beijing
Anan Liu
School of Electrical and Information Engineering, Tianjin University, Tianjin, China