Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling

πŸ“… 2025-05-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Text-to-image (T2I) models pose risks of generating harmful content, yet existing black-box red-teaming methods rely on prior knowledge of defense mechanisms and thus struggle to adapt to diverse, unknown commercial APIs. This paper proposes RPG-RT, an LLM-driven, iterative red-teaming framework that requires no white-box access or assumptions about underlying defenses. Its core innovation is rule-based preference modeling: it translates coarse-grained safety feedback (e.g., β€œrefusal”) into fine-grained, actionable prompt optimization signals. Integrated with reinforcement feedback loops and systematic prompt engineering, RPG-RT enables dynamic, self-adaptive attacks. Evaluated across 19 open-source T2I models, 3 commercial APIs, and text-to-video (T2V) models, RPG-RT achieves significantly higher attack success rates than state-of-the-art black-box methods, demonstrating strong generalization and practical deployability.
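The core loop described above can be sketched in miniature: candidate prompts are sent to a black-box T2I system, the coarse label it returns is converted into a fine-grained score by a set of rules, and the resulting ranking supplies the preference signal for updating the attacker LLM. This is a minimal illustrative sketch, not the paper's implementation; the names `query_t2i`, `rule_score`, and `red_team_step`, the mock refusal filter, and the specific rules are all hypothetical stand-ins.

```python
# Toy sketch of one RPG-RT-style iteration. All names and rules here are
# illustrative assumptions, not the paper's actual API or rule set.

# Coarse feedback labels a black-box T2I system might return.
REFUSED, GENERATED = "refused", "generated"

def query_t2i(prompt):
    """Mock black-box T2I system: refuses any prompt containing a flagged word."""
    return REFUSED if "flagged" in prompt else GENERATED

def rule_score(prompt, feedback):
    """Rule-based preference modeling: turn a coarse label into a finer score."""
    score = 0.0
    if feedback == GENERATED:
        score += 1.0   # rule 1: prefer prompts that bypass the safety filter
    if "target" in prompt:
        score += 0.5   # rule 2: prefer prompts that preserve the target semantics
    return score

def red_team_step(candidates):
    """One iteration: query each candidate, score it, return a preference ranking."""
    scored = [(p, rule_score(p, query_t2i(p))) for p in candidates]
    return sorted(scored, key=lambda pair: -pair[1])

ranking = red_team_step(["flagged target", "paraphrased target", "unrelated text"])
best_prompt, best_score = ranking[0]
```

In the full framework, the ranking produced by a step like `red_team_step` would serve as preference data for fine-tuning the prompt-rewriting LLM, so that later iterations adapt to whatever defense the API happens to use.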

πŸ“ Abstract
Text-to-image (T2I) models raise ethical and safety concerns due to their potential to generate inappropriate or harmful images. Evaluating these models' security through red-teaming is vital, yet white-box approaches are limited by their need for internal access, complicating their use with closed-source models. Moreover, existing black-box methods often assume knowledge of the model's specific defense mechanisms, limiting their utility in real-world commercial API scenarios. A significant challenge is how to evade unknown and diverse defense mechanisms. To overcome this difficulty, we propose a novel Rule-based Preference modeling Guided Red-Teaming (RPG-RT) framework, which iteratively employs an LLM to modify query prompts and leverages feedback from T2I systems to fine-tune the LLM. RPG-RT treats the feedback from each iteration as a prior, enabling the LLM to dynamically adapt to unknown defense mechanisms. Because this feedback is often coarse-grained and label-only, making it difficult to use directly, we further propose rule-based preference modeling, which employs a set of rules to classify feedback as desired or undesired, enabling finer-grained control over the LLM's dynamic adaptation. Extensive experiments on nineteen T2I systems with varied safety mechanisms, three online commercial API services, and T2V models verify the superiority and practicality of our approach.
Problem

Research questions and friction points this paper is trying to address.

- Evaluating the security of text-to-image models via red-teaming
- Overcoming the limitations of white-box and black-box methods
- Adapting dynamically to unknown defense mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Rule-based preference modeling for red-teaming
- An LLM that iteratively modifies prompts using system feedback
- Dynamic adaptation to unknown defense mechanisms
πŸ”Ž Similar Papers
No similar papers found.