🤖 AI Summary
Text-to-video (T2V) generative models are vulnerable to jailbreaking attacks that bypass safety filters, yet existing work lacks systematic, optimization-based attack methodologies. Method: This paper introduces the first optimization-driven prompt attack framework tailored for T2V models, jointly optimizing three objectives—attack success rate, semantic fidelity, and safety filter evasion—guided by an average-score-driven prompt mutation mechanism. The framework integrates gradient-based optimization, CLIP-based semantic modeling, reverse engineering of safety classifiers, and cross-model evaluation. Contribution/Results: Evaluated on leading T2V models—including Open-Sora, Pika, Luma, and Kling—the framework achieves substantially higher attack success rates than state-of-the-art baselines, with semantic similarity improvements up to 37%. This work exposes critical safety vulnerabilities in T2V systems and establishes a novel paradigm for robustness assessment and defense-aware evaluation.
📝 Abstract
Text-to-video generative models have achieved significant progress, driven by rapid advancements in diffusion models, with notable examples including Pika, Luma, Kling, and Sora. Despite their remarkable generation ability, their vulnerability to jailbreak attacks, i.e., being induced to generate unsafe content such as pornography, violence, and discrimination, raises serious safety concerns. Existing efforts, such as T2VSafetyBench, have provided valuable benchmarks for evaluating the safety of text-to-video models against unsafe prompts but lack systematic studies for exploiting their vulnerabilities effectively. In this paper, we propose the *first* optimization-based jailbreak attack specifically designed for text-to-video models. Our approach formulates the prompt generation task as an optimization problem with three key objectives: (1) maximizing the semantic similarity between the input and generated prompts, (2) ensuring that the generated prompts can evade the safety filter of the text-to-video model, and (3) maximizing the semantic similarity between the generated videos and the original input prompts. To further enhance the robustness of the generated prompts, we introduce a prompt mutation strategy that creates multiple prompt variants in each iteration and selects the most effective one based on its averaged score. This strategy not only improves the attack success rate but also boosts the semantic relevance of the generated video. We conduct extensive experiments across multiple text-to-video models, including Open-Sora, Pika, Luma, and Kling. The results demonstrate that our method not only achieves a higher attack success rate than baseline methods but also generates videos with greater semantic similarity to the original input prompts.
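The mutate-score-select loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `semantic_similarity`, `filter_evasion`, `video_similarity`, and `mutate` are all toy stand-ins (the paper instead uses CLIP-based similarity, the target model's real safety filter, and similarity against the generated video), and the scoring weights are assumptions.

```python
import random

random.seed(0)

BLOCKLIST = {"unsafe"}  # hypothetical keyword-based safety filter


def semantic_similarity(a: str, b: str) -> float:
    """Jaccard word overlap as a cheap stand-in for CLIP text similarity."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)


def filter_evasion(prompt: str) -> float:
    """1.0 if the prompt bypasses the toy keyword filter, else 0.0."""
    return 0.0 if BLOCKLIST & set(prompt.split()) else 1.0


def video_similarity(prompt: str, target: str) -> float:
    """Stand-in for prompt-to-generated-video similarity."""
    return semantic_similarity(prompt, target)


def score(prompt: str, target: str) -> float:
    """Combine the three objectives; evasion is weighted heavily so that
    filtered prompts can never outrank evading ones (assumed weighting)."""
    return filter_evasion(prompt) + 0.5 * (
        semantic_similarity(prompt, target) + video_similarity(prompt, target)
    )


def mutate(prompt: str) -> str:
    """Trivial perturbation of one random word (stand-in for real mutation)."""
    words = prompt.split()
    i = random.randrange(len(words))
    words[i] += "_"
    return " ".join(words)


def optimize_prompt(target: str, n_iters: int = 20, n_variants: int = 8) -> str:
    """Each iteration spawns several variants and keeps the best-scoring one."""
    best = target
    for _ in range(n_iters):
        variants = [mutate(best) for _ in range(n_variants)]
        candidate = max(variants, key=lambda p: score(p, target))
        if score(candidate, target) > score(best, target):
            best = candidate
    return best
```

Even with these toy components, the loop exhibits the intended behavior: starting from a prompt blocked by the filter, it converges to a variant that evades the blocklist while staying close to the original wording.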