Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper targets a gap in security research on text-to-video (T2V) models, showing that they are vulnerable at the narrative-structure level. It proposes SceneSplit, a black-box jailbreak framework that decomposes a harmful narrative into scene fragments that are each benign on their own; presented in sequence, the fragments induce the model to generate prohibited video content. SceneSplit combines three components: narrowing of the generative output space through the combination of scenes, iterative scene manipulation to bypass safety filters, and a reusable strategy library of successful attack patterns. Evaluated across 11 safety categories, it achieves average attack success rates (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, and 78.2% on Veo2, significantly outperforming the existing baseline. The results indicate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure.
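For readers unfamiliar with the metric, the following is a minimal sketch of how an average ASR could be computed from per-category judgments. It assumes a macro-average (each safety category weighted equally); the summary does not say which averaging the authors use. The function names, category labels, and data are hypothetical and purely illustrative, not results from the paper.

```python
from statistics import mean


def category_asr(outcomes):
    """Return the percentage of attempts judged successful in one category."""
    if not outcomes:
        raise ValueError("no attempts recorded for this category")
    return 100.0 * sum(outcomes) / len(outcomes)


def average_asr(results):
    """Macro-average: mean of per-category ASRs, each category weighted equally.

    Assumption: the paper may instead pool all attempts (micro-average).
    """
    return mean(category_asr(o) for o in results.values())


# Toy, made-up judgments (True = the evaluator judged the output unsafe).
toy_results = {
    "category_a": [True, True, False, True],
    "category_b": [True, False, False, False],
}

print(round(average_asr(toy_results), 1))  # prints 50.0
```

With a macro-average, a category with few attempts counts as much as one with many, which is worth keeping in mind when comparing numbers across models.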

📝 Abstract

Along with the rapid advancement of numerous Text-to-Video (T2V) models, growing concerns have emerged regarding their safety risks. While recent studies have explored vulnerabilities in models like LLMs, VLMs, and Text-to-Image (T2I) models through jailbreak attacks, T2V models remain largely unexplored, leaving a significant safety gap. To address this gap, we introduce SceneSplit, a novel black-box jailbreak method that works by fragmenting a harmful narrative into multiple scenes, each individually benign. This approach manipulates the generative output space, the abstract set of all potential video outputs for a given prompt, using the combination of scenes as a powerful constraint to guide the final outcome. While each scene individually corresponds to a wide and safe space where most outcomes are benign, their sequential combination collectively restricts this space, narrowing it to an unsafe region and significantly increasing the likelihood of generating a harmful video. This core mechanism is further enhanced through iterative scene manipulation, which bypasses the safety filter within this constrained unsafe region. Additionally, a strategy library that reuses successful attack patterns further improves the attack's overall effectiveness and robustness. To validate our method, we evaluate SceneSplit across 11 safety categories on T2V models. Our results show that it achieves a high average Attack Success Rate (ASR) of 77.2% on Luma Ray2, 84.1% on Hailuo, and 78.2% on Veo2, significantly outperforming the existing baseline. Through this work, we demonstrate that current T2V safety mechanisms are vulnerable to attacks that exploit narrative structure, providing new insights for understanding and improving the safety of T2V models.
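The output-space argument in the abstract can be written down compactly. The block below is one possible reading under assumed notation, not the authors' formalization: every symbol is introduced here for illustration, and the subset relation is a simplifying assumption about how a combined prompt constrains the model.

```latex
% Illustrative notation only; none of these symbols come from the paper.
% \mathcal{O}(p): set of videos the model may produce for prompt p
% \mathcal{H}: set of harmful videos
% s_1, \dots, s_n: the individual scene descriptions
% \epsilon: a small probability
\begin{align}
  \Pr\bigl[v \in \mathcal{H} \mid s_i\bigr] &\le \epsilon
    \quad \text{for each } i = 1, \dots, n, \\
  \mathcal{O}(s_1, \dots, s_n) &\subseteq \bigcap_{i=1}^{n} \mathcal{O}(s_i), \\
  \Pr\bigl[v \in \mathcal{H} \mid s_1, \dots, s_n\bigr] &\gg \epsilon .
\end{align}
```

Read this way, a safety check that scores each scene in isolation only ever sees the first line, while the risk described in the abstract lives in the third.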

Problem

Research questions and friction points this paper is trying to address.

Addressing unexplored safety risks in Text-to-Video models

Developing jailbreak attacks via a scene-fragmentation strategy

Evaluating vulnerability of T2V safety mechanisms

Innovation

Methods, ideas, or system contributions that make the work stand out.

SceneSplit fragments harmful narratives into benign scenes

Iterative scene manipulation bypasses safety filters

Strategy library reuses successful attack patterns

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs