TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks

📅 2026-05-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the safety challenges of text-to-video (T2V) models, which are vulnerable not only to explicit harmful prompts and jailbreak attacks but also to temporally emergent risk, where a seemingly benign prompt becomes unsafe as the generator extrapolates it toward a coherent narrative. The authors propose TrajShield, a training-free, inference-time defense framework that formulates T2V safety as a trajectory-level causal intervention in a temporally structured semantic space. By simulating the trajectory implied by the input prompt, localizing the causal origin of the risk, and applying a minimally invasive rewrite, the method provides a unified defense against all three threat categories. Evaluated on T2VSafetyBench across 14 safety categories and multiple T2V backends, the approach reduces the average attack success rate (ASR) by 52.44% while preserving high semantic fidelity, substantially outperforming existing defenses.

📝 Abstract

Text-to-Video (T2V) models have demonstrated remarkable capability in generating temporally coherent videos from natural language prompts, yet they also risk producing unsafe content such as violence or explicit material. Existing prompt-level defenses are largely inherited from text-to-image safety and operate on the lexical surface of the input, making them vulnerable to jailbreak attacks that disguise harmful intent through rephrasing or adversarial prompting. Moreover, T2V generation introduces a distinctive challenge overlooked by prior work: temporally emergent risk, where a seemingly benign prompt leads to unsafe content through the generator's temporal extrapolation toward narrative coherence. We propose TrajShield, a training-free, inference-time defense framework that reformulates T2V safety as a causal intervention in a temporally structured semantic space. TrajShield handles explicit unsafe prompts, jailbreak attacks, and temporally emergent risks in a unified manner by simulating the implied trajectory of a prompt, localizing the causal origin of potential risk, and applying a minimally invasive rewrite that neutralizes the risk while preserving safety-irrelevant semantics. Experiments on T2VSafetyBench across 14 safety categories and multiple T2V backends demonstrate that TrajShield achieves state-of-the-art defensive performance while maintaining high semantic fidelity, substantially outperforming existing defenses, with an average ASR reduction of 52.44%.
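
The three stages named in the abstract (simulate the implied trajectory, localize the causal origin of risk, apply a minimal rewrite) can be illustrated with a toy sketch. This is not the authors' implementation: the clause-level ablation used for localization, the function names (`mediate`, `localize_risk`), and the rule-based `toy_*` stand-ins are all assumptions made for illustration. A real system would presumably replace the stand-ins with a language model and a learned safety judge.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not taken from the paper):
Simulator = Callable[[str], List[str]]  # prompt -> implied sequence of events
Judge = Callable[[List[str]], bool]     # trajectory -> True if unsafe
Rewriter = Callable[[str], str]         # risky clause -> neutral clause


def split_clauses(prompt: str) -> List[str]:
    """Split a prompt into comma-separated clauses (toy segmentation)."""
    return [c.strip() for c in prompt.split(",") if c.strip()]


def join_clauses(clauses: List[str]) -> str:
    return ", ".join(clauses)


def localize_risk(prompt: str, simulate: Simulator, is_unsafe: Judge) -> Optional[int]:
    """Return the index of the first clause whose removal makes the
    simulated trajectory safe (a counterfactual ablation), or None."""
    clauses = split_clauses(prompt)
    for i in range(len(clauses)):
        ablated = join_clauses(clauses[:i] + clauses[i + 1:])
        if not is_unsafe(simulate(ablated)):
            return i
    return None


def mediate(prompt: str, simulate: Simulator, is_unsafe: Judge, rewrite: Rewriter) -> str:
    """Simulate, localize, then rewrite only the localized clause.
    Returns the prompt unchanged if safe, or "" (refusal) if no
    single-clause rewrite neutralizes the risk."""
    if not is_unsafe(simulate(prompt)):
        return prompt
    idx = localize_risk(prompt, simulate, is_unsafe)
    if idx is None:
        return ""
    clauses = split_clauses(prompt)
    clauses[idx] = rewrite(clauses[idx])
    candidate = join_clauses(clauses)
    # Re-check the rewritten prompt before accepting it.
    return candidate if not is_unsafe(simulate(candidate)) else ""


# --- Toy stand-ins (purely illustrative) ---
def toy_simulate(prompt: str) -> List[str]:
    """Mimic temporal extrapolation: two benign clauses jointly imply a risky event."""
    events = split_clauses(prompt)
    if "match" in prompt and "gas" in prompt:
        events.append("an explosion engulfs the garage")
    return events


def toy_is_unsafe(trajectory: List[str]) -> bool:
    return any("explosion" in event for event in trajectory)


def toy_rewrite(clause: str) -> str:
    return "a man switches on a lamp"


prompt = "a man lights a match, next to leaking gas cans, in a garage"
safe_prompt = mediate(prompt, toy_simulate, toy_is_unsafe, toy_rewrite)
print(safe_prompt)
```

In this toy setup no single clause is lexically unsafe; the risk only appears in the simulated trajectory, which is the kind of case the abstract calls temporally emergent. Only the localized clause is rewritten, so the remaining clauses (the safety-irrelevant semantics) are left untouched.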

Problem

Research questions and friction points this paper is trying to address.

Text-to-Video safety

jailbreak attacks

temporally emergent risk

unsafe content generation

prompt-level defense

Innovation

Methods, ideas, or system contributions that make the work stand out.

TrajShield

text-to-video safety

jailbreak attacks