Anti-I2V: Safeguarding your photos from malicious image-to-video generation

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the growing threat of maliciously generated synthetic human videos produced by image-to-video (I2V) diffusion models, including those built on the Diffusion Transformer (DiT). To counter this, we propose Anti-I2V, a universal defense framework applicable across diverse diffusion architectures. Rather than relying on conventional RGB-space perturbations, Anti-I2V jointly injects adversarial noise in the Lab color space and the frequency domain. By analyzing the denoising trajectory, it identifies semantically sensitive layers and tailors the adversarial training objective accordingly. Anti-I2V is the first systematic defense against DiT-based I2V models and achieves state-of-the-art performance across multiple advanced video diffusion models, significantly degrading both the temporal consistency and the visual fidelity of generated videos and thereby mitigating their potential for misuse.
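
The joint Lab/frequency-domain perturbation described above can be pictured with a short PGD-style sketch. This is a hypothetical illustration, not the authors' released method: the surrogate objective `i2v_feature_loss`, the low-frequency radius, the Lab budget `eps_lab`, and the use of kornia for differentiable color conversion are all assumptions.

```python
import torch
import kornia.color as KC  # assumed dependency for differentiable RGB <-> Lab


def low_frequency_mask(hw, radius=0.25):
    """Binary mask keeping spatial frequencies below `radius` (cycles/pixel)."""
    h, w = hw
    fy = torch.fft.fftfreq(h).abs().unsqueeze(1)
    fx = torch.fft.fftfreq(w).abs().unsqueeze(0)
    return ((fy ** 2 + fx ** 2).sqrt() <= radius).float()


def i2v_feature_loss(model, image):
    """Hypothetical surrogate: any differentiable objective over the I2V model's
    internal features whose increase degrades the generated video."""
    return model(image).pow(2).mean()


def anti_i2v_perturb(model, image, steps=50, alpha=1e-2, eps_lab=2.0, radius=0.25):
    """image: (B, 3, H, W) RGB tensor in [0, 1]."""
    adv = image.detach().clone()
    mask = low_frequency_mask(image.shape[-2:], radius).to(image.device)
    for _ in range(steps):
        adv.requires_grad_(True)
        grad = torch.autograd.grad(i2v_feature_loss(model, adv), adv)[0]

        # Shape the update in the frequency domain, concentrating the perturbation
        # on components assumed to survive the video model's preprocessing.
        grad = torch.fft.ifft2(torch.fft.fft2(grad) * mask).real

        adv = (adv.detach() + alpha * grad.sign()).clamp(0, 1)

        # Enforce the perturbation budget in Lab space rather than raw RGB.
        lab_clean = KC.rgb_to_lab(image)
        lab_adv = lab_clean + (KC.rgb_to_lab(adv) - lab_clean).clamp(-eps_lab, eps_lab)
        adv = KC.lab_to_rgb(lab_adv).clamp(0, 1).detach()
    return adv
```

The step size, budget, and frequency radius above are placeholders; the paper presumably tunes such choices per diffusion backbone.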

📝 Abstract
Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation; relatively few explicitly address image-to-video diffusion models (VDMs), and those mostly focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the $L^*a^*b^*$ and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.
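
The layer-selection idea in the abstract (targeting the denoiser layers with the most distinct semantic features) can be sketched with standard PyTorch forward hooks. This is an assumed illustration only: the hook mechanics are real PyTorch API, but the layer names, the denoiser call signature `denoiser(latent, timestep)`, and the cosine-distance objective are placeholders rather than the paper's actual design.

```python
import torch
import torch.nn.functional as F


class LayerFeatureObjective:
    """Collect features from chosen denoiser layers and reward divergence
    between clean and adversarial inputs at those layers."""

    def __init__(self, denoiser, layer_names):
        self.features = {}
        for name, module in denoiser.named_modules():
            if name in layer_names:
                module.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(_module, _inputs, output):
            self.features[name] = output
        return hook

    def loss(self, denoiser, clean_latent, adv_latent, timestep):
        # Run the denoiser on the clean latent (no grad) to cache reference features.
        with torch.no_grad():
            denoiser(clean_latent, timestep)
            clean_feats = {k: v.detach() for k, v in self.features.items()}
        # Run on the adversarial latent and maximize feature drift at the hooked layers.
        denoiser(adv_latent, timestep)
        adv_feats = dict(self.features)
        return sum(
            1.0 - F.cosine_similarity(
                adv_feats[k].flatten(1), clean_feats[k].flatten(1), dim=1
            ).mean()
            for k in clean_feats
        )
```

Maximizing this kind of objective with respect to the input perturbation is one plausible way to realize the degradation of temporal coherence and fidelity that the abstract describes.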
Problem

Research questions and friction points this paper is trying to address.

image-to-video generation
diffusion models
adversarial defense
Diffusion Transformer
temporal coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anti-I2V
image-to-video diffusion models
adversarial defense
Diffusion Transformer
temporal coherence degradation
Authors
Duc Vu, Qualcomm AI Research (Computer Vision)
Anh Nguyen, Qualcomm AI Research
Chi Tran, Qualcomm AI Research
Anh Tran, Qualcomm AI Research (Computer Vision)