🤖 AI Summary
Existing image-level defenses struggle to prevent static images from being animated into deepfake videos. This work systematically analyzes two mechanisms that make image-to-video (I2V) generative models robust to such defenses, namely noise dilution during video encoding and override by text guidance, and introduces Immune2V, an immunization framework built to counter both. The approach enforces temporally balanced latent divergence at the encoder level and aligns intermediate generative representations with a precomputed collapse-inducing trajectory. Under the same imperceptibility budget, it degrades I2V synthesis more strongly and more persistently than adapted image-level baselines.
📝 Abstract
Image-to-video (I2V) generation has the potential for societal harm because it enables the unauthorized animation of static images to create realistic deepfakes. While existing defenses effectively protect against static image manipulation, extending them to I2V generation remains underexplored and non-trivial. In this paper, we systematically analyze why modern I2V models are highly robust against naive image-level adversarial perturbations (i.e., immunization). We observe that the video encoding process rapidly dilutes the adversarial noise across future frames, and that the continuous text-conditioned guidance actively overrides the intended disruptive effect of the immunization. Building on these findings, we propose Immune2V, a framework that enforces temporally balanced latent divergence at the encoder level to prevent signal dilution, and aligns intermediate generative representations with a precomputed collapse-inducing trajectory to counteract the text-guidance override. Extensive experiments demonstrate that Immune2V produces substantially stronger and more persistent degradation than adapted image-level baselines under the same imperceptibility budget.
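To make the encoder-level idea more concrete, here is a minimal toy sketch of what a "temporally balanced" divergence objective could look like. It is not the Immune2V implementation, and the abstract does not specify the actual loss. Everything here is an assumption made for illustration: the linear stand-in encoder (`toy_encoder`, with a hypothetical `decay` factor that mimics noise dilution in later frames), the sum-of-logs balancing objective, and the sign-gradient ascent under an L-infinity budget. Pixel-range clipping and the collapse-trajectory alignment stage are omitted.

```python
import numpy as np


def toy_encoder(T=4, K=6, D=16, decay=0.5, seed=1):
    """Hypothetical linear 'video encoder': frame t sees the image through W[t].

    Later frames are scaled down by decay**t, a crude stand-in for the
    dilution of image-level noise across future frames.
    """
    rng = np.random.default_rng(seed)
    base = rng.normal(size=(T, K, D))
    scale = decay ** np.arange(T)
    return base * scale[:, None, None]


def frame_divergence(W, delta):
    """Per-frame squared latent shift ||W[t] @ delta||^2 caused by delta."""
    shift = np.einsum("tkd,d->tk", W, delta)  # shape (T, K)
    return np.sum(shift ** 2, axis=1)         # shape (T,)


def balanced_objective(W, delta, tiny=1e-8):
    """Sum of log-divergences (assumed form): low if any frame is barely shifted."""
    return float(np.sum(np.log(frame_divergence(W, delta) + tiny)))


def immunize(W, budget=8 / 255, steps=50, step_size=1 / 255, seed=0, tiny=1e-8):
    """Sign-gradient ascent on the balanced objective within an L-inf budget.

    Returns the best perturbation seen, including the random starting point.
    """
    rng = np.random.default_rng(seed)
    D = W.shape[2]
    delta = rng.uniform(-budget, budget, size=D)
    best_delta, best_val = delta.copy(), balanced_objective(W, delta, tiny)
    for _ in range(steps):
        shift = np.einsum("tkd,d->tk", W, delta)
        div = np.sum(shift ** 2, axis=1) + tiny
        # Gradient of sum_t log(div_t) w.r.t. delta: sum_t 2 W[t]^T shift[t] / div_t
        grad = 2.0 * np.einsum("tkd,tk->d", W, shift / div[:, None])
        delta = np.clip(delta + step_size * np.sign(grad), -budget, budget)
        val = balanced_objective(W, delta, tiny)
        if val > best_val:
            best_delta, best_val = delta.copy(), val
    return best_delta


if __name__ == "__main__":
    W = toy_encoder()
    delta = immunize(W)
    print("per-frame divergence:", frame_divergence(W, delta))
```

The logarithm is one simple way to express balance: a plain sum of per-frame divergences would be dominated by the first, least-diluted frame, whereas the log term rewards raising whichever frame is currently least affected. Whether the paper uses this or a different balancing scheme is not stated in the abstract.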