🤖 AI Summary
This work addresses the challenge of video provenance for generative models, which typically relies on large sample sizes or additional training. We propose the first few-shot, training-free video provenance method that leverages a sliding window to perform both normal and destructive reconstructions of video patches. By analyzing the discrepancy in reconstruction losses as a provenance signal and incorporating a temporal mapping from pixel-space frames to latent-space frames, our approach effectively captures intrinsic artifacts left by generative models. Evaluated on five state-of-the-art video generation models, the method achieves an average provenance accuracy exceeding 90% with only 20 samples per model. Notably, it enables zero-shot provenance for HunyuanVideo, EasyAnimate, and Wan2.2—demonstrating unprecedented efficiency and generalization capability without any model-specific adaptation.
📝 Abstract
Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the "few-shot, training-free generated video attribution" task and propose SWIFT, which is tightly integrated with the temporal characteristics of video. By leveraging the "Pixel Frames (many) to Latent Frame (one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between the two reconstructions is then used as an attribution signal. We conducted an extensive evaluation on five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2. Our source code is available at https://github.com/wangchao0708/SWIFT.
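The core attribution signal described above can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: the `reconstruction_loss` function below is a hypothetical stand-in that merely simulates the key property SWIFT exploits (the true source model's reconstruction loss varies far more between the normal and corrupted settings than other candidates' losses do), rather than running any real generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a candidate model's reconstruction loss on one
# video chunk (a fixed-length sliding window of pixel frames mapped to a
# single latent frame). In the real method this would come from running the
# candidate generator; here we only simulate the assumed behavior: the true
# source model is far more sensitive to corruption of its own output.
def reconstruction_loss(source_model, candidate, corrupted):
    base = 0.10 if candidate == source_model else 0.50
    sensitivity = 0.60 if candidate == source_model else 0.15
    return base + (sensitivity if corrupted else 0.0) + rng.normal(0.0, 0.01)

def attribute(source_model, candidates):
    # SWIFT-style signal: the gap between corrupted and normal reconstruction
    # losses; attribute the video to the candidate with the largest gap.
    gaps = {
        c: reconstruction_loss(source_model, c, corrupted=True)
           - reconstruction_loss(source_model, c, corrupted=False)
        for c in candidates
    }
    return max(gaps, key=gaps.get)
```

Under these simulated losses, `attribute("HunyuanVideo", ["HunyuanVideo", "Wan2.2", "EasyAnimate"])` picks out the true source, since its loss gap (≈0.60) dominates the others (≈0.15).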