🤖 AI Summary
To address the growing challenge of tracing AI-generated videos amid their proliferation, this paper introduces the first large-scale, fine-grained video attribution framework, supporting attribution at five levels: authenticity, generation task, model version, development team, and the specific generator. Methodologically, it proposes Temporal Attention Signatures (T-Sigs), an interpretability method that visualizes the learned temporal differences distinguishing one generator from another. Robust spatio-temporal features are extracted by a video transformer built on features from a strong vision foundation model. The framework adopts a pretrain-and-attribute paradigm, matching fully supervised performance with only 0.5% of source-labeled data per class, and its generalization is further validated in cross-domain scenarios. Extensive experiments on public benchmarks demonstrate substantial gains over state-of-the-art methods in accuracy, interpretability, and cross-domain applicability, establishing a technical foundation for digital forensics and AI content governance.
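The pretrain-and-attribute recipe described above can be pictured as a frozen vision foundation model supplying per-frame features, a small temporal transformer aggregating them over time, and a lightweight head fit on the tiny labeled subset. The sketch below is a minimal, hypothetical PyTorch rendering of that idea; the class name, dimensions, pooling choice, and number of source classes are all illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a pretrain-and-attribute pipeline (names and sizes assumed).
import torch
import torch.nn as nn


class VideoAttributor(nn.Module):
    def __init__(self, feat_dim=768, num_sources=20, num_layers=4, num_heads=8):
        super().__init__()
        # Temporal transformer over frame-level features from a frozen foundation model.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Lightweight attribution head, the only part trained on the small labeled subset.
        self.head = nn.Linear(feat_dim, num_sources)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) extracted by the frozen backbone.
        tokens = self.temporal_encoder(frame_feats)
        clip_feat = tokens.mean(dim=1)   # temporal average pooling
        return self.head(clip_feat)      # logits over candidate generators


# Usage with dummy frame features: 2 clips, 16 frames each.
feats = torch.randn(2, 16, 768)
logits = VideoAttributor()(feats)
print(logits.shape)  # torch.Size([2, 20])
```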
📝 Abstract
The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the urgent need for AI-generated video source attribution at a large scale. Unlike traditional detection, SAGA identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling SAGA to achieve state-of-the-art attribution using only 0.5% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures (T-Sigs), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for why different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that SAGA sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.
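To make the T-Sigs idea more concrete, here is a speculative sketch of one way a temporal-attention-signature visualization could be produced: frame-to-frame attention maps collected from videos of the same generator are averaged and plotted as a heatmap, so the temporal patterns of different generators can be compared side by side. The aggregation and plotting choices are assumptions for illustration; the paper's exact T-Sigs procedure may differ.

```python
# Speculative sketch of a temporal-attention-signature style visualization.
import numpy as np
import matplotlib.pyplot as plt


def temporal_signature(attn_maps):
    """Average a list of (num_frames, num_frames) attention matrices
    collected from videos attributed to one generator."""
    return np.mean(np.stack(attn_maps), axis=0)


def plot_signatures(per_source_attn, out_path="t_sigs.png"):
    # One heatmap per generator, sharing the same frame-by-frame axes.
    fig, axes = plt.subplots(1, len(per_source_attn),
                             figsize=(4 * len(per_source_attn), 4))
    for ax, (name, maps) in zip(np.atleast_1d(axes), per_source_attn.items()):
        ax.imshow(temporal_signature(maps), cmap="viridis")
        ax.set_title(name)
        ax.set_xlabel("key frame")
        ax.set_ylabel("query frame")
    fig.savefig(out_path, bbox_inches="tight")


# Usage with dummy attention maps for two hypothetical generators.
dummy = {g: [np.random.rand(16, 16) for _ in range(8)] for g in ("gen_a", "gen_b")}
plot_signatures(dummy)
```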