🤖 AI Summary
This work addresses the provenance attribution of images generated by text-to-image (T2I) diffusion models. We propose an attribution method grounded in mid-level visual representations, specifically style and structural features, and systematically evaluate source identifiability across 12 state-of-the-art T2I models. Contrary to common assumptions, high-frequency details are not decisive for attribution: perturbing them reduces accuracy only slightly, and an attributor trained on mid-level style representations outperforms one trained on raw RGB inputs. Initialization seeds are highly detectable, and other subtle differences in the generation process are detectable to some extent. The methodology combines high-frequency perturbation analysis, ablation studies, and cross-model generalization evaluation. On a multi-model benchmark, the approach achieves high attribution accuracy, indicating that visual cues at multiple levels of granularity together enable reliable and interpretable provenance tracing. This provides an explainable technical foundation for AIGC content governance.
📝 Abstract
Modern text-to-image (T2I) diffusion models can generate images with remarkable realism and creativity. These advancements have sparked research in fake image detection and attribution, yet prior studies have not fully explored the practical and scientific dimensions of this task. In addition to attributing images to 12 state-of-the-art T2I generators, we provide extensive analyses of which inference-stage hyperparameters and image modifications are discernible. Our experiments reveal that initialization seeds are highly detectable, and that other subtle variations in the image generation process are also detectable to some extent. We further investigate which visual traces are leveraged in image attribution by perturbing high-frequency details and by employing mid-level representations of image style and structure. Notably, altering high-frequency information causes only slight reductions in accuracy, and training an attributor on style representations outperforms training on RGB images. Our analyses underscore that fake images are detectable and attributable at various levels of visual granularity.
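The two probes mentioned in the abstract, perturbing high-frequency details and describing an image by its style, can be illustrated with a minimal sketch. The abstract does not say which filter or which style descriptor the authors use, so both choices here are assumptions: a square low-pass mask in the Fourier domain stands in for the high-frequency perturbation, and a Gram matrix of feature maps (a common style descriptor) stands in for the style representation. The names `low_pass`, `gram_matrix`, and `keep_fraction` are hypothetical.

```python
import numpy as np


def low_pass(image: np.ndarray, keep_fraction: float = 0.25) -> np.ndarray:
    """Remove high spatial frequencies from an (H, W) or (H, W, C) image.

    Assumed perturbation, not necessarily the paper's: frequencies outside a
    centered square window covering roughly `keep_fraction` of each axis are
    set to zero.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    h, w = image.shape[:2]
    # Move the zero-frequency (DC) term to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    half_h = max(1, int(h * keep_fraction / 2))
    half_w = max(1, int(w * keep_fraction / 2))
    ch, cw = h // 2, w // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[max(0, ch - half_h):ch + half_h + 1,
         max(0, cw - half_w):cw + half_w + 1] = True
    if image.ndim == 3:
        mask = mask[..., None]  # broadcast the mask over channels
    filtered = np.fft.ifftshift(spectrum * mask, axes=(0, 1))
    # The mask is symmetric around DC, so the inverse transform is real
    # up to floating-point error.
    return np.fft.ifft2(filtered, axes=(0, 1)).real


def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel correlation of a (C, H, W) feature map.

    Spatial layout is averaged out, leaving a (C, C) summary often used as a
    style descriptor. Assumed here for illustration only.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)
```

In an experiment of the kind described, an attributor would be evaluated on `low_pass(image)` instead of `image` to see how much accuracy depends on fine detail, and a second attributor would be trained on `gram_matrix` outputs computed from the feature maps of some pretrained network instead of on RGB pixels.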