🤖 AI Summary
While generated images appear photorealistic to human observers, it has remained unclear whether they are equally indistinguishable to neural network classifiers; this work probes that potential gap between human perception and model-based discrimination.
Method: We propose a distribution-level discriminability analysis framework, conducting controlled comparative experiments across multiple diffusion architectures (DiT, EDM2, U-ViT), complemented by feature attribution and classifier guidance techniques.
Contributions/Results: (1) State-of-the-art diffusion models still exhibit artifacts that classifiers detect reliably; samples from different architectures (e.g., U-ViT-H vs. DiT-XL) are readily distinguishable, whereas models within the same family but of different scales (e.g., EDM2-XS vs. EDM2-XXL) remain largely confusable. (2) We pioneer the use of off-the-shelf classifiers as diagnostic tools for generative models, shedding light on model autophagy disorder and on how generated data should be used: augmenting real data with generated data is more effective than replacing it. (3) Classifier guidance is empirically shown to significantly enhance the realism of generated images.
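For orientation, contribution (3) builds on the standard classifier-guidance formulation (the general technique; the paper's exact weighting and classifier target are not specified here, so the "real" class below is an illustrative choice). The denoiser's noise prediction is shifted by the gradient of a classifier's log-probability:

```latex
% Standard classifier-guidance update, sketched for a classifier p_\phi
% that scores how "real" a noisy sample x_t looks; w is the guidance scale.
\tilde{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t)
  - w \, \sigma_t \, \nabla_{x_t} \log p_\phi(\mathrm{real} \mid x_t)
```

Intuitively, each denoising step is nudged in the direction that makes the classifier more confident the sample is real, which is consistent with the reported gain in realism.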
📝 Abstract
The ultimate goal of generative models is to perfectly capture the data distribution. For image generation, common metrics of visual quality (e.g., FID) and the perceived truthfulness of generated images seem to suggest that we are nearing this goal. However, through distribution classification tasks, we reveal that, from the perspective of neural network-based classifiers, even advanced diffusion models are still far from this goal. Specifically, classifiers are able to consistently and effortlessly distinguish real images from generated ones across various settings. Moreover, we uncover an intriguing discrepancy: classifiers can easily differentiate between diffusion models with comparable performance (e.g., U-ViT-H vs. DiT-XL), but struggle to distinguish between models within the same family but of different scales (e.g., EDM2-XS vs. EDM2-XXL). Our methodology carries several important implications. First, it naturally serves as a diagnostic tool for diffusion models by analyzing specific features of generated data. Second, it sheds light on the model autophagy disorder and offers insights into the use of generated data: augmenting real data with generated data is more effective than replacing it. Third, classifier guidance can significantly enhance the realism of generated images.
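The distribution classification task described above amounts to training a binary classifier on samples from two distributions and checking whether held-out accuracy exceeds chance. The following is a minimal, self-contained sketch of that probe; low-dimensional Gaussian features stand in for real vs. generated image features, and the logistic-regression classifier is an illustrative stand-in for the neural classifiers used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for "real" and "generated" feature distributions:
# the generated one is slightly shifted and rescaled.
n, d = 2000, 16
real = rng.normal(0.0, 1.00, size=(n, d))
fake = rng.normal(0.15, 1.05, size=(n, d))

X = np.vstack([real, fake])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = real, 1 = generated

# Shuffle, then split into train/test.
idx = rng.permutation(2 * n)
X, y = X[idx], y[idx]
Xtr, ytr, Xte, yte = X[:3000], y[:3000], X[3000:], y[3000:]

# Logistic regression trained by plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Xtr @ w + b)))
    g = p - ytr                      # gradient of the logistic loss
    w -= lr * (Xtr.T @ g) / len(ytr)
    b -= lr * g.mean()

p_te = 1.0 / (1.0 + np.exp(-(Xte @ w + b)))
acc = ((p_te > 0.5) == yte).mean()
print(f"held-out accuracy: {acc:.2f}")  # above the 0.50 chance level
```

If the two distributions matched exactly, held-out accuracy would hover at 0.50; any consistent gap above chance is the classifier-detectable signal the paper reports for real vs. diffusion-generated images.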