🤖 AI Summary
This work addresses the issue of global structural inconsistency in diffusion models, which often arises from the locality of their denoising mechanisms. To mitigate this, the authors propose a sparsely supervised learning approach that masks up to 98% of pixels during training, retaining only critical contextual information to guide globally coherent generation. The method can be implemented with minimal code changes, is fully compatible with standard denoising objectives, and substantially reduces pixel-level supervision while alleviating training instability on small datasets. Experimental results show that the proposed framework achieves competitive FID scores across multiple benchmarks, suppresses overfitting and memorization, and significantly enhances the global consistency of generated images.
📝 Abstract
Diffusion models have shown remarkable success across a wide range of generative tasks. However, they often suffer from spatially inconsistent generation, arguably due to the inherent locality of their denoising mechanisms. This can yield samples that are locally plausible but globally inconsistent. To mitigate this issue, we propose sparsely supervised learning for diffusion models, a simple yet effective masking strategy that can be implemented with only a few lines of code. Interestingly, our experiments show that it is safe to mask up to 98% of pixels during diffusion model training. Our method delivers competitive FID scores across experiments and, most importantly, avoids training instability on small datasets. Moreover, the masking strategy reduces memorization and promotes the use of essential contextual information during generation.
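The abstract describes masking most pixels out of the training objective. As an illustration only, here is a minimal sketch of what such a sparsely supervised denoising loss could look like: a random binary mask keeps roughly 2% of pixels, and the standard per-pixel MSE is averaged over just those retained pixels. The function name, the uniform-random mask, and the loss form are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def sparsely_supervised_loss(pred_noise, true_noise, mask_ratio=0.98, seed=None):
    """Hypothetical sketch: supervise only ~(1 - mask_ratio) of pixels.

    pred_noise, true_noise: arrays of the same shape (predicted vs. true noise).
    mask_ratio: fraction of pixels excluded from supervision (e.g. 0.98).
    """
    rng = np.random.default_rng(seed)
    # Binary mask: 1 = supervised pixel, 0 = masked out of the loss.
    mask = (rng.random(pred_noise.shape) >= mask_ratio).astype(pred_noise.dtype)
    # Standard denoising MSE, but averaged only over the retained pixels.
    sq_err = (pred_noise - true_noise) ** 2
    return float((sq_err * mask).sum() / np.maximum(mask.sum(), 1.0))

# Usage sketch: with mask_ratio=0.0 this reduces to the ordinary full MSE,
# which is why the approach stays compatible with standard denoising objectives.
pred = np.zeros((16, 16))
true = np.ones((16, 16))
full_mse = sparsely_supervised_loss(pred, true, mask_ratio=0.0, seed=0)
sparse = sparsely_supervised_loss(pred, true, mask_ratio=0.98, seed=0)
```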