🤖 AI Summary
The relative difficulty of denoising tasks across timesteps in diffusion models remains contested. Method: This work systematically quantifies denoising difficulty per timestep using two signals: the convergence behavior of the denoising error and the relative entropy between consecutive probability distributions across timesteps. Both indicate that denoising at early (low) timesteps is more challenging. Building on this insight, we propose a “curriculum learning” paradigm: timesteps are clustered by difficulty and trained progressively in stages, from easier to harder, with joint optimization of the noise schedule. Contribution/Results: Our approach departs from conventional training on all timesteps simultaneously, requires no architectural or loss-function modifications, and remains compatible with existing diffusion model enhancements. Experiments on unconditional generation, class-conditional generation, and text-to-image synthesis show improved model performance and faster convergence.
📝 Abstract
Diffusion-based generative models have emerged as powerful tools for generative modeling. Despite extensive research on denoising across timesteps and noise levels, prior work disagrees about the relative difficulty of the denoising tasks: some studies argue that lower timesteps present more challenging tasks, while others contend that higher timesteps are more difficult. To resolve this conflict, we examine task difficulty comprehensively, focusing on convergence behavior and on changes in relative entropy between consecutive probability distributions across timesteps. Our observational study reveals that denoising at earlier timesteps is characterized by slower convergence and higher relative entropy, indicating greater task difficulty at these lower timesteps. Building on these observations, we introduce an easy-to-hard learning scheme, drawing on curriculum learning, to improve the training of diffusion models. By organizing timesteps or noise levels into clusters and training models in ascending order of difficulty, we obtain an order-aware training regime that progresses from easier to harder denoising tasks, in contrast to the conventional approach of training diffusion models on all timesteps simultaneously. Our approach improves performance and speeds up convergence by leveraging the benefits of curriculum learning, while remaining orthogonal to existing improvements in diffusion training techniques. We validate these advantages through comprehensive experiments on image generation tasks, including unconditional, class-conditional, and text-to-image generation.
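As a rough illustration of the easy-to-hard scheme described above, the following Python sketch groups timesteps into clusters, orders the clusters from easier (high timesteps) to harder (low timesteps), and samples training timesteps only from the clusters unlocked so far. The uniform contiguous clustering, the cumulative unlocking of clusters, and the function names `make_clusters`, `curriculum_order`, and `sample_timestep` are assumptions made for illustration; they are not details taken from the paper.

```python
import random


def make_clusters(num_timesteps, num_clusters):
    """Split timesteps 0..num_timesteps-1 into contiguous, near-equal clusters.

    Assumption: uniform partitioning; other clustering criteria are possible.
    """
    bounds = [i * num_timesteps // num_clusters for i in range(num_clusters + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(num_clusters)]


def curriculum_order(clusters):
    """Order clusters easy-to-hard.

    Following the observation that lower timesteps are harder, the cluster
    with the highest timesteps comes first and the lowest comes last.
    """
    return sorted(clusters, key=lambda c: c.start, reverse=True)


def sample_timestep(ordered_clusters, stage, rng):
    """Sample a timestep uniformly from all clusters unlocked up to `stage`.

    Assumption: earlier (easier) clusters stay in the pool as harder ones
    are added, so the final stage covers every timestep.
    """
    unlocked = [t for c in ordered_clusters[: stage + 1] for t in c]
    return rng.choice(unlocked)
```

In a training loop, `stage` would be advanced on some schedule (for example, after a fixed number of iterations, which is again an assumption), and each batch would draw its timesteps with `sample_timestep` in place of uniform sampling over all timesteps. The model and loss stay unchanged, which matches the claim that the scheme needs no architectural or loss-function modifications.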