Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models employ a uniform denoising strategy across all text prompts, resulting in a trade-off between generation quality and inference efficiency. To address this, we propose an adaptive noise scheduling mechanism that, for the first time, formulates the joint optimization of denoising step count and noise level as a sequential decision-making problem. We design a Time Prediction Module (TPM) trained with Proximal Policy Optimization (PPO) reinforcement learning, enabling step-wise, latent-feature-driven dynamic scheduling. The method is plug-and-play and compatible with mainstream diffusion architectures, including Stable Diffusion 3 Medium. Experiments demonstrate that our approach reduces denoising steps by approximately 50% while achieving higher aesthetic scores (5.44) and human preference scores (29.59), thereby significantly improving both generation efficiency and image fidelity.

📝 Abstract
Diffusion and flow models have achieved remarkable successes in various applications such as text-to-image generation. However, these models typically rely on the same predetermined denoising schedules during inference for each prompt, which potentially limits the inference efficiency as well as the flexibility when handling different prompts. In this paper, we argue that the optimal noise schedule should adapt to each inference instance, and introduce the Time Prediction Diffusion Model (TPDM) to accomplish this. TPDM employs a plug-and-play Time Prediction Module (TPM) that predicts the next noise level based on current latent features at each denoising step. We train the TPM using reinforcement learning, aiming to maximize a reward that discounts the final image quality by the number of denoising steps. With such an adaptive scheduler, TPDM not only generates high-quality images that are aligned closely with human preferences but also adjusts the number of denoising steps and time on the fly, enhancing both performance and efficiency. We train TPDMs on multiple diffusion model benchmarks. With Stable Diffusion 3 Medium architecture, TPDM achieves an aesthetic score of 5.44 and a human preference score (HPS) of 29.59, while using around 50% fewer denoising steps to achieve better performance. We will release our best model alongside this paper.
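The abstract's core loop (a plug-in module that inspects the current latent and chooses the next noise level, so the step count varies per instance) can be sketched as below. This is a toy illustration, not the authors' code: `tpm_predict_next_t`, `denoiser`, and `adaptive_sample` are hypothetical names, and the simple energy heuristic stands in for the learned PPO-trained TPM.

```python
# Hypothetical sketch of TPDM-style adaptive scheduling. The real TPM is
# a small network trained with PPO to maximize final image quality
# discounted by the number of denoising steps; here a heuristic on the
# latent's magnitude plays its role, purely for illustration.

def tpm_predict_next_t(latent, t):
    """Toy stand-in for the TPM: map latent statistics to the next
    noise level t_next in (0, t)."""
    energy = sum(abs(x) for x in latent) / len(latent)
    # Noisier latent -> conservative (small) jump; cleaner -> larger jump.
    shrink = 0.5 + 0.4 * min(energy, 1.0)   # factor in [0.5, 0.9]
    return t * shrink

def denoiser(latent, t_next):
    """Toy denoiser standing in for one diffusion/flow update:
    contracts the latent toward the data manifold (here, zero)."""
    return [x * t_next for x in latent]

def adaptive_sample(latent, t=1.0, t_min=0.05, max_steps=50):
    """Denoise until the predicted noise level falls below t_min;
    the number of steps is decided on the fly, not fixed in advance."""
    steps = 0
    while t > t_min and steps < max_steps:
        t = tpm_predict_next_t(latent, t)
        latent = denoiser(latent, t)
        steps += 1
    return latent, steps
```

Because the stopping point depends on the latent itself, easy instances terminate in fewer steps than hard ones, which is the mechanism behind the roughly 50% step reduction reported above.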
Problem

Research questions and friction points this paper is trying to address.

Diffusion Models
Image Generation
Adaptive Denoising Strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Time Predictive Diffusion Model
Dynamic Noise Adjustment
Efficiency and Quality Unification
Zilyu Ye
MAPLE Lab, Westlake University; South China University of Technology
Zhiyang Chen
MAPLE Lab, Westlake University; Institute of Advanced Technology, Westlake Institute for Advanced Study
Tiancheng Li
Westlake University, Zhejiang University
image generation
multimodal generation
reinforcement learning
Zemin Huang
PhD student, Westlake University, Zhejiang University
Diffusion Model
Autoregressive Model
Diffusion Distillation
Weijian Luo
Peking University
Human-preferred Generative Models
Large Vision-language Models
Guo-Jun Qi
MAPLE Lab, Westlake University; Institute of Advanced Technology, Westlake Institute for Advanced Study