🤖 AI Summary
Diffusion models suffer from misaligned training and sampling objectives, information leakage from the progressive noising process, and difficulty incorporating high-level perceptual or adversarial losses. To address these bottlenecks, this paper proposes the first truly end-to-end diffusion framework—abandoning the conventional multi-step Markov denoising process in favor of directly learning a differentiable mapping from pure Gaussian noise to the target image. Key contributions include: (1) unifying training and sampling objectives to eliminate the train-inference gap; (2) introducing an implicit noise-to-data modeling architecture; (3) enabling joint optimization with perceptual and adversarial losses; and (4) completely avoiding the information leakage inherent in iterative noise scheduling. Experiments on COCO30K and HW30K demonstrate significant improvements in FID and CLIP Score, while maintaining high-fidelity generation with only a few sampling steps.
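The paper's actual architecture and loss weights are not specified in the summary, but the combined objective it describes (a reconstruction term plus perceptual and adversarial terms, all applied to the final output of a direct noise-to-image mapping) can be sketched as a toy example. Everything below is a hypothetical stand-in: `generator` is a linear-tanh map in place of a real network, `perceptual_features` a fixed random projection in place of a frozen feature extractor such as a VGG layer, `discriminator` a logistic probe, and the weights 0.1 and 0.01 are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 64  # batch of 4 flattened toy "images"

def generator(z, W):
    # Direct differentiable mapping from pure noise z to a sample,
    # replacing the multi-step Markov denoising chain (toy stand-in).
    return np.tanh(z @ W)

def perceptual_features(x, P):
    # Stand-in for a frozen perceptual feature extractor;
    # here just a fixed random projection with a ReLU.
    return np.maximum(x @ P, 0.0)

def discriminator(x, v):
    # Toy discriminator: logistic score from a linear probe.
    return 1.0 / (1.0 + np.exp(-(x @ v)))

W = rng.normal(scale=0.1, size=(D, D))   # generator weights
P = rng.normal(scale=0.1, size=(D, 16))  # frozen "perceptual" projection
v = rng.normal(scale=0.1, size=D)        # discriminator probe

z = rng.normal(size=(B, D))              # pure Gaussian noise input
x_real = rng.normal(size=(B, D))         # target data batch (placeholder)

x_fake = generator(z, W)

# End-to-end objective on the *final* reconstruction, as described:
# reconstruction + perceptual + adversarial terms, jointly optimizable
# because the whole noise-to-image path is one differentiable function.
l_rec = np.mean((x_fake - x_real) ** 2)
l_perc = np.mean((perceptual_features(x_fake, P)
                  - perceptual_features(x_real, P)) ** 2)
l_adv = -np.mean(np.log(discriminator(x_fake, v) + 1e-8))

loss = l_rec + 0.1 * l_perc + 0.01 * l_adv
```

In a conventional diffusion model the perceptual and adversarial terms cannot be attached this way, because training supervises isolated denoising steps rather than the final sample; making the full mapping differentiable is what enables the joint objective.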
📝 Abstract
Diffusion models have emerged as a powerful framework for generative modeling, achieving state-of-the-art performance across various tasks. However, they face several inherent limitations, including a training-sampling gap, information leakage in the progressive noising process, and the inability to incorporate advanced loss functions, such as perceptual and adversarial losses, during training. To address these challenges, we propose an innovative end-to-end training framework that aligns the training and sampling processes by directly optimizing the final reconstruction output. Our method eliminates the training-sampling gap, mitigates information leakage by treating the training process as a direct mapping from pure noise to the target data distribution, and enables the integration of perceptual and adversarial losses into the objective. Extensive experiments on benchmarks such as COCO30K and HW30K demonstrate that our approach consistently outperforms traditional diffusion models, achieving superior results in terms of FID and CLIP score, even with fewer sampling steps. These findings highlight the potential of end-to-end training to advance diffusion-based generative models toward more robust and efficient solutions.
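The training-sampling gap that the abstract refers to can be made concrete with a toy contrast: a conventional sampler composes many small denoising updates at inference time, whereas the end-to-end view produces the sample in a single forward pass, so the training loss applies to exactly what sampling returns. The functions and the 50-step/0.1-rate loop below are hypothetical illustrations, not the paper's actual sampler.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64
W = rng.normal(scale=0.05, size=(D, D))

def direct_map(z):
    # End-to-end model: one differentiable pass from noise to sample,
    # so training and sampling evaluate the same function.
    return np.tanh(z @ W)

def iterative_sampler(z, steps=50):
    # Diffusion-style loop: many small denoising updates. Training
    # supervises each step in isolation, so per-step errors compound
    # at sampling time -- the train-inference gap in the abstract.
    x = z
    for _ in range(steps):
        x = x + 0.1 * (np.tanh(x @ W) - x)
    return x

z = rng.normal(size=(2, D))
one_pass = direct_map(z)          # 1 network evaluation
many_pass = iterative_sampler(z)  # 50 network evaluations
```

The cost asymmetry (one evaluation versus dozens) is also why the reported results with only a few sampling steps matter: aligning training with the sampled output removes the need for a long refinement chain.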