🤖 AI Summary
This work investigates whether explicit time conditioning is necessary in diffusion generative models, noting that removing it degrades performance under deterministic sampling schemes such as DDIM. Through a geometric analysis of the forward diffusion process, the authors show that high-dimensional noisy data concentrate on low-dimensional, class-specific hypercylindrical manifolds. Building on this insight, they modify the DDIM forward process so that the noisy-data manifold evolves as it does under flow matching, which allows high-quality generation without explicit time conditioning. The framework is further extended to class-conditional generation by assigning classes to distinct time spaces, so that a class-unconditional denoising model can perform controllable synthesis. Experimental results support the approach in terms of both image generation quality and conditional control.
📝 Abstract
In practice, diffusion models are usually trained with explicit time conditioning, which guides the network through the denoising sampling process. In deterministic methods such as DDIM in particular, removing time conditioning leads to significant performance degradation. However, other deterministic sampling approaches, such as flow matching, can generate high-quality content without this conditioning, which raises the question of whether it is necessary. In this work, we revisit the role of time conditioning from a geometric perspective. We analyze how noisy data distributions evolve under the forward diffusion process and demonstrate that, in high-dimensional spaces, these distributions concentrate on low-dimensional hyper-cylinder-like manifolds embedded within the input space. Successful generation, we argue, stems from the disentanglement of these manifolds in high-dimensional space. Based on this insight, we modify the forward process of DDIM to align the noisy data manifold with that of flow matching, and prove that DDIM can generate high-quality content without time conditioning, provided the noisy manifold evolves according to the flow-matching method. Additionally, we extend our framework to class-conditioned generation by decoupling classes into distinct time spaces, enabling class-conditioned synthesis with a class-unconditional denoising model. Extensive experiments validate our theoretical analysis and show that high-quality generation is achievable without explicit conditional embeddings.
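The geometric claim in the abstract rests on a standard concentration-of-measure fact: for a noisy sample of the form x_t = a·x_0 + b·ε with ε ~ N(0, I_d), the noise component b·ε lies close to a sphere of radius b·√d once d is large. The sketch below illustrates only that fact and is not the authors' construction. The function names are hypothetical, and the two coefficient schedules are assumptions: the usual variance-preserving form for DDIM and a commonly used linear interpolation path for flow matching.

```python
import numpy as np


def noise_radius_stats(dim, sigma, n_samples=1000, seed=0):
    """Mean and relative spread (std / mean) of ||sigma * eps||, eps ~ N(0, I_dim)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, dim))
    radii = np.linalg.norm(sigma * eps, axis=1)
    return radii.mean(), radii.std() / radii.mean()


def ddim_coeffs(alpha_bar):
    # Assumed variance-preserving forward process:
    # x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps
    return np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)


def flow_matching_coeffs(t):
    # Assumed linear interpolation path (conventions differ between papers):
    # x_t = (1 - t) * x_0 + t * eps
    return 1.0 - t, t


# The relative spread of the noise radius shrinks as the dimension grows,
# so the noise component sits on an increasingly thin shell of radius b * sqrt(dim).
for dim in (2, 64, 4096):
    _, b = ddim_coeffs(alpha_bar=0.75)
    mean_radius, rel_spread = noise_radius_stats(dim, sigma=b)
    print(f"dim={dim:5d}  b*sqrt(dim)={b * np.sqrt(dim):8.3f}  "
          f"mean radius={mean_radius:8.3f}  relative spread={rel_spread:.4f}")
```

Under these assumptions the shell radius depends only on the noise coefficient b. The two schedules differ in how b, and hence the radius, changes along the forward process, and this evolution of the noisy manifold is what the abstract proposes to align with flow matching.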