🤖 AI Summary
Current image generation models struggle to balance computational efficiency and generation quality: VAEs suffer from information loss and limited end-to-end trainability; pixel-space diffusion models incur high computational overhead; and cascaded architectures face distribution mismatch, knowledge fragmentation, and difficulties in joint optimization due to their staged design. To address these limitations, we propose a unified multi-stage diffusion framework grounded in conditional dependency coupling. Our approach models image generation as a multi-step interpolation trajectory and implements it via a single Diffusion Transformer that enables cross-stage parameter sharing and end-to-end joint optimization. Leveraging stochastic interpolation and conditional coupling, the framework performs multi-scale modeling directly in pixel space. Experiments demonstrate that our method achieves high-fidelity generation across diverse resolutions while maintaining efficient inference, significantly outperforming state-of-the-art VAE- and cascade-based systems.
📝 Abstract
Existing image generation models face a critical trade-off between computational cost and fidelity. Models that rely on a pretrained Variational Autoencoder (VAE) suffer from information loss, limited detail, and an inability to support end-to-end training; models operating directly in pixel space incur prohibitive computational cost. Cascade models mitigate this cost, but their stage-wise separation prevents effective end-to-end optimization, hampers knowledge sharing, and often leads to inaccurate distribution learning within each stage. To address these challenges, we introduce a unified multistage generative framework based on our proposed Conditional Dependent Coupling strategy. It decomposes the generative process into interpolant trajectories across multiple stages, ensuring accurate distribution learning while enabling end-to-end optimization. Importantly, the entire process is modeled by a single unified Diffusion Transformer, eliminating disjoint modules and enabling knowledge sharing across stages. Extensive experiments demonstrate that our method achieves both high fidelity and efficiency across multiple resolutions.
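To make the interpolant-trajectory idea concrete, here is an illustrative sketch based on standard stochastic interpolants. The notation ($\alpha$, $\beta$, the stage index $s$, and the exact form of the conditioning) is an assumption for exposition, not the paper's own formulation:

```latex
% Illustrative sketch only; notation is assumed, not taken from the paper.
% Per-stage interpolant between Gaussian noise x_0 and stage-s data x_1:
\[
  x_t^{(s)} \;=\; \alpha(t)\,x_0^{(s)} \;+\; \beta(t)\,x_1^{(s)},
  \qquad t \in [0,1],
  \quad \text{e.g. } \alpha(t) = 1 - t,\ \beta(t) = t,
\]
% Conditional coupling: stage s conditions on the previous stage's output
% x^{(s-1)} (e.g. a lower-resolution image), and a single shared Diffusion
% Transformer v_\theta is trained to match the trajectory's velocity:
\[
  \min_\theta \;\; \mathbb{E}\,
  \Bigl\| \, v_\theta\bigl(x_t^{(s)},\, t,\, s,\, x^{(s-1)}\bigr)
  \;-\; \bigl(\alpha'(t)\,x_0^{(s)} + \beta'(t)\,x_1^{(s)}\bigr) \Bigr\|^2 .
\]
```

Under this reading, sharing one network $v_\theta$ across all stages (rather than training one model per cascade stage) is what permits end-to-end optimization and cross-stage knowledge sharing.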