On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning

📅 2025-05-30
🤖 AI Summary
Diffusion autoencoders (DAs) face two key limitations: (1) generative performance is constrained by how faithfully the latent variables can be modelled and sampled, and (2) learnable forward (noising) processes, while effective at input-dependent noise scheduling, must satisfy additional constraints derived from the terminal conditions of the diffusion process. To address both, the paper proposes DMZ, a DA framework that connects these two classes of models, enabling joint optimization of representation learning and generation efficiency under a unified objective. Its contributions include: (i) establishing a theoretical connection between DAs and learnable forward processes; and (ii) designing a latent-variable selection mechanism and conditioning strategy that enhance domain transferability and enable few-step generation. Experiments indicate that DMZ produces effective representations on downstream classification and cross-domain transfer tasks while achieving high-fidelity generation in only 4–8 denoising steps, fewer than standard diffusion models require.

📝 Abstract
Diffusion autoencoders (DAs) are variants of diffusion generative models that use an input-dependent latent variable to capture representations alongside the diffusion process. These representations, to varying extents, can be used for tasks such as downstream classification, controllable generation, and interpolation. However, the generative performance of DAs relies heavily on how well the latent variables can be modelled and subsequently sampled from. Better generative modelling is also the primary goal of another class of diffusion models -- those that learn their forward (noising) process. While effective at adjusting the noise process in an input-dependent manner, they must satisfy additional constraints derived from the terminal conditions of the diffusion process. Here, we draw a connection between these two classes of models and show that certain design decisions (latent variable choice, conditioning method, etc.) in the DA framework -- leading to a model we term DMZ -- allow us to obtain the best of both worlds: effective representations as evaluated on downstream tasks, including domain transfer, as well as more efficient modelling and generation with fewer denoising steps compared to standard DMs.
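As a rough illustration of the setup the abstract describes, the sketch below pairs a toy encoder with a Gaussian forward noising step whose schedule depends on the per-input latent. All names (`encode`, `alpha_bar`, `forward_noise`) and the latent-dependent phase shift are illustrative assumptions, not the paper's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    # Toy stand-in for the semantic encoder that maps an input x to a
    # latent z; a fixed averaging projection, purely for illustration.
    W = np.ones((x.shape[-1], 4)) / x.shape[-1]
    return np.tanh(x @ W)

def alpha_bar(t, z, T=1000):
    # Input-dependent signal level: a cosine schedule whose phase is
    # shifted by the latent z (an illustrative choice, not the paper's
    # parameterization of the learnable forward process).
    s = 0.008
    shift = 0.1 * z.mean()
    f = np.cos(((t / T + s) / (1 + s) + shift) * np.pi / 2) ** 2
    return np.clip(f, 1e-5, 1.0)

def forward_noise(x0, t, z, T=1000):
    # q(x_t | x_0, z): Gaussian forward step in which the noise level
    # depends on the per-input latent, as in a learnable forward process.
    ab = alpha_bar(t, z, T)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

x0 = rng.standard_normal(8)
z = encode(x0)
xt, eps = forward_noise(x0, t=500, z=z)
```

A real DA would train the encoder jointly with the denoiser under the diffusion objective; here both pieces are fixed toys, showing only how the latent can enter the forward process.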
Problem

Research questions and friction points this paper is trying to address.

Designing diffusion autoencoders for efficient generation and representation learning
Improving latent variable modeling for better generative performance
Balancing representation quality and generation efficiency in diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion autoencoders with input-dependent latent variables
Combining effective representations and efficient generation
DMZ model jointly optimizes latent-variable choice and conditioning method
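The few-step generation highlighted above can be loosely illustrated with a deterministic, DDIM-style sampler conditioned on a latent z. The noise predictor here is a hypothetical stand-in for the learned conditional network, so this sketches only the sampling mechanics, not DMZ itself.

```python
import numpy as np

def alpha_bar(t, T=1000):
    # Standard cosine schedule, clipped away from 0 and 1 so the
    # terminal step stays numerically stable.
    s = 0.008
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return np.clip(f, 1e-3, 0.9999)

def predict_eps(x_t, t, z):
    # Hypothetical stand-in for the learned conditional noise network
    # eps_theta(x_t, t, z); a linear map used purely for illustration.
    return 0.5 * x_t + 0.1 * z.mean()

def ddim_sample(z, dim=8, steps=4, T=1000, seed=1):
    # Deterministic few-step sampling conditioned on the latent z:
    # at each step, estimate x_0 from the predicted noise, then jump
    # directly to the next (coarser) timestep.
    x = np.random.default_rng(seed).standard_normal(dim)
    ts = np.linspace(T, 0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        ab_cur, ab_next = alpha_bar(t_cur, T), alpha_bar(t_next, T)
        eps = predict_eps(x, t_cur, z)
        x0_hat = (x - np.sqrt(1.0 - ab_cur) * eps) / np.sqrt(ab_cur)
        x = np.sqrt(ab_next) * x0_hat + np.sqrt(1.0 - ab_next) * eps
    return x

sample = ddim_sample(z=np.zeros(4), steps=4)
```

With only 4 coarse timesteps the loop runs 4 network evaluations, matching the 4–8-step regime the summary reports for DMZ.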