DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fundamental disconnect between generative capability and discriminative representation learning in diffusion models. To bridge this gap, we propose a self-conditioning mechanism that leverages semantic representations extracted from the denoising network’s own outputs to guide its decoding process, thereby constructing a semantically compact bottleneck for joint optimization of generation and representation learning. Methodologically, we introduce contrastive self-distillation into the diffusion training framework for the first time—enabling seamless, end-to-end optimization in both pixel and latent spaces while maintaining full compatibility with state-of-the-art architectures such as UViT and DiT. Experiments demonstrate substantial improvements: FID scores drop significantly; linear evaluation accuracy surpasses leading self-supervised methods including SimCLR and DINO; computational overhead increases by only 1%; and the approach exhibits strong architectural generalizability—effectively transcending the traditional generative-discriminative dichotomy.
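To make the self-conditioning idea concrete, here is a minimal numpy sketch: intermediate denoiser features are pooled into a compact semantic vector, which then modulates the decoder features. All names, shapes, and the modulation form (a scale-and-shift in the spirit of adaptive normalization) are illustrative assumptions, not the paper's actual UViT/DiT implementation.

```python
import numpy as np

def pool_semantics(encoder_feats):
    """Condense intermediate denoiser features into a compact semantic
    vector; global average pooling stands in for the paper's bottleneck."""
    # encoder_feats: (tokens, dim) -> (dim,)
    return encoder_feats.mean(axis=0)

def self_condition(decoder_feats, semantic_vec, w_scale, w_shift):
    """Guide the decoder with the pooled semantics via a learned
    scale-and-shift (an assumed conditioning form, for illustration)."""
    scale = semantic_vec @ w_scale   # (dim,)
    shift = semantic_vec @ w_shift   # (dim,)
    return decoder_feats * (1.0 + scale) + shift

rng = np.random.default_rng(0)
enc = rng.standard_normal((16, 8))      # 16 tokens, 8-dim toy features
dec = rng.standard_normal((16, 8))
w_s = rng.standard_normal((8, 8)) * 0.01
w_b = rng.standard_normal((8, 8)) * 0.01

z = pool_semantics(enc)                  # compact semantic bottleneck
out = self_condition(dec, z, w_s, w_b)   # conditioned decoder features
```

Because the semantic vector is produced by the same network it conditions, generation quality and representation quality are optimized jointly through this single bottleneck.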

📝 Abstract
While diffusion models have gained prominence in image synthesis, their generative pre-training has been shown to yield discriminative representations, paving the way towards unified visual generation and understanding. However, two key questions remain: 1) Can these representations be leveraged to improve the training of diffusion models themselves, rather than solely benefiting downstream tasks? 2) Can the feature quality be enhanced to rival or even surpass modern self-supervised learners, without compromising generative capability? This work addresses these questions by introducing self-conditioning, a straightforward yet effective mechanism that internally leverages the rich semantics inherent in the denoising network to guide its own decoding layers, forming a tighter bottleneck that condenses high-level semantics to improve generation. Results are compelling: our method boosts both generation FID and recognition accuracy with only 1% computational overhead and generalizes across diverse diffusion architectures. Crucially, self-conditioning facilitates an effective integration of discriminative techniques, such as contrastive self-distillation, directly into diffusion models without sacrificing generation quality. Extensive experiments on pixel-space and latent-space datasets show that in linear evaluations, our enhanced diffusion models, particularly UViT and DiT, serve as strong representation learners, surpassing various self-supervised models.
Problem

Research questions and friction points this paper is trying to address.

Enhancing diffusion models for unified generative and discriminative learning
Improving feature quality without compromising generative capability
Integrating discriminative techniques into diffusion models effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-conditioning mechanism enhances diffusion models
Integrates discriminative techniques without quality loss
Boosts generation and recognition with minimal overhead
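The contrastive self-distillation mentioned above can be sketched in a few lines of numpy: a student embedding of one augmented view is pulled toward an EMA teacher's embedding of another view, with the teacher treated as a constant target. The single-linear-layer encoder, shapes, and momentum value are toy assumptions; the paper applies this inside diffusion training, not as a standalone objective.

```python
import numpy as np

def embed(x, w):
    """Toy encoder: one linear map followed by L2 normalization."""
    h = x @ w
    return h / np.linalg.norm(h, axis=-1, keepdims=True)

def self_distill_loss(student_w, teacher_w, view_a, view_b):
    """Negative cosine similarity between the student's embedding of one
    view and the teacher's embedding of the other (stop-gradient target)."""
    s = embed(view_a, student_w)
    t = embed(view_b, teacher_w)   # teacher output treated as a constant
    return -np.mean(np.sum(s * t, axis=-1))

def ema_update(teacher_w, student_w, momentum=0.996):
    """Exponential moving average update of the teacher from the student."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

rng = np.random.default_rng(1)
w_student = rng.standard_normal((8, 4))
w_teacher = w_student.copy()
x_a = rng.standard_normal((32, 8))              # one augmented "view"
x_b = x_a + 0.1 * rng.standard_normal((32, 8))  # a second, perturbed view

loss = self_distill_loss(w_student, w_teacher, x_a, x_b)
w_teacher = ema_update(w_teacher, w_student)
```

In the full method, this discriminative signal is added alongside the denoising objective, which is what lets the model improve linear-probe accuracy without sacrificing FID.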
Weilai Xiang
Beihang University
Computer Vision · Generative Models · Representation Learning
Hongyu Yang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; Institute of Artificial Intelligence, Beihang University
Di Huang
School of Computer Science and Engineering, Beihang University
Yunhong Wang
Professor, School of Computer Science and Engineering, Beihang University
Biometrics · Pattern Recognition · Image Processing · Computer Vision