🤖 AI Summary
Denoising generative models (e.g., diffusion models) suffer from high training costs and inefficient representation learning. Existing discriminative alignment approaches rely on external pretrained encoders, incurring additional computational overhead and domain-shift issues. This paper proposes an encoder-free contrastive memory-bank framework: it decouples the number of negative samples from the batch size via a dynamically updated large-scale negative queue, and integrates a low-dimensional projection head with the denoising objective to enable self-contained contrastive learning with zero inference overhead. The method significantly accelerates convergence, achieving FID 2.40 on ImageNet-256 within 400K steps and setting a new state of the art at the time. It establishes a paradigm for efficient self-supervised representation learning in generative modeling, eliminating reliance on external architectures while maintaining end-to-end trainability and inference efficiency.
📝 Abstract
The dominance of denoising generative models (e.g., diffusion, flow matching) in visual synthesis is tempered by their substantial training costs and inefficiencies in representation learning. While injecting discriminative representations via auxiliary alignment has proven effective, this approach still faces a key limitation: reliance on external, pre-trained encoders introduces overhead and domain shift. A dispersion-based strategy that encourages strong separation among in-batch latent representations alleviates this specific dependency. To assess the effect of the number of negative samples in generative modeling, we propose {mname}, a plug-and-play training framework that requires no external encoders. Our method integrates a memory bank mechanism that maintains a large, dynamically updated queue of negative samples across training iterations. This decouples the number of negatives from the mini-batch size, providing abundant, high-quality negatives for a contrastive objective without a multiplicative increase in computational cost. A low-dimensional projection head further minimizes memory and bandwidth overhead. {mname} offers three principal advantages: (1) it is self-contained, eliminating dependency on pretrained vision foundation models and their associated forward-pass overhead; (2) it introduces no additional parameters or computational cost during inference; and (3) it enables substantially faster convergence, achieving superior generative quality more efficiently. On ImageNet-256, {mname} achieves a state-of-the-art FID of **2.40** within 400k steps, significantly outperforming comparable methods.
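The core mechanism described above — a FIFO queue that decouples the number of contrastive negatives from the mini-batch size — can be sketched as follows. This is a minimal, hypothetical NumPy illustration in the style of MoCo-like memory banks, not the paper's implementation; all names (`NegativeQueue`, `info_nce`, the temperature value) are assumptions, and the low-dimensional projection head is represented only by the small embedding dimension:

```python
import numpy as np

class NegativeQueue:
    """FIFO memory bank of L2-normalized negative embeddings.
    The queue size K is independent of the training batch size,
    so each step sees K negatives at the cost of a lookup, not K forward passes."""
    def __init__(self, dim: int, size: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.queue = rng.standard_normal((size, dim))
        self.queue /= np.linalg.norm(self.queue, axis=1, keepdims=True)
        self.ptr = 0
        self.size = size

    def enqueue(self, batch: np.ndarray) -> None:
        # Normalize, then overwrite the oldest entries (circular buffer).
        batch = batch / np.linalg.norm(batch, axis=1, keepdims=True)
        idx = (self.ptr + np.arange(len(batch))) % self.size
        self.queue[idx] = batch
        self.ptr = (self.ptr + len(batch)) % self.size

def info_nce(anchors: np.ndarray, positives: np.ndarray,
             bank: NegativeQueue, temp: float = 0.1) -> float:
    """InfoNCE-style loss: positives come from the batch, negatives from the queue."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    pos = np.sum(a * p, axis=1, keepdims=True)   # (B, 1) positive similarities
    neg = a @ bank.queue.T                       # (B, K) similarities vs. queue
    logits = np.concatenate([pos, neg], axis=1) / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())
```

In a training loop, the projected denoiser features of each batch would be scored against the queue, then enqueued, so the pool of negatives grows stale only slowly while the contrastive cost stays linear in the batch size.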