🤖 AI Summary
Diffusion models achieve high-fidelity generation but suffer from computationally intensive iterative denoising; moreover, existing retrieval-augmented approaches rely on external repositories and static vision-language models, incurring substantial storage/retrieval overhead and poor training adaptability. To address these limitations, we propose the Prototype Diffusion Model (PDM), the first framework to embed *dynamic visual prototype learning* directly into the diffusion process. PDM constructs a compact, semantically coherent set of prototypes via contrastive learning over clean-image latent features, and aligns noisy representations with relevant prototypes during denoising, eliminating the need for external retrieval. By discarding static models and auxiliary storage, PDM significantly reduces computational and memory costs while enhancing training adaptability and scalability. Experiments demonstrate that PDM maintains generation quality competitive with state-of-the-art methods, offering a more efficient, lightweight, and end-to-end alternative to retrieval-augmented diffusion generation.
📝 Abstract
Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains. However, they remain computationally intensive, particularly during the iterative denoising process. Latent-space models such as Stable Diffusion alleviate some of this cost by operating on compressed representations, though at the expense of fine-grained detail. More recent approaches such as Retrieval-Augmented Diffusion Models (RDM) improve efficiency by conditioning denoising on similar examples retrieved from large external memory banks. While effective, these methods introduce drawbacks: they require costly storage and retrieval infrastructure, depend on static vision-language models such as CLIP for similarity, and lack adaptability during training. We propose the Prototype Diffusion Model (PDM), a method that integrates prototype learning directly into the diffusion process for efficient and adaptive visual conditioning, without external memory. Instead of retrieving reference samples, PDM constructs a dynamic set of compact visual prototypes from clean image features using contrastive learning. These prototypes guide the denoising steps by aligning noisy representations with semantically relevant visual patterns, enabling efficient generation with strong semantic grounding. Experiments show that PDM maintains high generation quality while reducing computational and storage overhead, offering a scalable alternative to retrieval-based conditioning in diffusion models.
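To make the conditioning idea concrete, the two mechanisms described above, contrastive prototype construction from clean features and prototype alignment during denoising, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class and function names (`PrototypeBank`, `align`), the EMA update, and all hyperparameters are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

class PrototypeBank:
    """Hypothetical sketch of a dynamic prototype bank: K prototypes in a
    d-dimensional latent space, updated toward the clean-image features
    softly assigned to them (EMA update is our assumption)."""

    def __init__(self, num_prototypes=8, dim=16, momentum=0.99, temperature=0.1):
        self.P = l2_normalize(rng.standard_normal((num_prototypes, dim)))
        self.m = momentum
        self.tau = temperature

    def assign(self, z):
        """Soft-assign features z (N, d) to prototypes by cosine similarity."""
        logits = l2_normalize(z) @ self.P.T / self.tau        # (N, K)
        logits -= logits.max(axis=1, keepdims=True)           # numerical stability
        w = np.exp(logits)
        return w / w.sum(axis=1, keepdims=True)

    def contrastive_loss(self, z):
        """InfoNCE-style loss: pull each clean feature toward its nearest
        prototype and away from the others."""
        a = self.assign(z)
        targets = a.argmax(axis=1)
        return -np.mean(np.log(a[np.arange(len(z)), targets] + 1e-8))

    def update(self, z):
        """Move each prototype toward the weighted mean of its assigned
        clean features, keeping the bank compact and on the unit sphere."""
        a = self.assign(z)                                    # (N, K)
        mass = a.sum(axis=0, keepdims=True).T + 1e-8          # (K, 1)
        centroids = (a.T @ l2_normalize(z)) / mass            # (K, d)
        self.P = l2_normalize(self.m * self.P + (1 - self.m) * centroids)

def align(noisy_h, bank):
    """Denoising-time guidance: augment each noisy representation with an
    attention-weighted readout over the prototype bank."""
    attn = bank.assign(noisy_h)                               # (N, K)
    return noisy_h + attn @ bank.P                            # residual readout
```

A training step would then compute `contrastive_loss` on clean latent features, call `update` to refresh the bank, and apply `align` inside the denoiser, replacing the external retrieval-and-storage loop of RDM with a small learned matrix.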