🤖 AI Summary
This work addresses the lack of compositional structure in the conditional representations of diffusion models, which hinders generalization to out-of-distribution (OOD) samples. The authors propose Discrete Latent Codes (DLC), a self-supervised discrete image representation derived from Simplicial Embeddings. DLC encodes images as semantically compositional sequences of discrete tokens, achieving high-fidelity reconstruction while significantly improving generation efficiency. Its core contribution is an explicitly compositional discrete image representation, enabling high-quality unconditional synthesis, generation of novel OOD images, and efficient text-to-image generation when integrated with large language models. On ImageNet, DLC achieves state-of-the-art performance in unconditional image generation. Empirical results validate its suitability for cross-distribution generation and its effectiveness for text-guided synthesis.
📝 Abstract
We argue that the success of diffusion models in modeling complex distributions stems, for the most part, from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that an ideal representation should improve sample fidelity, be easy to generate, and be compositional so as to allow generation of samples beyond the training distribution. We introduce Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to standard continuous image embeddings. They are easy to generate, and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs achieve improved generation fidelity, establishing a new state-of-the-art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. We efficiently finetune a text diffusion language model to generate DLCs that produce novel samples outside of the image generator's training distribution.
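To make the core idea concrete, here is a minimal sketch (not the authors' code; the group count `L`, vocabulary size `V`, and function names are illustrative assumptions) of how a discrete token sequence can be read off a Simplicial Embedding: the encoder feature is split into `L` groups, each softmax-normalized onto a simplex, and the argmax entry of each group yields one discrete token.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax, projecting each group onto a simplex."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def discrete_latent_code(features, L=8, V=16):
    """Hypothetical DLC extraction: map a flat feature vector of size L*V
    to a sequence of L discrete tokens, each from a vocabulary of size V."""
    groups = features.reshape(L, V)        # split feature into L groups
    simplicial = softmax(groups, axis=-1)  # each row now sums to 1 (a simplex)
    return simplicial.argmax(axis=-1)      # one token index per group

# Toy usage with random encoder features
rng = np.random.default_rng(0)
feats = rng.normal(size=8 * 16)
dlc = discrete_latent_code(feats)
print(dlc.shape)  # (8,) -- a length-L sequence of token indices in [0, V)
```

In the paper's setting, such token sequences would condition the diffusion model in place of continuous embeddings; because each position is a discrete choice, sequences can be recombined to compose semantics across images.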