D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens

📅 2025-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In image generation, discrete-token autoregressive models scale well but yield low fidelity, whereas continuous-token diffusion models achieve high quality at the cost of computational inefficiency; existing hybrid approaches fail to exploit the synergistic potential of jointly modeling discrete and continuous tokens. This paper proposes D2C, a two-stage discrete-to-continuous framework: first, a lightweight discrete generator produces coarse-grained tokens; second, a continuous decoder conditions on these tokens to synthesize fine-grained continuous representations, thereby unifying the flexibility of discrete modeling with the high-fidelity expressiveness of continuous representations. Two kinds of fusion modules enable seamless interaction between discrete and continuous tokens. On ImageNet-256 class-conditional generation, D2C outperforms several discrete- and continuous-token baselines, achieving a favorable trade-off between sample fidelity and inference efficiency.

📝 Abstract
In the domain of image generation, latent-based generative models occupy a dominant position; however, these models rely heavily on image tokenizers. Autoregressive models, valued for their scalability and flexibility, adopt discrete-valued tokenizers but suffer from poor image generation quality. In contrast, diffusion models exploit continuous-valued tokenizers to achieve better generation quality, but at the cost of low efficiency and high complexity. Existing hybrid models mainly aim to compensate for information loss or to simplify the diffusion learning process; the potential of merging discrete-valued and continuous-valued tokens for image generation remains unexplored. In this paper, we propose D2C, a novel two-stage method that enhances model generation capacity. In the first stage, discrete-valued tokens representing coarse-grained image features are sampled by a small discrete-valued generator. In the second stage, continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence. In addition, we design two kinds of fusion modules for seamless interaction between the two token types. On the ImageNet-256 benchmark, extensive experimental results validate that our model achieves superior performance compared with several continuous-valued and discrete-valued generative models on class-conditional image generation.
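
The abstract describes the pipeline only at a high level; the toy sketch below illustrates the two-stage discrete-to-continuous flow it outlines. All names, sizes, and the stand-in samplers (ToyDiscreteGenerator, ToyContinuousDecoder) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal, runnable sketch of the two-stage discrete-to-continuous idea.
# Everything here (sizes, module names, the toy samplers) is an assumption.
VOCAB, SEQ_LEN, DIM, NUM_CLASSES = 1024, 256, 16, 1000

class ToyDiscreteGenerator(nn.Module):
    """Stage 1 stand-in: a small generator that samples coarse-grained
    discrete tokens (codebook indices) conditioned on a class label."""
    def __init__(self):
        super().__init__()
        self.class_emb = nn.Embedding(NUM_CLASSES, VOCAB)

    @torch.no_grad()
    def sample(self, labels: torch.Tensor) -> torch.Tensor:
        probs = self.class_emb(labels).softmax(-1)                  # (B, VOCAB)
        return torch.multinomial(probs, SEQ_LEN, replacement=True)  # (B, SEQ_LEN)

class ToyContinuousDecoder(nn.Module):
    """Stage 2 stand-in: produces fine-grained continuous tokens conditioned
    on the discrete sequence (the paper uses fusion modules here; we just
    embed and project as a placeholder)."""
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB, DIM)
        self.refine = nn.Linear(DIM, DIM)

    @torch.no_grad()
    def sample(self, discrete_tokens: torch.Tensor) -> torch.Tensor:
        coarse = self.token_emb(discrete_tokens)   # (B, SEQ_LEN, DIM)
        return self.refine(coarse)                 # continuous tokens

labels = torch.tensor([3, 7])                      # two class labels
tokens = ToyDiscreteGenerator().sample(labels)     # coarse discrete tokens
latents = ToyContinuousDecoder().sample(tokens)    # fine continuous tokens
print(tokens.shape, latents.shape)                 # (2, 256) and (2, 256, 16)
```

In the actual method, stage 2 would be a learned generative model (and a pretrained continuous-valued tokenizer's decoder would map the continuous tokens back to pixels); the sketch only fixes the data flow between the two stages.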
Problem

Research questions and friction points this paper is trying to address.

Improving image generation quality in autoregressive models with discrete tokens
Combining discrete and continuous tokens for enhanced image generation
Addressing efficiency and complexity issues in diffusion-based image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage method combining discrete and continuous tokens
Small discrete generator for coarse-grained features
Fusion modules for seamless token interaction (see the sketch after this list)
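
The abstract says two kinds of fusion modules are designed but this listing does not detail them; the sketch below shows two common fusion patterns (additive and cross-attention) as plausible, hypothetical stand-ins for how discrete context could be injected into continuous tokens.

```python
import torch
import torch.nn as nn

# Hypothetical fusion-module sketches; the paper's actual designs may differ.

class AdditiveFusion(nn.Module):
    """Embed the discrete tokens, project, and add to the continuous tokens."""
    def __init__(self, vocab: int, dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, cont: torch.Tensor, disc: torch.Tensor) -> torch.Tensor:
        return cont + self.proj(self.emb(disc))    # (B, N, D)

class CrossAttentionFusion(nn.Module):
    """Let continuous tokens attend to the embedded discrete sequence."""
    def __init__(self, vocab: int, dim: int, heads: int = 4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cont: torch.Tensor, disc: torch.Tensor) -> torch.Tensor:
        ctx = self.emb(disc)                       # (B, M, D)
        out, _ = self.attn(query=cont, key=ctx, value=ctx)
        return cont + out                          # residual fusion

cont = torch.randn(2, 256, 16)                     # continuous tokens
disc = torch.randint(0, 1024, (2, 256))            # discrete tokens
print(AdditiveFusion(1024, 16)(cont, disc).shape)        # (2, 256, 16)
print(CrossAttentionFusion(1024, 16)(cont, disc).shape)  # (2, 256, 16)
```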
Panpan Wang
Pattern Recognition Center, WeChat AI, Tencent Inc, China
Liqiang Niu
WeChat AI, Tencent
natural language processing, machine learning, deep learning
Fandong Meng
WeChat AI, Tencent
Machine Translation, Natural Language Processing
Jinan Xu
Professor of School of Computer and Information Technology, Beijing Jiaotong University
NLP, Machine Translation, LLM
Yufeng Chen
Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University, Beijing, China
Jie Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc, China