D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens

📅 2025-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In image generation, discrete-token autoregressive models scale well but yield low fidelity, whereas continuous-token diffusion models achieve high quality at the cost of computational inefficiency; existing hybrid approaches fail to exploit the synergistic potential of jointly modeling discrete and continuous tokens. This paper proposes D2C, a two-stage discrete-to-continuous framework: first, a lightweight discrete generator produces coarse-grained tokens; second, a continuous decoder conditions on these tokens to synthesize fine-grained continuous representations, thereby unifying the flexibility of discrete modeling with the high-fidelity expressiveness of continuous representations. Two kinds of fusion modules enable seamless interaction between discrete and continuous tokens. On ImageNet-256 class-conditional generation, D2C outperforms several discrete- and continuous-token baselines, achieving a favorable trade-off between sample fidelity and inference efficiency.

📝 Abstract
In the domain of image generation, latent-based generative models occupy a dominant position; however, these models rely heavily on image tokenizers. Autoregressive models, valued for their scalability and flexibility, adopt discrete-valued tokenizers but suffer from poor image generation quality. In contrast, diffusion models exploit continuous-valued tokenizers to achieve better generation quality, but at the cost of low efficiency and high complexity. Existing hybrid models mainly aim to compensate for information loss or to simplify the diffusion learning process; the potential of merging discrete-valued and continuous-valued tokens for image generation remains unexplored. In this paper, we propose D2C, a novel two-stage method that enhances model generation capacity. In the first stage, discrete-valued tokens representing coarse-grained image features are sampled by a small discrete-valued generator. In the second stage, continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence. In addition, we design two kinds of fusion modules for seamless interaction between the two token types. On the ImageNet-256 benchmark, extensive experimental results validate that our model achieves superior performance compared with several continuous-valued and discrete-valued generative models on class-conditional image generation.
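
The abstract describes the pipeline only at a high level; the toy sketch below illustrates the two-stage discrete-to-continuous flow it outlines. All names, sizes, and the stand-in samplers (ToyDiscreteGenerator, ToyContinuousDecoder) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal, runnable sketch of the two-stage discrete-to-continuous idea.
# Everything here (sizes, module names, the toy samplers) is an assumption.
VOCAB, SEQ_LEN, DIM, NUM_CLASSES = 1024, 256, 16, 1000

class ToyDiscreteGenerator(nn.Module):
    """Stage 1 stand-in: a small generator that samples coarse-grained
    discrete tokens (codebook indices) conditioned on a class label."""
    def __init__(self):
        super().__init__()
        self.class_emb = nn.Embedding(NUM_CLASSES, VOCAB)

    @torch.no_grad()
    def sample(self, labels: torch.Tensor) -> torch.Tensor:
        probs = self.class_emb(labels).softmax(-1)                  # (B, VOCAB)
        return torch.multinomial(probs, SEQ_LEN, replacement=True)  # (B, SEQ_LEN)

class ToyContinuousDecoder(nn.Module):
    """Stage 2 stand-in: produces fine-grained continuous tokens conditioned
    on the discrete sequence (the paper uses fusion modules here; we just
    embed and project as a placeholder)."""
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB, DIM)
        self.refine = nn.Linear(DIM, DIM)

    @torch.no_grad()
    def sample(self, discrete_tokens: torch.Tensor) -> torch.Tensor:
        coarse = self.token_emb(discrete_tokens)   # (B, SEQ_LEN, DIM)
        return self.refine(coarse)                 # continuous tokens

labels = torch.tensor([3, 7])                      # two class labels
tokens = ToyDiscreteGenerator().sample(labels)     # coarse discrete tokens
latents = ToyContinuousDecoder().sample(tokens)    # fine continuous tokens
print(tokens.shape, latents.shape)                 # (2, 256) and (2, 256, 16)
```

In the actual method, stage 2 would be a learned generative model (and a pretrained continuous-valued tokenizer's decoder would map the continuous tokens back to pixels); the sketch only fixes the data flow between the two stages.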
Problem

Research questions and friction points this paper is trying to address.

Improving image generation quality in autoregressive models with discrete tokens
Combining discrete and continuous tokens for enhanced image generation
Addressing efficiency and complexity issues in diffusion-based image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage method combining discrete and continuous tokens
Small discrete generator for coarse-grained features
Fusion modules for seamless token interaction (see the sketch after this list)
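
The abstract says two kinds of fusion modules are designed but this listing does not detail them; the sketch below shows two common fusion patterns (additive and cross-attention) as plausible, hypothetical stand-ins for how discrete context could be injected into continuous tokens.

```python
import torch
import torch.nn as nn

# Hypothetical fusion-module sketches; the paper's actual designs may differ.

class AdditiveFusion(nn.Module):
    """Embed the discrete tokens, project, and add to the continuous tokens."""
    def __init__(self, vocab: int, dim: int):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, cont: torch.Tensor, disc: torch.Tensor) -> torch.Tensor:
        return cont + self.proj(self.emb(disc))    # (B, N, D)

class CrossAttentionFusion(nn.Module):
    """Let continuous tokens attend to the embedded discrete sequence."""
    def __init__(self, vocab: int, dim: int, heads: int = 4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cont: torch.Tensor, disc: torch.Tensor) -> torch.Tensor:
        ctx = self.emb(disc)                       # (B, M, D)
        out, _ = self.attn(query=cont, key=ctx, value=ctx)
        return cont + out                          # residual fusion

cont = torch.randn(2, 256, 16)                     # continuous tokens
disc = torch.randint(0, 1024, (2, 256))            # discrete tokens
print(AdditiveFusion(1024, 16)(cont, disc).shape)        # (2, 256, 16)
print(CrossAttentionFusion(1024, 16)(cont, disc).shape)  # (2, 256, 16)
```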
Panpan Wang
Pattern Recognition Center, WeChat AI, Tencent Inc, China
Liqiang Niu
WeChat AI, Tencent
natural language processing, machine learning, deep learning
Fandong Meng
WeChat AI, Tencent
Machine Translation, Natural Language Processing
Jinan Xu
Professor of School of Computer and Information Technology, Beijing Jiaotong University
NLP, Machine Translation, LLM
Yufeng Chen
Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University, Beijing, China
Jie Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc, China