Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

📅 2024-12-13
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the sampling-speed degradation caused by increasing quantization depth in vector-quantized generative models, this paper proposes ResGen, a computationally efficient discrete diffusion framework built on residual vector quantization (RVQ). Its core innovation is collective token embedding prediction: instead of autoregressively generating RVQ codebook indices layer by layer, ResGen jointly predicts the vector embeddings of tokens across multiple RVQ levels within a single denoising step. Token masking and multi-token joint prediction are further formulated within a principled probabilistic framework combining discrete diffusion and variational inference. This design simultaneously improves RVQ depth scalability, generation quality, and sampling efficiency. Experiments show that ResGen outperforms autoregressive baselines on both conditional ImageNet 256×256 image generation and zero-shot text-to-speech synthesis, maintaining high fidelity without compromising sampling speed, and that as RVQ depth increases it consistently surpasses similarly sized baselines.
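To make the RVQ mechanism concrete, here is a minimal numpy sketch of residual vector quantization with random (untrained) codebooks. Everything here (codebook sizes, the reserved zero entry) is an illustrative assumption, not the paper's implementation; index 0 of each codebook is set to the zero vector purely so that adding a level can never increase the residual error.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, dim = 4, 8, 16                       # RVQ depth, codebook size, embedding dim
codebooks = rng.normal(size=(D, K, dim))   # one codebook per level (untrained, for illustration)
codebooks[:, 0] = 0.0                      # reserve index 0 as a zero vector: a level may "pass"

def rvq_encode(x):
    """Quantize x with D residual levels; return per-level indices and the reconstruction."""
    recon = np.zeros_like(x)
    indices = []
    for d in range(D):
        residual = x - recon               # what the previous levels failed to capture
        dists = np.linalg.norm(codebooks[d] - residual, axis=1)
        idx = int(np.argmin(dists))        # nearest codebook entry for the current residual
        indices.append(idx)
        recon += codebooks[d][idx]         # each level refines the running reconstruction
    return indices, recon

x = rng.normal(size=dim)
idx, recon = rvq_encode(x)
err_level0 = np.linalg.norm(x - codebooks[0][idx[0]])  # error using only the first level
err_full = np.linalg.norm(x - recon)                   # error after all D levels
```

The key property this illustrates is the one the paper exploits: deeper RVQ stacks yield higher fidelity, at the cost of more tokens per sample. ResGen's contribution is predicting the (cumulative) embeddings of all D levels jointly rather than emitting the D indices one by one.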

📝 Abstract
We explore the use of Residual Vector Quantization (RVQ) for high-fidelity generation in vector-quantized generative models. This quantization technique maintains higher data fidelity by employing more in-depth tokens. However, increasing the token number in generative models leads to slower inference speeds. To this end, we introduce ResGen, an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed. Our key idea is a direct prediction of vector embedding of collective tokens rather than individual ones. Moreover, we demonstrate that our proposed token masking and multi-token prediction method can be formulated within a principled probabilistic framework using a discrete diffusion process and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation on ImageNet 256x256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models. The project page can be found at https://resgen-genai.github.io
Problem

Research questions and friction points this paper is trying to address.

Maintaining high-fidelity generation without sacrificing sampling speed
Keeping the number of inference steps independent of RVQ depth
Scaling RVQ depth for better fidelity or faster sampling than similarly sized baselines
Innovation

Methods, ideas, or system contributions that make the work stand out.

ResGen: an efficient discrete diffusion model over Residual Vector Quantization (RVQ) tokens
Direct prediction of collective token embeddings across RVQ levels, rather than individual tokens
Token masking and multi-token prediction grounded in discrete diffusion and variational inference
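The masked discrete diffusion idea behind the last two bullets can be sketched with a toy confidence-based unmasking loop (a MaskGIT-style cosine schedule). The "model" below is random noise, and the schedule, step count, and vocabulary are illustrative assumptions only; the point is that all masked tokens are predicted jointly per step, so the number of denoising steps is fixed regardless of how many tokens there are.

```python
import numpy as np

rng = np.random.default_rng(1)
N, steps, MASK = 16, 4, -1            # tokens per sample, denoising steps, mask symbol
tokens = np.full(N, MASK)             # start fully masked

for s in range(1, steps + 1):
    # cosine schedule: fraction of tokens still masked after this step
    keep_masked = int(np.floor(N * np.cos(np.pi / 2 * s / steps)))
    # toy "model": jointly propose a token and a confidence for every position
    proposals = rng.integers(0, 8, size=N)
    confidence = rng.random(N)
    confidence[tokens != MASK] = np.inf            # committed tokens stay fixed
    # re-mask the least confident positions, commit the rest
    order = np.argsort(confidence)
    remask = np.isin(np.arange(N), order[:keep_masked])
    tokens = np.where(remask, MASK,
                      np.where(tokens == MASK, proposals, tokens))
```

After the final step the schedule reaches zero, so every position is committed. In ResGen the per-step prediction targets are cumulative RVQ embeddings across levels rather than single-codebook indices, which is what decouples step count from RVQ depth.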