Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Autoregressive text-to-image models suffer from slow inference because tokens are decoded sequentially, often requiring thousands of forward passes per image. To address this, the paper proposes a parallel denoising decoding framework in the embedding space: denoising is modeled as a Jacobi iteration, so multiple clean tokens can be predicted simultaneously from a noise-initialized sequence via a next-clean-token prediction paradigm and a denoising-guided iterative trajectory. The method requires only lightweight fine-tuning of pretrained models and uses probabilistic verification with progressive refinement to keep generation stable. Experiments show a substantial reduction in forward-pass count and a significant inference speedup while preserving visual fidelity. The core contribution is the first principled integration of denoising with Jacobi-style parallel iteration, enabling efficient, stable, and transferable parallel decoding for autoregressive models.

📝 Abstract
As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token-prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens for the next iteration with the denoising trajectory. Experiments show that our method can accelerate generation by reducing model forward passes while maintaining the visual quality of generated images.
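The Jacobi-style decoding the abstract describes can be illustrated with a toy sketch: instead of decoding one token per forward pass, a noise-initialized draft sequence is re-predicted in parallel every sweep until it reaches a fixed point. The `next_token` function below is a stand-in for the autoregressive model, and integer tokens stand in for the paper's noise-perturbed embeddings; none of this reproduces the actual SJD2 model.

```python
import random

def next_token(prefix):
    # Toy deterministic "model": the next token is the prefix sum mod 7.
    # A real model would be one batched transformer forward pass.
    return sum(prefix) % 7

def jacobi_decode(prompt, n_new, seed=0):
    rng = random.Random(seed)
    # Initialize the draft randomly (the paper initializes from Gaussian
    # noise in embedding space).
    draft = [rng.randrange(7) for _ in range(n_new)]
    sweeps = 0
    while True:
        sweeps += 1
        # One parallel sweep: every position is re-predicted from the
        # current draft, as if by a single batched forward pass.
        new = [next_token(prompt + draft[:i]) for i in range(n_new)]
        if new == draft:  # fixed point reached: draft matches sequential decoding
            return draft, sweeps
        draft = new

tokens, sweeps = jacobi_decode([1, 2, 3], n_new=8)
```

Because position `i` becomes correct once positions `0..i-1` are, the loop converges in at most `n_new + 1` sweeps, and typically far fewer — which is the source of the speedup over `n_new` sequential passes.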
Problem

Research questions and friction points this paper is trying to address.

Slow inference caused by sequential token-by-token decoding in autoregressive image generation
Thousands of model forward passes needed to generate a single image
Enabling parallel token prediction without sacrificing visual quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates denoising into Jacobi iterations for parallel generation
Uses next-clean-token prediction with noise-perturbed embeddings
Verifies and accepts multiple tokens in parallel during inference
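The parallel verification step can be sketched in the style of speculative-sampling acceptance; the exact criterion used by SJD2 is not given here, so the ratio test below (`u < p(x)/q(x)`, where `q` drafted the token and `p` is the model's current distribution) and the function name `accept_draft` are assumptions for illustration.

```python
import random

def accept_draft(draft_probs, verify_probs, draft_tokens, rng):
    """Accept a prefix of draft tokens via a probabilistic ratio test.

    draft_probs / verify_probs: per-position dicts mapping token -> probability
    under the drafting and verifying distributions (assumed interface).
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, draft_probs, verify_probs):
        # Accept token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # First rejection stops acceptance; remaining positions are
            # refined along the denoising trajectory in the next iteration.
            break
    return accepted
```

When the verifying distribution agrees with the draft, every token is accepted in one pass; disagreement truncates acceptance, so quality is preserved at the cost of extra iterations.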