SJD++: Improved Speculative Jacobi Decoding for Training-free Acceleration of Discrete Auto-regressive Text-to-Image Generation

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large autoregressive text-to-image models suffer from high inference latency and low throughput because they rely on hundreds to thousands of sequential sampling steps. This paper proposes SJD++, a training-free parallel decoding algorithm that integrates Jacobi-style multi-token prediction with speculative-sampling verification: in each forward pass, the base model drafts a sequence of candidate tokens and probabilistically verifies them, and after each verification phase it reuses high-confidence draft tokens instead of resampling them all, eliminating redundant computation. SJD++ preserves image fidelity while achieving 2–3× end-to-end latency reduction and 2–7× step compression, validated on several representative autoregressive text-to-image generation models. The method establishes an efficient, general-purpose, and training-free acceleration paradigm for high-resolution text-to-image generation.

📝 Abstract
Large autoregressive models can generate high-quality, high-resolution images but suffer from slow generation speed, because these models require hundreds to thousands of sequential forward passes for next-token prediction during inference. To accelerate autoregressive text-to-image generation, we propose Speculative Jacobi Decoding++ (SJD++), a training-free probabilistic parallel decoding algorithm. Unlike traditional next-token prediction, SJD++ performs multi-token prediction in each forward pass, drastically reducing generation steps. Specifically, it integrates the iterative multi-token prediction mechanism from Jacobi decoding with the probabilistic drafting-and-verification mechanism from speculative sampling. More importantly, for further acceleration, SJD++ reuses high-confidence draft tokens after each verification phase instead of resampling them all. We conduct extensive experiments on several representative autoregressive text-to-image generation models and demonstrate that SJD++ achieves $2\times$ to $3\times$ inference latency reduction and $2\times$ to $7\times$ step compression, while preserving visual quality with no observable degradation.
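The drafting-and-verification mechanism borrowed from speculative sampling can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the function name, argument layout, and probability inputs are assumptions:

```python
import random

def verify_draft(draft_tokens, p_draft, p_target, rng=random.random):
    """Speculative-sampling verification (illustrative sketch).

    Each draft token is accepted with probability min(1, p_target / p_draft),
    as in standard speculative sampling; verification stops at the first
    rejection. `p_draft[i]` and `p_target[i]` are the drafting and target
    probabilities assigned to `draft_tokens[i]` (assumed inputs, not the
    paper's API).
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, p_draft, p_target):
        if rng() < min(1.0, p / q):
            accepted.append(tok)  # token survives verification
        else:
            break  # first rejection ends the accepted prefix
    return accepted
```

In SJD++, both the draft and target probabilities come from parallel forward passes of the same base model, so no auxiliary draft model is required.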
Problem

Research questions and friction points this paper is trying to address.

Accelerates slow autoregressive text-to-image generation
Reduces sequential inference steps via parallel multi-token prediction
Maintains image quality while compressing generation latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free probabilistic parallel decoding algorithm
Integrates Jacobi decoding with speculative sampling mechanisms
Reuses high-confidence draft tokens to accelerate generation
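The token-reuse idea in the last bullet can be sketched as a hypothetical helper; the function name and the confidence threshold are assumptions, not values from the paper:

```python
def build_next_draft(draft, confidences, resample, threshold=0.9):
    """Form the next Jacobi draft after verification (illustrative sketch).

    High-confidence draft tokens are kept in place for the next iteration;
    only low-confidence positions are resampled via `resample()`.
    The 0.9 threshold is an assumed placeholder value.
    """
    return [tok if conf >= threshold else resample()
            for tok, conf in zip(draft, confidences)]
```

Keeping confident tokens in place avoids discarding work already done in earlier Jacobi iterations, which is the source of the extra step compression over plain speculative Jacobi decoding.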